cs712: interlingua based mt - iit bombay interlingua based mt ... (marathi-hindi-english: case...
TRANSCRIPT
CS712 Interlingua Based MT
Pushpak BhattacharyyaCSE Department
IIT Bombay28113
1
Empiricism vs Rationalism
bull Ken Church ldquoA Pendulum Swung too Farrdquo LILT 2011ndash Availability of huge amount of data what to do
with itndash 1950s Empiricism (Shannon Skinner Firth
Harris)ndash 1970s Rationalism (Chomsky Minsky)ndash 1990s Empiricism (IBM Speech Group AT amp
T)ndash 2010s Return of Rationalism
Introductionbull Machine Translation (MT) is a technique to translate
texts from one natural language to another natural language using a machine
bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target
languagebull Translation between distant languages is a difficult
taskndash Handling Language Divergence is a major
challenge
3
Kinds of MT Systems(point of entry from source to the target text)
Deep understanding level
Interlingual level
Logico-semantic level
Syntactico-functional level
Morpho-syntactic level
Syntagmatic level
Graphemic level Direct translation
Syntactic transfer (surface )
Syntactic transfer (deep)
Conceptual transfer
Semantic transfer
Multilevel transfer
Ontological interlingua
Semantico-linguistic interlingua
SPA-structures (semanticamp predicate-argument)
F-structures (functional)
C-structures (constituent)
Tagged text
Text
Mixing levels Multilevel description
Semi-direct translation
4
Why is MT difficult Language Divergence
bull One of the main complexities of MT Language Divergence
bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence
5
English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)
Language Divergence Theory Lexico-Semantic Divergences
bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan
bull Structural divergencendash E SVO H SOV
bull Categorial divergencendash Change is in POS category (many examples
discussed)
6
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Empiricism vs Rationalism
bull Ken Church ldquoA Pendulum Swung too Farrdquo LILT 2011ndash Availability of huge amount of data what to do
with itndash 1950s Empiricism (Shannon Skinner Firth
Harris)ndash 1970s Rationalism (Chomsky Minsky)ndash 1990s Empiricism (IBM Speech Group AT amp
T)ndash 2010s Return of Rationalism
Introductionbull Machine Translation (MT) is a technique to translate
texts from one natural language to another natural language using a machine
bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target
languagebull Translation between distant languages is a difficult
taskndash Handling Language Divergence is a major
challenge
3
Kinds of MT Systems(point of entry from source to the target text)
Deep understanding level
Interlingual level
Logico-semantic level
Syntactico-functional level
Morpho-syntactic level
Syntagmatic level
Graphemic level Direct translation
Syntactic transfer (surface )
Syntactic transfer (deep)
Conceptual transfer
Semantic transfer
Multilevel transfer
Ontological interlingua
Semantico-linguistic interlingua
SPA-structures (semanticamp predicate-argument)
F-structures (functional)
C-structures (constituent)
Tagged text
Text
Mixing levels Multilevel description
Semi-direct translation
4
Why is MT difficult Language Divergence
bull One of the main complexities of MT Language Divergence
bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence
5
English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)
Language Divergence Theory Lexico-Semantic Divergences
bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan
bull Structural divergencendash E SVO H SOV
bull Categorial divergencendash Change is in POS category (many examples
discussed)
6
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Introductionbull Machine Translation (MT) is a technique to translate
texts from one natural language to another natural language using a machine
bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target
languagebull Translation between distant languages is a difficult
taskndash Handling Language Divergence is a major
challenge
3
Kinds of MT Systems(point of entry from source to the target text)
Deep understanding level
Interlingual level
Logico-semantic level
Syntactico-functional level
Morpho-syntactic level
Syntagmatic level
Graphemic level Direct translation
Syntactic transfer (surface )
Syntactic transfer (deep)
Conceptual transfer
Semantic transfer
Multilevel transfer
Ontological interlingua
Semantico-linguistic interlingua
SPA-structures (semanticamp predicate-argument)
F-structures (functional)
C-structures (constituent)
Tagged text
Text
Mixing levels Multilevel description
Semi-direct translation
4
Why is MT difficult Language Divergence
bull One of the main complexities of MT Language Divergence
bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence
5
English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)
Language Divergence Theory Lexico-Semantic Divergences
bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan
bull Structural divergencendash E SVO H SOV
bull Categorial divergencendash Change is in POS category (many examples
discussed)
6
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Kinds of MT Systems(point of entry from source to the target text)
Deep understanding level
Interlingual level
Logico-semantic level
Syntactico-functional level
Morpho-syntactic level
Syntagmatic level
Graphemic level Direct translation
Syntactic transfer (surface )
Syntactic transfer (deep)
Conceptual transfer
Semantic transfer
Multilevel transfer
Ontological interlingua
Semantico-linguistic interlingua
SPA-structures (semanticamp predicate-argument)
F-structures (functional)
C-structures (constituent)
Tagged text
Text
Mixing levels Multilevel description
Semi-direct translation
4
Why is MT difficult Language Divergence
bull One of the main complexities of MT Language Divergence
bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence
5
English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)
Language Divergence Theory Lexico-Semantic Divergences
bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan
bull Structural divergencendash E SVO H SOV
bull Categorial divergencendash Change is in POS category (many examples
discussed)
6
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Why is MT difficult Language Divergence
bull One of the main complexities of MT Language Divergence
bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence
5
English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)
Language Divergence Theory Lexico-Semantic Divergences
bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan
bull Structural divergencendash E SVO H SOV
bull Categorial divergencendash Change is in POS category (many examples
discussed)
6
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language Divergence Theory Lexico-Semantic Divergences
bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan
bull Structural divergencendash E SVO H SOV
bull Categorial divergencendash Change is in POS category (many examples
discussed)
6
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language Divergence Theory Lexico-Semantic Divergences (cntd)
bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan
mantrii (India-of Prime Minister)bull Lexical divergence
ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon
7
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language Divergence Theory Syntactic Divergences
bull Constituent Order divergencendash E Singh the PM of India will address the nation
today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)
bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa
garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence
ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)
8
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language Divergence Theory Syntactic Divergences
bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)
bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain
happening is no translation of it)
9
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language Divergence exists even for close cousins
(Marathi-Hindi-English case marking and postpositions do not transfer easily)
bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)
bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei
bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)
10
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Interlingua Based MT
11
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
KBMT
bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989
bull Forerunner of many interlingua based MT efforts
12
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
IL based MT schematic
13
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Specification of KBMT
bull Source languages English and Japanese
bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A
distributed coarsely parallel systembull Subworld (domain) of translation
personal computer installation and maintenance manuals
14
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Knowledge base of KBMT (12)
bull An ontology (domain model) of about 1500 concepts
bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English
bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English
15
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Knowledge base of KBMT (22)
bull Analysis grammars for English and Japanese
bull Generation grammars for English and Japanese
bull Specialized syntax-semantics structural mapping rules
16
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Underlying formlisms
bull The knowledge representation system FrameKit
bull A language for representing domain models (a semantic extension of FRAMEKIT)
bull Specialized grammar formalisms based on Lexical-Functional Grammar
bull A specially constructed language for representing text meanings (the interlingua)
bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Procedural components
bull A syntactic parser with a semantic constraint interpreter
bull A semantic mapper for treating additional types of semantic constraints
bull An interactive augmentor for treating residual ambiguities
18
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Architecture of KBMT
19
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Digression language typology
20
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language and dialects
bull There are 6909 living languages (2009 census)
bull Dialects far outnumber the languagesbull Language varieties are called dialects
ndash if they have no standard or codified formndash if the speakers of the given language do not have a
state of their ownndash if they are rarely or never used in writing (outside
reported speech)ndash if they lack prestige with respect to some other often
standardised variety 22
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
ldquoLinguisticrdquo interlingua
bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary
Language Association)ndash ldquoEsperantordquondash ldquoIdordquo
23
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
24
The Lords Prayer in Esperanto Interlingua version Latin version English version
(traditional)
1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen
1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen
1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen
1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
ldquoInterlinguardquo and ldquoEsperantordquo
bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian
bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using
heavy agglutinationbull Esperanto word for hospitalldquo
malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)
25
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
26
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Language Universals vs Language Typology
bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other
27
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Typology basic word order
bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)
ldquoSubjects tend strongly to precede objectsrdquo
28
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Typology Morphotactics of singular and plural
bull No expression Japanese hito person pl hito
bull Function word Tagalog bato stone pl mga bato
bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto
bull Sound change English man pl men Arabic rajulun man pl rijalun
bull Reduplication Malay anak child pl anak-anak
29
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Typology Implication of word order
Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or
noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than
Tom)
30
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Typology motion verbs (13)
bull Motion is expressed differently in different languages and the differences turn out to be highly significant
bull There are two large typesndash verb-framed and ndash satellite-framed languages
31
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Typology motion verbs (23)
bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)
32
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Typology motion verbs (33)
bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element
bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly
bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST
33
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
A look at annotation
34
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)
bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators
bull Usually the information is added by many small individual decisions in many places throughout the data
bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier
35
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Definition etc (12)
bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of
informationndash chosen from a fixed set of options
36
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Example of annotation sense marking
एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह
(According to a new research those people who have a busy social life have larger space in a part of their brain)
नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक
यि त लोग क हत म काम करना चा हए ndash (English synset)
multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people
bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in
general considered as a whole he is a hero in the eyes of the publicrdquo
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Sense Marked corpora in Marathi
Snapshot of a Marathi sense tagged paragraph
39
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
A good annotator is rare to find
bull An annotator has to understand BOTH language phenomena and the data
bull An annotation designer has to understand BOTH linguistics and statistics
40
Linguistics and Language phenomena
Data and statistical phenomena
Annotator
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
The completeness chain
bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete
41
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Penn tagset (12)
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Penn tagset (22)
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Indian Language Tagset Noun
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Indian Language Tagset Verb Adjective Adverb
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Indian Language Tagset Particle
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Scale of effort involved in annotation Penn Treebank (12)
bull Wall Street Journal subcorpus of the PTB 2 release
bull 8 people working half time each about a year
bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version
47
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Scale of effort involved in annotation Penn Treebank (22)
bull The same group also ndash annotated most of the original Brown corpus with PTB
2 style treesndash 1 million words of the Switchboard corpus for total of
well over 3 million wordsndash POS tagged 3 million words of a
much larger WSJ corpusbull All in all the effort was over about 5 years and had
about 4-5 full time employeesbull Total effort 8 million words 20-25 man years
48
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Annotation effort Ontonotesbull Annotate 300K words per year
ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)
ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument
structure) and ndash shallow semantics (word sense linked to an ontology and
coreference)
bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project
bull Ontonotes annotation guidelines were frozen a priori 49
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Annotation effort Prague Discourse Treebank (12)
bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)
bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme
50
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Annotation effort Prague Discourse Treebank (22)
bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)
bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc
51
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Back to Interlingua
52
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
MT EnConverion + Deconversion
Interlingua(UNL)
English
French
Hindi
Chinese
generation
Analysis
53
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence
Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity
problems)bull Current active language groups
ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)
55
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
56
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Japanese
Hindi
Spanish
bull Language independent meaning representation
Marathi
Others
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
UNL represents knowledge John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
57
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Sentence embeddings
Deepa claimed that she had composed a poem[UNL]
agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete
poemindef)[UNL]
58
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
59
The Lexicon
Content words
[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt
[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt
[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt
Headword Universal Word Attributes
He forwarded the mail to the minister
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
60
The Lexicon (cntd)
function words
[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt
[the] ldquotherdquo (ARTTHE) ltE00gt
[to] ldquotordquo (PRETO) ltE00gt
Headword Universal Word
Attributes
He forwarded the mail to the minister
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
How to obtain UNL expresssions
bull UNL nerve center is the verbbull English sentences
ndash A ltverbgt B orndash A ltverbgt
61
verb
BA
A
verb
rel relrel
Recurse on A and B
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Enconversion
Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya
Gajanan RajitaMany publications
62
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
System Architecture
Stanford Dependency
Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple SentenceEnconverter
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
SimpleEnco
Simplifier
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Complexity of handling long sentences
Problem As the length (number of words)
increases in the sentence the XLE parser fails to capture the dependency relations precisely
The UNL enconversion system is tightly coupled with XLE parser
Resulting in fall of accuracy
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Solution
bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal
levels Simplificationndash Once broken they needs to be joined
Merging
bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute
generation are formed on Stanford Dependency relations
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Text Simplifierbull It simplifies a complex or compound sentence
bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school
bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
UNL Mergingbull Merger takes multiple UNLs and merges them to form one
UNL
bull Example-ndash Input
bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)
ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)
bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
NLP tools and resources for UNL generation
Tools
bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser
Resource
bull Wordnetbull Universal Word Dictionary(UW++)
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
System Processing UnitsSyntactic
Processing
bull NER
bull Stanford Dependency Parser
bull XLE Parser
Semantic Processing
bull Stems of wordsbull WSDbull Noun originating
from verbsbull Feature
Generationbull Noun featuresbull Verb features
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
SYNTACTIC PROCESSING
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Syntactic Processing NERbull NER tags the named entities in the sentence with
their types
bull Types coveredndash Personndash Placendash Organization
bull Input Will Smith was eating an apple with a fork on the bank of the river
bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Syntactic Processing Stanford Dependency Parser (12)
bull Stanford Dependency Parser parses the sentence and produces
ndash POS tags of words in the sentence
ndash Dependency parse of the sentence
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Syntactic Processing Stanford Dependency Parser (22)
Will-Smith was eating an apple with a fork on the bank of the river
Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN
POS tags
nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)
Dependency Relations
Input
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Syntactic Processing XLE Parser
bull XLE parser generates dependency relations and some other important information
lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)
Dependency Relations
PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)
ImportantInformation
Will-Smith was eating an apple with a fork on the bank of the riverInput
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
SEMANTIC PROCESSING
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Semantic Processing Finding stems
bull Wordnet is used for finding the stems of the words
Word Stemwas beeating eatapple applefork forkbank bankriver river
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Semantic Processing WSDWord Synset-id Sense-id
Will-Smith - -
was 02610777 be24203
eating 01170802 eat23400
an - -
apple 07755101 apple11300
with - -
a - -
fork 03388794 fork10600
on - -
the - -
bank 09236472 bank11701
of - -
the - -
river 09434308 river11700
WSD
Word POS
Will-Smith
Proper noun
was Verb
eating Verb
an Determiner
apple Noun
with Preposition
a Determiner
fork Noun
on Preposition
the Determiner
bank Noun
of Preposition
the Determiner
river Noun
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Semantic Processing Universal Word Generation
bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)
Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Semantic Processing Nouns originating from verbs
bull Some nouns are originated from verbs
bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2
Traverse the hypernym-path of the noun to its root
Any word on the path is ldquoactionrdquo then the noun has originated from a verb
Algorithm removal of garbage =
removing garbage Collection of stamps =
collecting stamps Wastage of food =
wasting food
Examples
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Features of UWsbull Each UNL relation requires some properties of their
UWs to be satisfied
bull We capture these properties as features
bull We classify the features into Noun and Verb Features
Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Semantic Processing Noun Features
bull Add wordNE_type to wordfeatures
bull Traverse the hypernym-path of the noun to its root
bull Any word on the path matches a keyword corresponding feature generated
Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE
Algorithm
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Types of verbsbull Verb features are based on the types of
verbs
bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or
just happen without any performerndash Be Verbs which tells about existence of
something or about some fact
bull We classify these three into do and not-do
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Semantic Processing Verb Features
bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not
bull ldquodordquo verbs are volitional verbs which are performed voluntarily
bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type
bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo
Word Type Featurewas not do -eating do do
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
GENERATION OF RELATIONS AND ATTRIBUTES
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Relation Generation (12)
bull Relations handled separately
bull Similar relations handled in cluster
bull Rule based approach
bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Relation Generation (22)
Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active
agt(eat Will Smith)
dobj(eating apple)eat is active
obj(eat apple)
prep(eating with) pobj(with fork)forkfeature=CONCRETE
ins(eat fork)
prep(eating on) pobj(on bank)bankfeature=PLACE
plc(eat bank)
prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON
mod(bank river)
Will-Smith was eating an apple with a fork on the bank of the riverInput
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Attribute Generation (12)
bull Attributes handled in clusters
bull Attributes are generated by rules
bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Attribute Generation (22)
Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)
eatentryprogresspast
det(apple an) appleindef
det(fork a) forkindef
det(bank the) bankdef
det(river the) riverdef
Will-Smith was eating an apple with a fork on the bank of the riverInput
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
EVALUATION OF OUR SYSTEM
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Resultsbull Evaluated on EOLSS corpus
ndash 7000 sentences from EOLSS corpus
bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford
parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser
bull Scenario3 lets us know the impact of Simplifier and Merger
bull Scenario4 lets us know the impact of XLE parser
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)
bull tgen = tgold if ndash relationgen = relationgold
ndash UW1gen = UW1gold
ndash UW2gen = UW2gold
bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold
bull count(UWx)= attributes in UWx
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Relation-wise accuracy
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Overall accuracy
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Relation
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Relation-wise Precision
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Relation-wise Recall
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Relation-wise F-score
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Attribute
0
005
01
015
02
025
03
035
04
045
05
Scenario1 Scenario2 Scenario3 Scenario4
Attribute Accuracy
Accuracy
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results Time-taken
bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus
0
5
10
15
20
25
30
35
40
Scenario1 Scenario2 Scenario3 Scenario4
Time taken
Time taken
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
ERROR ANALYSIS
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Wrong Relations Generated
0
5000
10000
15000
20000
25000
30000
35000
obj mod aoj man and agt plc qua pur gol tim or overall
False positive
False positive
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Missed Relations
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
obj mod aoj man and agt plc qua pur gol tim or overall
False negative
False negative
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Case Study (13)
bull Conjunctive relationsndash Wrong apposition detection
bull James loves eating red water-melon apples peeled nicely and bananas
amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Case Study (23)bull Conjunctive relations
ndash Non-uniform relation generationbull George the president of the football
club James the coach of the club and the players went to watch their rivals game
appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Case Study (33)bull Low Precision frequent relations
ndash mod and quabull More general rules
ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation
possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms
agt(represent latter)aoj(represent former)obj(dosage body)
Corpus Relations
aoj(represent latter)aoj(represent former)gol(dosage body)
Generated Relations
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Hindi Generation from Interlingua (UNL)
(Joint work with S Singh M Dalal V Vachhani Om Damani
MT Summit 2007)
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Step-through the Deconverter Module Output
UNL Expression
obj(contact(hellip
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
you
this
farmer
agtobj
pur
plc
contact
nam
orregion
khatav
manchar
taluka
nam01
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Lexeme Selection
[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)
Lexical Choice is unambiguous
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N
bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers
you आपthis यह
farmer कसान
agtobj contact सपक
pur
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Morphology Generation Nouns
Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique
The boy saw meलड़क न मझ दखा
Boys saw meलड़क न मझ दखा
The King saw meराजा न मझ दखा
Kings saw meराजाऒ न मझ दखा
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Verb Morphology
Suffix Tense Aspect Mood N Gen P VE
-e rahaa thaa
past progress - sg male 3rd e
-taa hai present custom - sg male 3rd -
-iyaa thaa past complete - sg male 3rd I
saktii hain present - ability pl female 3rd A
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
After Morphology Generation
you आपthis यह
farmer कसान
agtobj contact सपक
pur
you आपthis इस
farmer कसान
agtobj contact सपक क िजए
pur
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव
you आप
this इसकलए
farmer कसान को
agtobj contact सपक क िजए
pur
Rel Par+ Par- Chi+ Chi- ChFW
Obj V VINT NANIMT topic को
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Linearization
इस आप मचर This you manchar region
खटाव तालक कसान khatav taluka farmers
सपक contact
you
this
farmer
agtobj
pur
plc
Contact सपक क िजए
nam
orregion
khatav
manchar
taluka
nam
01
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Syntax Planning Assumptions
bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the
semantic properties of the relata
ndash Context Independence the rest of the expression
bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the
expression
this
contact
pur
you
this
farmer
agtobj
contact
pur
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
agt aoj obj ins Agt L L L Aoj L L Obj R Ins
you
this
farmer
agtobj contact
pur
Syntax Planning Strategy
bull Divide a nodes relations in Before_set = objpuragtAfter_set =
obj pur
agt
Topo Sort each group pur agt objthis you farmer
Final order this you farmer contact
region
khatav
manchtaluka
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Syntax Planning AlgoStack Before
CurrentAfterCurrent
Output
regionfarmercontact
this you
manchar taluka
mancharregiontalukafarmercontact
regiontalukafarmerContact
this youmanchar
talukafarmercontact
this youmancharregion
you
this
farmer
agtobj
pur
plc
Contact
nam
orregion
khatav
manchar
taluka
nam01
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
All Together (UNL -gt Hindi)
Module Output UNL Expression
obj(contact(
Lexeme Selection
सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav
Case Identification
सपक कसान यह आप तालक मचर खटाव
contact farmer this you region taluka manchar khatav
Morphology Generation
सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav
Function Word Insertion
सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav
Syntax Linearization
इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
How to Evaluate UNL Deconversion
bull UNL -gt Hindibull Reference Translation needs to be generated from
UNLndash Needs expertise
bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect
bull Note that fidelity is not an issue
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Manual Evaluation Guidelines
Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible
Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Sample Translations
Hindi Output Fluency Adequacy
कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए
2 3
पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए
3 4
हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता
1 1
जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4
इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह
2 3
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Results BLEU Fluency Adequacy
Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050
Pearson Cor Fluency 059 100 068
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Num
ber o
f se
nten
ces
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone
Caution Domain diversity Speaker diversity
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124
Summary interlingua based MT
bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and
tools
124