Semi-automatic Annotation of the Romanian TimeBank 1.2. CALP07 workshop @ RANLP. Corina Forăscu, Radu Ion, Dan Tufiş. Faculty of Computer Science, Al.I. Cuza University of Iaşi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy. [email protected], {radu, tufis}@racai.ro

Post on 22-Dec-2015


TRANSCRIPT

  • Slide 1
  • Semi-automatic Annotation of the Romanian TimeBank 1.2. CALP07 workshop @ RANLP. Corina Forăscu, Radu Ion, Dan Tufiş. Faculty of Computer Science, Al.I. Cuza University of Iaşi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy. [email protected], {radu, tufis}@racai.ro
  • Slide 2
  • Outline
      1. Fundamentals
      2. TimeML & TimeBank
      3. Corpus processing: (1) translation, (2) pre-processing, (3) alignment, (4) annotation import
      4. Conclusions
  • Slide 3
  • Fundamentals. Temporal information in natural language:
      1. Time-denoting expressions: references to a calendar or clock system, expressed by NPs, PPs, or AdvPs (the 23rd of May, 1998; Monday; tomorrow; the second semester).
      2. Event-denoting expressions: references to an event, expressed by
          • sentences, more precisely their syntactic head, the main verb: "John listens to the music."
          • noun phrases: "Israel will ask the USA to delay a military strike against Iraq."
  • Slide 4
  • Motivation (1). NLP applications that benefit:
      • lexicon induction and linguistic investigation, using very large annotated corpora;
      • question answering (questions like when, how often or how long);
      • information extraction or information retrieval;
      • machine translation (translated and normalized temporal references; mappings between the different behavior of tenses from language to language);
      • discourse processing: temporal structure of discourse and summarization.
  • Slide 5
  • Motivation (2). "Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi." (Now he realised that exactly because of this incident he decided suddenly to come home and to begin his journal exactly today.)
  • Slide 6
  • Motivation (3). "Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi."
  • Slide 7
  • State of the Art
      • 1947: Reichenbach: The tenses of verbs
      • 1998: MUC 7
      • 2000: TIMEX
      • 2001: ACL: Temporal and Spatial Information Processing; STAG (Setzer); TIDES 2001: TIMEX2 v.1.0.2
      • 2002: LREC 2002: Annotation Standards for Temporal Information in Natural Language; DAML-Time; TERQAS: TimeML v.1.0
      • 2004: ACE TERN: TIMEX2 v.1.1; TARSQI: TimeML v.1.2
      • 2005: ACE TERN: TIMEX2 v.1.2; ACL 2005: TARSQI system
      • 2006: ACL-COLING WS: ARTE (Annotating and Reasoning about Time and Events); Time Symposium
  • Slide 8
  • TERQAS 2002
      • TimeML v.1.0: a metadata standard for marking events, their temporal anchoring and links in news articles
      • TimeBank corpus v.1.0
      • guidelines for temporal annotation
  • Slide 9
  • Outline
      1. Fundamentals
      2. TimeML & TimeBank
      3. Corpus processing: (1) translation, (2) pre-processing, (3) alignment, (4) annotation import
      4. Conclusions
  • Slide 10
  • TimeML v.1.2. A metadata standard developed especially for news articles, for marking:
      • events: EVENT, MAKEINSTANCE
      • temporal anchoring of events: TIMEX3, SIGNAL
      • links between events and/or timexes: TLINK, ALINK, SLINK
  • Slide 11
  • Events (1)
      • situations that happen or occur; states or circumstances in which something obtains or holds true
      • expressed by tensed verbs, adjectives, nominalizations
      • example: "The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share."
      • 7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION
  • Slide 12
  • Events (2)
      • "The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share."
      • "Analysts say [e28] much of Kellogg's erosion [e204] has been in such core brands as Corn Flakes,..."
  • Slide 13
  • Instances
      • based on the event annotation: how many different instances or realizations a given event has (at least one)
      • an instance carries the tense and aspect of the verb-denoted event
      • example: "John learns [e1] twice on Monday."
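The event/instance split can be illustrated with a minimal sketch. The XML fragment below, its attribute values, and the reading of "twice" as two MAKEINSTANCE elements are illustrative assumptions in the spirit of TimeML, not material taken from TimeBank.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one EVENT with two realizations ("twice").
FRAGMENT = """
<doc>
  <s>John <EVENT eid="e1" class="OCCURRENCE">learns</EVENT> twice on Monday.</s>
  <MAKEINSTANCE eiid="ei1" eventID="e1" tense="PRESENT" aspect="NONE"/>
  <MAKEINSTANCE eiid="ei2" eventID="e1" tense="PRESENT" aspect="NONE"/>
</doc>
"""

def instances_by_event(xml_text):
    """Map every event ID to the list of instance IDs that realize it."""
    root = ET.fromstring(xml_text)
    result = {event.get("eid"): [] for event in root.iter("EVENT")}
    for instance in root.iter("MAKEINSTANCE"):
        result.setdefault(instance.get("eventID"), []).append(instance.get("eiid"))
    return result

INSTANCES = instances_by_event(FRAGMENT)
print(INSTANCES)
```

Keeping tense and aspect on the instance rather than on the event is what lets links later refer to one particular realization.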
  • Slide 14
  • Temporal expressions: TIMEX3 (1). Explicit & implicit temporal expressions:
      • Times: 11 o'clock; midnight
      • Dates:
          • fully specified (May 23, 2006; winter, 2005)
          • underspecified (Monday; next week; last month; two years ago)
      • Durations: two months; three hours
      • Sets: every week; every Tuesday
  • Slide 15
  • Temporal expressions: TIMEX3 (2). Examples: "10/30/89"; "the next two years or so"; "soon"
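As a rough illustration of what normalizing such expressions involves, here is a small sketch that turns a US-style date and a simple duration into ISO 8601-style strings. The helper names, the fixed 1900 century, and the tiny number-word table are assumptions made for the example; underspecified expressions such as "soon" are deliberately not handled.

```python
from datetime import datetime

# Hypothetical lookup tables, just large enough for the examples.
UNIT_CODES = {"year": "Y", "month": "M", "week": "W", "day": "D", "hour": "H"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def normalize_date(text, century=1900):
    """Turn 'mm/dd/yy' into 'YYYY-MM-DD', assuming the given century."""
    parsed = datetime.strptime(text, "%m/%d/%y")
    year = century + parsed.year % 100
    return f"{year:04d}-{parsed.month:02d}-{parsed.day:02d}"

def normalize_duration(text):
    """Turn '<number word> <unit>' into an ISO 8601-style duration."""
    number_word, unit = text.split()
    unit = unit.rstrip("s")  # 'months' -> 'month'
    prefix = "PT" if unit == "hour" else "P"  # clock units follow a 'T'
    return f"{prefix}{NUMBER_WORDS[number_word]}{UNIT_CODES[unit]}"

print(normalize_date("10/30/89"))
print(normalize_duration("two months"))
print(normalize_duration("three hours"))
```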
  • Slide 16
  • Temporal signals: SIGNAL. Function words that indicate how temporal objects are to be related to each other:
      • temporal prepositions, conjunctions and/or modifiers: on, in, at, from, to, before, after, during; before, after, while, when
      • negative expressions
      • modal verbs
      • prepositions signaling modality (to)
      • special characters denoting ranges in temporal expressions: "-" and "/"
  • Slide 17
  • Dependencies: LINKs
      • Temporal relations (TLINK): anchors to time; orders between times and events
      • Aspectual relations (ALINK): phases of an event
      • Subordinating relations (SLINK): events that syntactically subordinate other events
  • Slide 18
  • Temporal relations: TLINK (1)
      • a temporal relation between two temporal elements (event-event, event-timex); EVENTs take part through their INSTANCEs
      • 13 relTypes, as in Allen's interval relations:
          • simultaneous
          • identical
          • one before (/after) the other
          • one immediately before (/after) the other
          • one including / being included in the other
          • one holding during the duration of the other
          • one being the beginning (/ending) of the other
          • one being begun (/ended) by the other
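The thirteen relations come mostly in inverse pairs, which a short sketch can make concrete. The relType spellings below are given as recalled from the TimeML 1.2 specification and should be checked against it; the `invert` helper and the sample BEFORE link between two instance IDs are hypothetical.

```python
# relType names as recalled from TimeML 1.2 (an assumption to verify).
REL_TYPES = [
    "SIMULTANEOUS", "IDENTITY", "BEFORE", "AFTER", "IBEFORE", "IAFTER",
    "INCLUDES", "IS_INCLUDED", "DURING", "BEGINS", "ENDS",
    "BEGUN_BY", "ENDED_BY",
]

# Inverse pairs; DURING is left out because no inverse is assumed for it here.
INVERSE = {
    "BEFORE": "AFTER", "AFTER": "BEFORE",
    "IBEFORE": "IAFTER", "IAFTER": "IBEFORE",
    "INCLUDES": "IS_INCLUDED", "IS_INCLUDED": "INCLUDES",
    "BEGINS": "BEGUN_BY", "BEGUN_BY": "BEGINS",
    "ENDS": "ENDED_BY", "ENDED_BY": "ENDS",
    "SIMULTANEOUS": "SIMULTANEOUS", "IDENTITY": "IDENTITY",
}

def invert(link):
    """Return the same temporal fact stated from the other argument's side."""
    source, rel_type, target = link
    return (target, INVERSE[rel_type], source)

# Hypothetical link: the relation between these two instances is invented.
print(invert(("ei1994", "BEFORE", "ei1995")))
```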
  • Slide 19
  • Temporal relations: TLINK (2)
      • "The oat-bran craze [e190/ei1994] has cost [e189/ei1995] the world's largest cereal maker market share. The company's president quit [e3/ei1996] suddenly."
      • [Diagram nodes: craze (ei1994), cost (ei1995), 10/30/89 (t192), quit (ei1996)]
  • Slide 20
  • Temporal relations: TLINK (3). [Diagram nodes: craze (ei1994), cost (ei1995), 10/30/89 (t192), quit (ei1996)]
  • Slide 21
  • Aspectual relations: ALINK. Relationship between an aspectual event and its argument event:
      • Initiation: "John started [ei5] to read [ei6]."
      • Culmination: "John finished [ei5] assembling [ei6] the table."
      • Termination: "John stopped talking."
      • Continuation: "John kept talking."
  • Slide 22
  • Subordination relations: SLINK. For contexts introducing relations between two events, of type:
      • Modal: "John should have bought some wine."
      • Factive: "John forgot that he was in Boston yesterday."
      • Counterfactive: "John prevented the divorce."
      • Evidential: "John said he bought some wine."
      • Negative evidential: "John denied he bought only beer."
      • Conditional: "If John leaves today, Mary will cry."
  • Slide 23
  • TimeBank 1.2
      • 183 English news report documents, TimeML annotated, distributed through LDC
      • 4715 sentences with 10586 unique lexical units, from a total of 61042 lexical units
      • Non-TimeML markup in TimeBank 1.1: structure information: header; named entity recognition:,,; sentence boundary information:
  • Slide 24
  • TimeBank 1.2: number of TimeML tags
      events      7935
      instances   7940
      timexes     1414
      signals      688
      alinks       265
      slinks      2932
      tlinks      6418
      TOTAL      27592
  • Slide 25
  • Outline
      1. Fundamentals
      2. TimeML & TimeBank
      3. Corpus processing: (1) translation, (2) pre-processing, (3) alignment, (4) annotation import
      4. Conclusions
  • Slide 26
  • Translation
      • 2 trained translators; one final correction
      • Translation desiderata:
          • 1-1 sentence alignment
          • preserving the POS
          • verb tense mapped onto Romanian
          • format of dates, moments of the day and numbers conforming to the norms of written Romanian
      • Result: 4715 sentences (translation units) and 65375 lexical tokens, including punctuation marks, representing 12640 lexical types
  • Slide 27
  • Preprocessing the corpus
      • Tokenisation: MtSeg, with idiomatic expressions and clitic splitting
      • POS-tagging: TnT, adapted & improved to determine the POS of unknown words
      • Lemmatisation: probabilistic, based on a lexicon
      • Chunking: REs over POS tags to determine non-recursive NPs, APs, AdvPs, PPs
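The chunking step, regular expressions over POS tags, can be sketched as follows. The one-letter tagset (D = determiner, A = adjective, N = noun, V = verb) and the noun-phrase pattern are hypothetical stand-ins chosen for the example; they are not the tagset or the rules used for the corpus.

```python
import re

# Hypothetical non-recursive NP pattern: optional determiner, adjectives, nouns.
NP_PATTERN = re.compile(r"D?A*N+")

def np_chunks(tagged):
    """Find NP chunks in a list of (word, one-letter POS tag) pairs."""
    # One character per token, so regex offsets are also token offsets.
    tag_string = "".join(tag for _, tag in tagged)
    return [
        [word for word, _ in tagged[match.start():match.end()]]
        for match in NP_PATTERN.finditer(tag_string)
    ]

TAGGED = [
    ("the", "D"), ("largest", "A"), ("cereal", "N"), ("maker", "N"),
    ("lost", "V"), ("market", "N"), ("share", "N"),
]
CHUNKS = np_chunks(TAGGED)
print(CHUNKS)
```

Encoding each tag as a single character is the design choice that keeps the mapping from regex matches back to tokens trivial; a real tagset with longer tags would need a delimiter-based encoding instead.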
  • Slide 28
  • Alignment
      • YAWA: 4 stages, evaluated over the data of the Shared Task on Word Alignment, Romanian-English track, organized at ACL 2005
      • Current: P = 88.80%, R = 74.83%, F = 81.22%
      • 91714 alignments, manually checked, out of which 25346 are NULL-alignments
  • Slide 29
  • Alignment: the 4 stages
      1. Content words alignment: based on the translation lexicons; P = 94.08%, R = 34.99%, F = 51.00%
      2. Inside-chunks alignment: simple empirical rules to align the words within the corresponding chunks; P = 89.90%, R = 53.90%, F = 67.40%
      3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions
      4. Correction phase: the wrong links, introduced mainly in stage 3, are removed
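The F figures quoted for the aligner are consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition; the function name is ours. Recomputing from the rounded P and R reproduces the overall 81.22 exactly and the two per-stage values to within 0.01, a difference that is plausibly due to rounding of the inputs.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R) pairs in percent, as quoted for the aligner.
for label, p, r in [
    ("overall", 88.80, 74.83),
    ("stage 1", 94.08, 34.99),
    ("stage 2", 89.90, 53.90),
]:
    print(label, round(f_measure(p, r), 2))
```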
  • Slide 30
  • Alignment
  • Slide 31
  • Alignment. The parallel corpus = 183 files in XCES format. Example pair:
      • RO: "Pe_de_altă_parte, se dovedeşte a fi altă săptămână financiară foarte proastă"
      • EN: "On_the_other_hand, it 's turning out to be another very bad financial week"
  • Slide 32
  • Annotation import. Based on the Romanian-English lexical alignment.
  • Slide 33
  • Annotation import. For every pair of sentences Sro and Sen from the TimeBank parallel corpus, with Ten the equivalent English sentence:
      1. construct a list E of pairs of English text fragments with sequences of English indexes from Sen and Ten. E = {,,,,, }.
  • Slide 34
  • Annotation import (continued)
      2. add to every element of E the XML context in which that text fragment appeared in the original English TimeBank. E = {,, }
      3. construct the list RW of Romanian words along with the transferred XML contexts, using E and the lexical alignment between Sro and Sen. If a word in Sro is not aligned, the top context for it, namely s, is considered. RW = {,, }.
  • Slide 35
  • Annotation import (continued)
      4. construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
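Steps 3 and 4 of the import procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: XML contexts are represented as plain strings, each Romanian word is aligned to at most one English word, and the Romanian sentence and its alignment are invented for the example.

```python
from itertools import groupby

def transfer_contexts(ro_words, en_contexts, alignment, top_context="s"):
    """Step 3: pair each Romanian word with the XML context of its English
    counterpart; unaligned words get the top context."""
    rw = []
    for index, word in enumerate(ro_words):
        en_index = alignment.get(index)
        context = en_contexts[en_index] if en_index is not None else top_context
        rw.append((word, context))
    return rw

def conflate(rw):
    """Step 4: merge adjacent words that share the same XML context."""
    return [
        (" ".join(word for word, _ in group), context)
        for context, group in groupby(rw, key=lambda pair: pair[1])
    ]

# English tokens: The(0) president(1) quit(2) suddenly(3) .(4)
EN_CONTEXTS = ["s", "s", "s/EVENT[eid=e3]", "s", "s"]
# Hypothetical Romanian translation and word alignment (ro index -> en index).
RO_WORDS = ["Preşedintele", "a", "demisionat", "brusc", "."]
ALIGNMENT = {0: 1, 1: 2, 2: 2, 3: 3, 4: 4}

RW = transfer_contexts(RO_WORDS, EN_CONTEXTS, ALIGNMENT)
R = conflate(RW)
print(R)
```

Note that the two Romanian tokens "a demisionat" end up inside a single EVENT fragment because both are aligned to "quit" and are adjacent.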
  • Slide 36
  • Annotation import. Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer kept only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian.
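One way to read this filtering rule is as a two-pass check, sketched below. The dictionary representation of tags, the `refs` field, the two-pass order, and the sample links are all assumptions made for illustration; only the IDs themselves echo the examples used earlier.

```python
def filter_offline(instances, links, transferred_ids):
    """Keep MAKEINSTANCE tags whose event was transferred, then keep links
    whose referenced IDs all survive."""
    kept_instances = [m for m in instances if m["eventID"] in transferred_ids]
    known_ids = set(transferred_ids) | {m["eiid"] for m in kept_instances}
    kept_links = [
        link for link in links
        if all(ref in known_ids for ref in link["refs"])
    ]
    return kept_instances, kept_links

# Inline IDs assumed to have reached the Romanian side (e3 did not).
TRANSFERRED = {"e189", "e190", "t192"}
INSTANCES = [
    {"eiid": "ei1994", "eventID": "e190"},
    {"eiid": "ei1995", "eventID": "e189"},
    {"eiid": "ei1996", "eventID": "e3"},
]
# Hypothetical links; which IDs they connect is invented for the example.
LINKS = [
    {"tag": "TLINK", "refs": ["ei1994", "ei1995"]},
    {"tag": "TLINK", "refs": ["ei1996", "t192"]},
]

KEPT_INSTANCES, KEPT_LINKS = filter_offline(INSTANCES, LINKS, TRANSFERRED)
print(KEPT_INSTANCES, KEPT_LINKS)
```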
  • Slide 37
  • Annotation import: transferred TimeML tags
      TimeML tags   transferred   %
      events        7703          97.07
      instances     7706          97.05
      timexes       1356          95.89
      signals        668          97.09
      alinks         249          93.96
      slinks        2831          96.55
      tlinks        6122          95.38
      TOTAL        26635          96.53
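The percentages can be recomputed from the transferred counts and the original TimeBank 1.2 tag counts. In this sketch the percentages are truncated to two decimals rather than rounded, because that is what appears to reproduce the reported figures (for example 7703 / 7935 is 97.076...%); the truncation is our inference, not something the source states.

```python
# Tag counts in the English TimeBank 1.2 and in the Romanian transfer.
ORIGINAL = {"events": 7935, "instances": 7940, "timexes": 1414,
            "signals": 688, "alinks": 265, "slinks": 2932, "tlinks": 6418}
TRANSFERRED = {"events": 7703, "instances": 7706, "timexes": 1356,
               "signals": 668, "alinks": 249, "slinks": 2831, "tlinks": 6122}

def percent_truncated(part, whole):
    """Percentage truncated (not rounded) to two decimals."""
    return (part * 10000 // whole) / 100

PERCENTAGES = {
    tag: percent_truncated(TRANSFERRED[tag], ORIGINAL[tag]) for tag in ORIGINAL
}
TOTAL_TRANSFERRED = sum(TRANSFERRED.values())
TOTAL_ORIGINAL = sum(ORIGINAL.values())
TOTAL_PERCENT = percent_truncated(TOTAL_TRANSFERRED, TOTAL_ORIGINAL)

for tag, value in PERCENTAGES.items():
    print(f"{tag:10s} {TRANSFERRED[tag]:6d} {value:6.2f}")
print(f"{'TOTAL':10s} {TOTAL_TRANSFERRED:6d} {TOTAL_PERCENT:6.2f}")
```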
  • Slide 38
  • Conclusions & future work
      • improve & evaluate the annotation transfer
      • study the adequacy of temporal theories to Romanian
      • (semi-)automatic mark-up of the temporal information in Romanian texts (news + literature)
  • Slide 39
  • Thank you! (Temporal) Questions?