Semi-automatic Annotation of the Romanian TimeBank 1.2

CALP07 workshop @ RANLP
Corina Forăscu, Radu Ion, Dan Tufiş
Faculty of Computer Science, Al.I. Cuza University of Iaşi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy
[email protected], {radu, tufis}@racai.ro

Temporal Information in Natural Languages or Is TIME in Romanian the same?
Corina Forăscu, Radu Ion, Dan Tufiş
Time-denoting expressions – references to a calendar or clock system
expressed by NPs, PPs, or AdvPs
the 23rd of May, 1998; Monday; tomorrow; the second semester
Event-denoting expressions – references to an event
expressed by
John listens to the music.
noun phrases:
Israel will ask the USA to delay a military strike against Iraq.
question answering (questions like when, how often or how long);
information extraction or information retrieval;
machine translation (translated and normalized temporal references; mappings between different behavior of tenses from language to language);
discourse processing: temporal structure of discourse and summarization.
Motivation (2)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
Now he realised that it was exactly because of this incident that he had suddenly decided to come home and begin his journal exactly today.
Motivation (3)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
<TIMEX3 temporalFunction="true" tid="t152" type="TIME" value="PRESENT_REF">Acum</TIMEX3> îşi <EVENT aspect="PROGRESSIVE" class="OCCURRENCE" eid="e153" tense="PAST">dădea</EVENT><MAKEINSTANCE eiid="ei59" eid="e153" cardinality="1" /> seama <SIGNAL sid="s154">că</SIGNAL>
tocmai din cauza acestui <EVENT aspect="NONE" class="OCCURRENCE" eid="e156" tense="NONE">incident</EVENT> <MAKEINSTANCE eiid="ei60" eid="e156" cardinality="1" />
se <EVENT aspect="PERFECTIVE" class="I_ACTION" eid="e157" tense="PAST">hotărâse</EVENT><MAKEINSTANCE eiid="ei61" eid="e157" cardinality="1" /> el brusc
<SIGNAL sid="s54">să</SIGNAL><EVENT aspect="NONE" class="OCCURRENCE" eid="e159" tense="PRESENT">vină</EVENT><MAKEINSTANCE eiid="ei62" eid="e159" cardinality="1" /> acasă
<SIGNAL sid="s160">şi</SIGNAL><SIGNAL sid="s55">să</SIGNAL> -şi <EVENT aspect="NONE" class="ASPECTUAL" eid="e161" tense="PRESENT">înceapă</EVENT> <MAKEINSTANCE eiid="ei63" eid="e161" cardinality="1" /> jurnalul
taman <TIMEX3 temporalFunction="true" tid="t162" type="DATE" value="1984-04-04">astăzi</TIMEX3> .
  <TLINK eventInstanceID="ei59" relatedToTime="t152" relType="SIMULTANEOUS" />
  <TLINK eventInstanceID="ei60" relatedToEvent="e157" relType="BEFORE" />
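As a rough illustration of how such inline markup can be read back, the sketch below parses a shortened fragment modelled on the annotated sentence above and resolves one TLINK to the words it connects. The wrapping `<s>` element, the function name `summarize` and the shortened fragment are assumptions made for the example; the attribute names are the ones shown on the slide.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment modelled on the slide (the <s> wrapper is assumed).
FRAGMENT = """<s>
<TIMEX3 tid="t152" type="TIME" value="PRESENT_REF">Acum</TIMEX3> îşi
<EVENT eid="e153" class="OCCURRENCE" tense="PAST" aspect="PROGRESSIVE">dădea</EVENT>
<MAKEINSTANCE eiid="ei59" eid="e153" cardinality="1" /> seama
<TLINK eventInstanceID="ei59" relatedToTime="t152" relType="SIMULTANEOUS" />
</s>"""


def summarize(xml_text):
    """Return (event word, relation, time expression) for each event-to-time TLINK."""
    root = ET.fromstring(xml_text)
    events = {e.get("eid"): e.text for e in root.iter("EVENT")}
    timexes = {t.get("tid"): t.text for t in root.iter("TIMEX3")}
    # MAKEINSTANCE ties an instance ID to the event it realizes.
    instances = {m.get("eiid"): m.get("eid") for m in root.iter("MAKEINSTANCE")}
    links = []
    for link in root.iter("TLINK"):
        time_id = link.get("relatedToTime")
        if time_id is None:
            continue  # event-to-event links are ignored in this sketch
        event_word = events[instances[link.get("eventInstanceID")]]
        links.append((event_word, link.get("relType"), timexes[time_id]))
    return links


print(summarize(FRAGMENT))
```

Going through MAKEINSTANCE before looking up the event mirrors the way links refer to instances rather than directly to EVENT tags.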
Timeline, 1998 to 2006 (only these entries survive):
TIDES 2001: TIMEX2 v.1.0.2
LREC 2002: Annotation Standards for Temporal Information in Natural Language
marking events,
A metadata standard developed especially for news articles, for marking
Events: EVENT, MAKEINSTANCE
links between events and/or timexes: TLINK, ALINK, SLINK
Events (1)
situations that happen or occur, states or circumstances in which something obtains or holds true
tensed verbs, adjectives, nominalizations
The oat-bran craze e190 has cost e189 the world's largest cereal maker market share.
7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION
Instances
Based on the event annotation: how many different instances or realizations a given event has (at least one)
Carries the tense and aspect of the verb-denoted event
John learns e1 twice on Monday.
<MAKEINSTANCE eiid="ei1" eventID="e1" signalID="s1" cardinality="2" aspect="NONE" tense="PRESENT" />
• Dates:
Underspecified (Monday; next week; last month; two years ago)
• Durations: two months; three hours
• Sets: every week; every Tuesday
<TIMEX3 mod="APPROX" tid="t220" type="DURATION" temporalFunction="true" functionInDocument="NONE" value="P2Y" anchorTimeID="t192" >the next two years or so</TIMEX3>
<TIMEX3 tid="t207" type="DATE" temporalFunction="true" functionInDocument="NONE" value="FUTURE_REF" anchorTimeID="t192" >soon</TIMEX3>
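The `value` attributes above mix ISO 8601 style durations (`P2Y`) with symbolic tokens (`FUTURE_REF`). A minimal sketch of telling them apart follows; the function name `describe_value` and the three-way classification are assumptions made for illustration, and the pattern covers only plain numeric durations, not every value TimeML permits.

```python
import re

# Plain numeric durations such as P2Y, P2M, PT3H (a subset of what TIMEX3 allows).
_DURATION = re.compile(
    r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)
_UNITS = ("years", "months", "weeks", "days", "hours", "minutes", "seconds")
_REFS = {"PAST_REF", "PRESENT_REF", "FUTURE_REF"}


def describe_value(value):
    """Classify a TIMEX3 value as a symbolic reference, a duration, or other."""
    if value in _REFS:
        return ("reference", value)
    match = _DURATION.match(value)
    if match and any(match.groups()):
        amounts = {u: int(g) for u, g in zip(_UNITS, match.groups()) if g}
        return ("duration", amounts)
    return ("other", value)  # e.g. calendar dates such as 1984-04-04


print(describe_value("P2Y"))
print(describe_value("FUTURE_REF"))
```

The `mod="APPROX"` attribute on the first example is what records the "or so"; the value itself stays an exact two years.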
Temporal signals: SIGNAL
Function words that indicate how temporal objects are to be related to each other:
temporal prepositions, conjunctions and/or modifiers: on, in, at, from, to, before, after, during; before, after, while, when
negative expressions
modal verbs
Aspectual Relations: ALINK
EVENTs – through their INSTANCEs
Simultaneous
Identical
One immediately before (+after) the other
One including / being included in the other
One holding during the duration of the other
One being the beginning (/ending) of the other
One being begun (/ended) by the other
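Most of the relations listed above come in converse pairs, so a link stated in one direction can be read in the other. The sketch below assumes the usual TimeML `relType` names for these relations (BEFORE/AFTER, IBEFORE/IAFTER, INCLUDES/IS_INCLUDED, BEGINS/BEGUN_BY, ENDS/ENDED_BY, SIMULTANEOUS, IDENTITY); the table `INVERSE` and the helper `invert_tlink` are hypothetical names, and DURING is left out on purpose because its converse is not given in the slides.

```python
# Converse pairs; symmetric relations map to themselves.
INVERSE = {
    "BEFORE": "AFTER",
    "IBEFORE": "IAFTER",
    "INCLUDES": "IS_INCLUDED",
    "BEGINS": "BEGUN_BY",
    "ENDS": "ENDED_BY",
    "SIMULTANEOUS": "SIMULTANEOUS",
    "IDENTITY": "IDENTITY",
}
# Add the reverse direction of every pair (AFTER -> BEFORE, and so on).
INVERSE.update({converse: rel for rel, converse in list(INVERSE.items())})


def invert_tlink(source, rel_type, target):
    """Restate 'source rel_type target' from the point of view of target."""
    return (target, INVERSE[rel_type], source)


print(invert_tlink("ei1996", "IS_INCLUDED", "ei1995"))
```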
The company's president quit e3 /ei1996 suddenly.
Diagram (layout lost): craze (ei1994), cost (ei1995), quit (ei1996) and the date 10/30/89 (t192)
<TLINK relatedToTime="t192" eventInstanceID="ei1996" relType="BEFORE" />
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1996" relType="IS_INCLUDED" />
Initiation: John started ei5 to read ei6.
<ALINK eventInstanceID="ei5" relatedToEventInstance="ei6" relType="INITIATES"/>
Culmination: John finished ei5 assembling ei6 the table.
<ALINK eventInstanceID="ei5" relatedToEventInstance="ei6" relType="CULMINATES"/>
Termination: John stopped talking.
Continuation: John kept talking.
Modal: John should have bought some wine.
Factive: John forgot that he was in Boston yesterday.
Counterfactive: John prevented the divorce.
Evidential: John said he bought some wine.
Negative evidential: John denied he bought only beer.
Conditional: If John leaves today, Mary will cry.
183 English news reports annotated with TimeML, distributed through the LDC
4715 sentences with 10586 unique lexical units, from a total of 61042 lexical units
Non-TimeML markup in TimeBank 1.1:
structure information: header
sentence boundary information: <s>
Translation desiderata:
Verb tense – mapped onto Romanian
Format of the dates, moments of day and numbers conforms to the norms of written Romanian
4715 sentences (translation units), 65375 lexical tokens, including punctuation marks, representing 12640 lexical types
Tokenisation – MtSeg, with idiomatic expressions, clitic splitting
POS-tagging – TnT adapted & improved to determine the POS of unknown words
Lemmatisation – probabilistic, based on a lexicon
Chunking – REs over POS tags to determine non-recursive NPs, APs, AdvPs, PPs
Alignment
YAWA: 4 stages, evaluated on the data of the Shared Task on Word Alignment (Romanian-English track) organized at ACL 2005
Current: P = 88.80%, R = 74.83%, F = 81.22%
91714 alignments, manually checked, out of which 25346 are NULL-alignments
P = 94.08%, R = 34.99%, F = 51.00%.
2. Inside-Chunks alignment: simple empirical rules to align the words within the corresponding chunks;
P = 89.90%, R = 53.90%, F = 67.40%
3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions
4. Correction phase: the wrong links introduced mainly in stage 3 are now removed.
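The F figures quoted for the aligner are consistent with the balanced F-measure, F = 2PR / (P + R). The short check below recomputes F from each reported P and R pair; the function name `f_measure` is an assumption, and small differences in the last digit are expected because the slides give P and R already rounded.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F as reported on the slides), in order of appearance.
REPORTED = [
    (88.80, 74.83, 81.22),
    (94.08, 34.99, 51.00),
    (89.90, 53.90, 67.40),
]

for p, r, f_reported in REPORTED:
    print(f"P={p:.2f} R={r:.2f} recomputed F={f_measure(p, r):.2f} reported F={f_reported:.2f}")
```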
<tu id="1">
<seg lang="ro">
<s id="Timex.ro.1">
  <w lemma="pe_de_altă_parte" ana="14+,R" chunk="Ap#1">Pe_de_altă_parte</w>
  <c>,</c>
  <w lemma="sine" ana="12+,PXA" chunk="Vp#1">se</w>
  <w lemma="dovedi" ana="1+,V3" chunk="Vp#1">dovedeşte</w>
  <w lemma="a" ana="15+,QN" chunk="Vp#2">a</w>
  <w lemma="fi" ana="1+,VN" chunk="Vp#2">fi</w>
  <w lemma="alt" ana="22+,PI" chunk="Np#1">altă</w>
  <w lemma="săptămână" ana="1+,NSRN" chunk="Np#1">săptămână</w>
  …
  <w lemma="on_the_other_hand" ana="14+,ADVE" chunk="Ap#1">On_the_other_hand</w>
<c>,</c>
<w lemma="it" ana="13+,PPER3" chunk="Vp#1">it</w>
<w lemma="be" ana="3+,AUX3" chunk="Vp#1">'s</w>
  <w lemma="turn" ana="1+,PPRE" chunk="Vp#1">turning</w>
  <w lemma="out" ana="5+,PREP">out</w>
<w lemma="to" ana="15+,TO" chunk="Vp#2">to</w>
<w lemma="be" ana="1+,VINF" chunk="Vp#2">be</w>
<w lemma="another" ana="22+,PI">another</w>
  <w lemma="very" ana="14+,ADVE" chunk="Ap#2">very</w>
  <w lemma="bad" ana="1+,ADJE" chunk="Ap#2,Np#1">bad</w>
  <w lemma="financial" ana="1+,ADJE" chunk="Ap#2,Np#1">financial</w>
  <w lemma="week" ana="1+,NN" chunk="Np#1">week</w>
  … </s>
</seg>
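A translation unit in this format can be read with a standard XML parser. The sketch below uses a small well-formed sample built from a few of the tokens above; the closing tags, the English sentence ID `Timex.en.1`, the function name `tokens_by_language` and the reading of the part after the comma in `ana` as the tag are assumptions, since the excerpt on the slide is elided.

```python
import xml.etree.ElementTree as ET

# Minimal well-formed sample; the en sentence id and closing tags are assumed.
SAMPLE = """<tu id="1">
<seg lang="ro"><s id="Timex.ro.1">
  <w lemma="sine" ana="12+,PXA" chunk="Vp#1">se</w>
  <w lemma="dovedi" ana="1+,V3" chunk="Vp#1">dovedeşte</w>
  <c>,</c>
</s></seg>
<seg lang="en"><s id="Timex.en.1">
  <w lemma="it" ana="13+,PPER3" chunk="Vp#1">it</w>
  <w lemma="out" ana="5+,PREP">out</w>
</s></seg>
</tu>"""


def tokens_by_language(tu_xml):
    """Map each language to a list of (form, lemma, tag, chunk list) tuples."""
    tu = ET.fromstring(tu_xml)
    result = {}
    for seg in tu.iter("seg"):
        rows = []
        for w in seg.iter("w"):  # punctuation (<c>) is skipped
            tag = w.get("ana", "").split(",")[-1]
            chunk = w.get("chunk")
            chunks = chunk.split(",") if chunk else []  # a word may sit in several chunks
            rows.append((w.text, w.get("lemma"), tag, chunks))
        result[seg.get("lang")] = rows
    return result


for lang, rows in tokens_by_language(SAMPLE).items():
    print(lang, rows)
```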
Annotation import
For every pair of aligned sentences Sro and Sen from the TimeBank parallel corpus, with Ten the equivalent English sentence from the original TimeBank:
1. construct a list E of pairs of English text fragments with sequences of English indexes from Sen and Ten.
E = {<"In the"; 1,2>, <"Philippines"; 3>, <", a"; 4,5>, <"four"; 6>, <"year"; 7>, <"low ."; 8,9>}.
2. add to every element of E the XML context in which that text fragment appeared in the original English TimeBank.
E' = {<"In the"; 1,2; s>, <"Philippines"; 3; s, ENAMEX>, …}
3. construct the list RW of Romanian words along with the transferred XML contexts using E' and the lexical alignment between Sro and Sen. If a word in Sro is not aligned, the top context for it, namely s, is considered.
RW = {<"În"; s>, <"Filipine"; s, ENAMEX>, …}.
4. construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
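Steps 2 to 4 can be sketched as follows, under stated assumptions: the alignment is given as a mapping from Romanian token positions (0-based) to English token indexes (1-based, as in E), contexts are tuples of tag names, and attributes are ignored. The names `transfer` and `render`, and the two trailing Romanian tokens left unaligned, are illustrative only and not taken from the original system.

```python
from itertools import groupby


def transfer(ro_tokens, alignment, e_prime, top=("s",)):
    """Build RW (step 3) and the conflated list R (step 4)."""
    # English index -> XML context of the fragment containing it (from E').
    context_of = {i: ctx for _text, indexes, ctx in e_prime for i in indexes}
    # Unaligned Romanian words fall back to the top context, namely s.
    rw = [(tok, context_of.get(alignment.get(pos), top))
          for pos, tok in enumerate(ro_tokens)]
    # Conflate adjacent words that share the same context.
    r = [(" ".join(tok for tok, _ in group), ctx)
         for ctx, group in groupby(rw, key=lambda pair: pair[1])]
    return rw, r


def render(r):
    """Output R as XML, wrapping each fragment in the tags below s."""
    parts = []
    for text, ctx in r:
        for tag in reversed(ctx[1:]):
            text = f"<{tag}>{text}</{tag}>"
        parts.append(text)
    return "<s>" + " ".join(parts) + "</s>"


E_PRIME = [("In the", (1, 2), ("s",)), ("Philippines", (3,), ("s", "ENAMEX"))]
RO_TOKENS = ["În", "Filipine", ",", "un"]  # last two tokens: hypothetical, unaligned
ALIGNMENT = {0: 1, 1: 3}                   # ro position -> en index

rw, r = transfer(RO_TOKENS, ALIGNMENT, E_PRIME)
print(render(r))
```

Grouping only adjacent words keeps the Romanian word order intact, at the cost of splitting an annotation whenever its words are not contiguous after translation.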
Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer kept only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian
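One possible reading of this filter is sketched below. It assumes that a tag is kept only when every ID it refers to has been transferred, and that instance IDs become known once their MAKEINSTANCE is kept; both are assumptions, as are the function name `filter_offline`, the list of reference attributes and the sample IDs marked as transferred.

```python
# Attributes that point at other annotated structures (assumed list).
REF_ATTRS = ("eventID", "eventInstanceID", "relatedToEventInstance",
             "relatedToTime", "timeID", "signalID", "subordinatedEventInstance")


def filter_offline(tags, transferred_ids):
    """Keep offline tags whose referenced IDs were all transferred."""
    known = set(transferred_ids)
    kept = []
    # First pass: instances of transferred events; their eiid becomes known.
    for name, attrs in tags:
        if name == "MAKEINSTANCE" and attrs.get("eventID") in known:
            kept.append((name, attrs))
            known.add(attrs["eiid"])
    # Second pass: links whose every reference resolves to a known ID.
    for name, attrs in tags:
        if name == "MAKEINSTANCE":
            continue
        refs = [attrs[a] for a in REF_ATTRS if a in attrs]
        if refs and all(ref in known for ref in refs):
            kept.append((name, attrs))
    return kept


TAGS = [
    ("MAKEINSTANCE", {"eiid": "ei1996", "eventID": "e3"}),
    ("MAKEINSTANCE", {"eiid": "ei1995", "eventID": "e189"}),
    ("TLINK", {"eventInstanceID": "ei1996", "relatedToTime": "t192", "relType": "BEFORE"}),
    ("TLINK", {"eventInstanceID": "ei1996", "relatedToEventInstance": "ei1995",
               "relType": "IS_INCLUDED"}),
]
# Hypothetical case: event e189 was not transferred to Romanian.
for name, attrs in filter_offline(TAGS, {"e3", "t192"}):
    print(name, attrs)
```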
adequacy of temporal theories for Romanian
(semi-)automatic mark-up of the temporal information in Romanian texts (news + literature)