Coreferencing Treebank data using CESAC
Annotating and analysing IS in corpora of historical English
Berlin, 13-14 November 2009
Overview
• CESAC
- Goals
- Coreference types: operationalizing IS
- Input and output
• Inter-rater agreement
• Example
• Summary and conclusion
Coreferencing using Cesac: Contents
CESAC goals
• Overall goal
- Referring from any one constituent to any other constituent
• More specifically
- Source and destination: IP/phrase/node or DP/lexeme/endnode
- Attributes
1. Coreference type
2. Distance measure
3. (NP type: definite, indefinite, etc.)
4. (Animacy)
CESAC coreference types: operationalizing IS
• Two basic rules
- Do not omit possible coreference information
- A source should be linked to the nearest possible destination
• Labels encoding different forms of anaphoricity
- Identity
‘Jacqueline plays the cello. She is an amazing musician’
- Cross Speech
‘John said to Paul: “Why don’t you play the guitar?”’
- Inferred
‘Do you see that house? They say the kitchen is extremely spacious’
- World knowledge (separate category)
‘According to Burt Reynolds, all dogs go to heaven’
• Cross Speech > Identity > Inferred
• Encoding facts vs. encoding interpretations: objective data
CESAC input 1
• Standard Penn-Treebank format
- Collection of <nodes>
- Each <node> consists of
1. Brackets: (…)
2. Label: (NP …)
3. Other node: (NP (N …) )
4. Lexeme: (N man)
5. Possibly <lexeme>+<node>: (P to (NP him))
- Attributes in label
(NP-ACC (PRO^A hine))
- Extra-textual data in CODE nodes
(CODE <TEXT: +tyl+aste>)
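The node structure listed above can be read with a small recursive parser. The following is a minimal sketch, not CESAC's own code: the function names `tokenize` and `parse` and the tuple representation `(label, children)` are assumptions made for illustration, and it assumes that lexemes never contain parentheses.

```python
import re


def tokenize(text):
    # Parentheses are tokens of their own; everything else splits on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens, pos=0):
    """Parse one bracketed node starting at tokens[pos].

    Returns ((label, children), next_pos). Children are either plain
    strings (lexemes) or nested (label, children) tuples (nodes).
    Hypothetical helper, assuming balanced brackets in the input.
    """
    if tokens[pos] != "(":
        raise ValueError("a node must start with an opening bracket")
    pos += 1
    label = ""  # the outer wrapper "( (CODE ...) ...)" has no label
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)  # nested node
        else:
            child = tokens[pos]  # lexeme
            pos += 1
        children.append(child)
    return (label, children), pos + 1


# The <lexeme>+<node> case from the list above.
tree, _ = parse(tokenize("(P to (NP him))"))
print(tree)
```

Keeping lexemes as strings and nodes as tuples makes the "possibly <lexeme>+<node>" case fall out naturally, since a child list may mix both kinds.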
CESAC input 2
[Callouts on the example: EndNode, Node]
( (CODE <T06080009600,11.4>)
  (IP-MAT (CONJ And)
          (NP-NOM (D^N +t+at) (N^N folc))
          (NP-ACC (PRO^A hine))
          (ADVP-TMP (ADV^T +ta))
          (PP (P mid) (NP-DAT (ADJ^D unasecgendlicre) (N^D wur+dmynte)))
          (PP (P to) (NP-DAT (N^D scipe)))
          (VBDI gel+addon)
          (. ,))
  (ID coapollo,ApT:11.4.183))
( (IP-MAT (CONJ and)
          (NP-NOM (NR^N Apollonius))
          (NP-ACC-1 (PRO^A hi))
          (VBDI b+ad)
          (IP-INF (NP-ACC-SBJ *ICH*-1)
                  (QP-ACC (Q^A ealle))
                  (VB $gretan)))
  (ID coapollo,ApT:11.4.184))
[Callouts on the example: matching brackets, Label]
CESAC output 1
• Penn-Treebank format
• Enriched with coreference information
- Source node ID
- Destination node ID
- Coreference type
- Coreference distance – derivable
• Destination node example
(NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))
• Source node example
(NP-OB1
(CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>)
(PRO hem) )
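The attributes in such a CODE node can be pulled out with a regular expression. This is a hedged sketch rather than part of CESAC: the function name `coref_attributes` is hypothetical, and it assumes that every attribute has the form name="value" with the parts of the tag joined by underscores, as in the two examples above.

```python
import re

# An attribute is a run of letters, an equals sign and a quoted value.
# "Coref" itself is skipped because it is followed by "_" rather than "=".
ATTRIBUTE = re.compile(r'([A-Za-z]+)="([^"]*)"')


def coref_attributes(code_text):
    """Return the attributes of one <Coref_.../> string as a dict.

    Hypothetical helper; the attribute names are taken as found in the text.
    """
    return dict(ATTRIBUTE.findall(code_text))


source = coref_attributes('<Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>')
destination = coref_attributes('<Coref_Id="339"_/>')
print(source)
print(destination)
```

In this reading, a node that has an `Id` but no `Ref` (like the destination example) is linked to by other nodes without linking further itself.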
CESAC output 2
[Tree diagrams, flattened here to bracket notation: each node without and with its CODE node]
Destination node
(NP-SBJ (NPR Crist))
(NP-SBJ (CODE <Coref Id="310">) (NPR Crist))
Source node
(NP-OB1 (PRO hem))
(NP-OB1 (CODE <Coref Id="340" Ref="310" Type="Identity">) (PRO hem))
<node> = one-or-more <node> OR <lexeme>
<lexeme> + <node>
CESAC output 3
( (IP-MAT (CONJ and)
          (NP-NOM *con* (CODE <Coref_Id="1488"_Ref="1489"_Type="Identity"_NdDist="12"_/>))
          (VBD l+adde)
          (NP-ACC (CODE <Coref_Id="1476"_Ref="1477"_Type="Identity"_NdDist="8"_/>) (PRO^A hine))
          (PP (P mid)
              (NP-DAT-RFL (CODE <Coref_Id="1487"_Ref="1488"_Type="Identity"_NdDist="6"_/>) (PRO^D him)))
          (PP (P to)
              (NP-DAT (PRO$ his (CODE <Coref_Id="1486"_Ref="1487"_Type="Identity"_NdDist="5"_/>))
                      (N^D huse))))
  (ID coapollo,ApT:12.16.209))
CESAC output 4
[Tree diagram of a source node inside a PP, drawn twice: PP, P "to", NP-DAT, PRO$ "his"; the second drawing adds the CODE node <Coref Id="20" Ref="21" Type="Identity">]
<node> = one-or-more <node> OR <node> <lexeme> OR <lexeme>
Inter-rater agreement 1
• Two features measured
- Coreference destination (node ID)
- Coreference type
• Adapted version of Cohen’s kappa: κ > .6
• Two important problems
- Identity vs. cross speech
- Omission of link
• Solutions
- Create new rule(s)
- Adapt/specify existing rule(s)
Inter-rater agreement 2
• Tool used to calculate inter-rater agreement concerning
- Coreference destination (feature 1): κ = .67
- Coreference type (feature 2): κ = .66
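For orientation, the standard (unadapted) Cohen's kappa is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The slides mention an adapted version without describing the adaptation, so the sketch below shows only the textbook formula; the function name `cohen_kappa` and the toy labels are illustrative assumptions, not the project's tool or data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Standard Cohen's kappa for two equally long lists of labels.

    Hypothetical helper; this is NOT the adapted version used for CESAC.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' label proportions, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy data: coreference types assigned by two raters to four nodes.
a = ["Identity", "Identity", "Inferred", "CrossSpeech"]
b = ["Identity", "Inferred", "Inferred", "CrossSpeech"]
print(cohen_kappa(a, b))
```

The same function could be applied to feature 1 by passing destination node IDs instead of type labels, although with many distinct IDs the chance term becomes very small.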
Example 2
• Clean text fragment with translation
Ant warschipe hire easkeð. Hweonene cumest tu fearlac deaðes munegunge. Ich cume he seið of helle.
And Worship him asked, ‘From where come you, Fearlac, death’s reminder?’ ‘I come’, he said, ‘from hell.’
• Text fragment in CESAC coreference file
170.64 [2031 ant[2033 warschipe][2035 hire] easkeð. Hweonene[2042[2043 ] cumest [2045 tu][2047 fearlac[2049 deaðes munegunge]]] .]
170.65 [2053[2054 Ich] cume[2057[2058 he] seið] of[2063 helle] .]
• Text fragment in Penn-Treebank file
( (IP-MAT (CONJ ant)
          (NP-SBJ (N warschipe))
          (NP-OB1 (PRO hire))
          (VBP easke+d)
          (, .)
          (CP-QUE-SPE (WADVP-1 (WADV Hweonene))
                      (IP-SUB-SPE (ADVP-DIR *T*-1)
                                  (VBP cumest)
                                  (NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>) (PRO tu))
                                  (NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>) (N fearlac)
                                          (NP-PRN (CODE <Coref_Id="49"_Ref="50"_Type="Identity"_NdDist="2"_/>) (N$ dea+des) (N munegunge)))))
          (E_S .))
  (ID CMSAWLES,170.64))
( (IP-MAT-SPE (NP-SBJ (CODE <Coref_Id="347"_Ref="346"_Type="CrossSpeech"_NdDist="9"_/>) (PRO Ich))
              (VBP cume)
              (IP-MAT-PRN (NP-SBJ (CODE <Coref_Id="348"_Ref="347"_Type="CrossSpeech"_NdDist="4"_/>) (PRO he))
                          (VBP sei+d))
              (PP (P of) (NP (NPR helle)))
              (E_S .))
  (ID CMSAWLES,170.65))
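Because every source node points to one destination, the links in this fragment form chains that can be followed back through the text. The sketch below is an illustration under assumptions, not CESAC functionality: the helper names `links` and `chain` are hypothetical, and the fragment string keeps only the annotated nodes from the example above.

```python
import re

# Captures the Id and, if present, the Ref of each Coref tag.
LINK = re.compile(r'Coref_Id="(\d+)"(?:_Ref="(\d+)")?')


def links(treebank_text):
    """Map each source Id to its destination Id (hypothetical helper)."""
    return {src: dst for src, dst in LINK.findall(treebank_text) if dst}


def chain(start, id_to_ref):
    """Follow Ref links from start until an Id has no known link."""
    path, seen = [start], {start}
    while path[-1] in id_to_ref:
        nxt = id_to_ref[path[-1]]
        if nxt in seen:  # guard against accidental cycles
            break
        path.append(nxt)
        seen.add(nxt)
    return path


fragment = '''
(NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>) (PRO tu))
(NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>) (N fearlac))
(NP-SBJ (CODE <Coref_Id="347"_Ref="346"_Type="CrossSpeech"_NdDist="9"_/>) (PRO Ich))
(NP-SBJ (CODE <Coref_Id="348"_Ref="347"_Type="CrossSpeech"_NdDist="4"_/>) (PRO he))
'''
print(chain("348", links(fragment)))
```

Starting from "he" (Id 348), the chain runs through "Ich" (347) and "tu" (346) to Id 345, whose node lies outside the fragment shown here.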
Summary and conclusion
• Annotation program CESAC
- Input: standard Penn-Treebank
- Output: relatively easy to analyse
- Inter-rater agreement measured
• Operationalizing IS
- 4 coreference types
- As objective as possible: facts vs. interpretations
• Plans
- Fixed set of coreference types
- Larger corpus of coreferenced texts