Coreferencing Treebank data using CESAC

Annotating and analysing IS in corpora of historical English
Berlin, 13-14 November 2009



Overview

• CESAC

- Goals

- Coreference types: operationalizing IS

- Input and output

• Inter-rater agreement

• Example

• Summary and conclusion

Coreferencing using Cesac: Contents

CESAC goals

• Overall goal

- Referring from any one constituent to any other constituent

• More specifically

- Source and destination: IP/phrase/node or DP/lexeme/endnode

- Attributes

1. Coreference type

2. Distance measure

3. (NP type: definite, indefinite, etc.)

4. (Animacy)


CESAC coreference types: operationalizing IS

• Two basic rules

- Do not omit possible coreference information

- A source should be linked to the nearest possible destination

• Labels encoding different forms of anaphoricity

- Identity

‘Jacqueline plays the cello. She is an amazing musician’

- Cross Speech

‘John said to Paul: “Why don’t you play the guitar?”’

- Inferred

‘Do you see that house? They say the kitchen is extremely spacious’

- World knowledge (separate category)

‘According to Burt Reynolds, all dogs go to heaven’

• Cross Speech > Identity > Inferred

• Encoding facts vs. encoding interpretations: objective data


CESAC input 1

• Standard Penn-Treebank format

- Collection of <nodes>

- Each <node> consists of

1. Brackets: (…)

2. Label: (NP …)

3. Other node: (NP (N …) )

4. Lexeme: (N man)

5. Possibly <lexeme>+<node>: (P to (NP him))

- Attributes in label

(NP-ACC (PRO^A hine))

- Extra-textual data in CODE nodes

(CODE <TEXT: +tyl+aste>)
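The node grammar listed above can be read into a small tree structure. The following is a minimal sketch in Python, not CESAC's own reader: the names `Node`, `parse_trees` and `leaves` are hypothetical, and the sketch assumes that anything between `<` and `>` (as in CODE nodes) should be kept together as a single lexeme.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

# One token is a bracket, an angle-bracketed CODE payload, or a bare word.
TOKEN = re.compile(r"\(|\)|<[^>]*>|[^\s()]+")


@dataclass
class Node:
    label: str = ""                      # e.g. "NP-ACC"; empty for the unlabelled wrapper
    lexeme: Optional[str] = None         # e.g. "hine"; None for purely phrasal nodes
    children: List["Node"] = field(default_factory=list)


def parse_trees(text: str) -> List[Node]:
    """Parse every top-level bracketed tree found in `text`."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read_node() -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = Node()
        # The outer wrapper "( (IP-MAT ...) (ID ...))" has no label of its own.
        if tokens[pos] not in ("(", ")"):
            node.label = tokens[pos]
            pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read_node())
            else:
                # Bare word inside a node: the lexeme (joined if there are several).
                word = tokens[pos]
                node.lexeme = word if node.lexeme is None else node.lexeme + " " + word
                pos += 1
        pos += 1  # consume the matching ")"
        return node

    trees = []
    while pos < len(tokens):
        trees.append(read_node())
    return trees


def leaves(node: Node) -> Iterator[str]:
    """Yield lexemes in text order; covers the <lexeme>+<node> case as well."""
    if node.lexeme is not None:
        yield node.lexeme
    for child in node.children:
        yield from leaves(child)
```

For instance, `list(leaves(parse_trees("(P to (NP him))")[0]))` gives `["to", "him"]`, which is the `<lexeme>+<node>` case from item 5.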


[Slide callouts: EndNode, Node]

CESAC input 2


( (CODE <T06080009600,11.4>)
  (IP-MAT (CONJ And)
          (NP-NOM (D^N +t+at) (N^N folc))
          (NP-ACC (PRO^A hine))
          (ADVP-TMP (ADV^T +ta))
          (PP (P mid)
              (NP-DAT (ADJ^D unasecgendlicre) (N^D wur+dmynte)))
          (PP (P to)
              (NP-DAT (N^D scipe)))
          (VBDI gel+addon)
          (. ,))
  (ID coapollo,ApT:11.4.183))

( (IP-MAT (CONJ and)
          (NP-NOM (NR^N Apollonius))
          (NP-ACC-1 (PRO^A hi))
          (VBDI b+ad)
          (IP-INF (NP-ACC-SBJ *ICH*-1)
                  (QP-ACC (Q^A ealle))
                  (VB $gretan)))
  (ID coapollo,ApT:11.4.184))

[Slide callouts: matching brackets, Label]

CESAC output 1

• Penn-Treebank format

• Enriched with coreference information

- Source node ID

- Destination node ID

- Coreference type

- Coreference distance – derivable

• Destination node example

(NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))

• Source node example

(NP-OB1

(CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>)

(PRO hem) )
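The attribute string inside such a CODE node can be picked apart with a regular expression. This is a minimal sketch that assumes the tag always has the shape shown in the two examples above (attributes joined by underscores, values in straight double quotes). `CorefTag` and `parse_coref_tag` are hypothetical names, not part of CESAC.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Attribute names are letters only, so the "_" separators are not swallowed.
ATTR = re.compile(r'([A-Za-z]+)="([^"]*)"')


@dataclass
class CorefTag:
    node_id: int                     # Coref_Id of the node carrying the tag
    ref: Optional[int] = None        # Id of the destination node, if this is a source
    type: Optional[str] = None       # e.g. "Identity", "CrossSpeech"
    nd_dist: Optional[int] = None    # NdDist value, if present


def parse_coref_tag(code_text: str) -> Optional[CorefTag]:
    """Return the coreference attributes of a CODE payload, or None if it has none."""
    if "<Coref_" not in code_text:
        return None
    attrs = dict(ATTR.findall(code_text))
    if "Id" not in attrs:
        return None
    return CorefTag(
        node_id=int(attrs["Id"]),
        ref=int(attrs["Ref"]) if "Ref" in attrs else None,
        type=attrs.get("Type"),
        nd_dist=int(attrs["NdDist"]) if "NdDist" in attrs else None,
    )
```

A destination node yields a tag with only `node_id` set, while a source node also carries `ref`, `type` and `nd_dist`.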


CESAC output 2


[Figure: tree diagrams of the two example nodes, shown without and with coreference annotation. Destination node: (NP-SBJ (NPR Crist)) gains a CODE child <Coref Id="310">. Source node: (NP-OB1 (PRO hem)) gains a CODE child <Coref Id="340" Ref="310" Type="Identity">.]

<node> = one-or-more <node> OR <lexeme>

<lexeme> + <node>

CESAC output 3


( (IP-MAT (CONJ and)
          (NP-NOM *con* (CODE <Coref_Id="1488"_Ref="1489"_Type="Identity"_NdDist="12"_/>))
          (VBD l+adde)
          (NP-ACC (CODE <Coref_Id="1476"_Ref="1477"_Type="Identity"_NdDist="8"_/>) (PRO^A hine))
          (PP (P mid)
              (NP-DAT-RFL (CODE <Coref_Id="1487"_Ref="1488"_Type="Identity"_NdDist="6"_/>) (PRO^D him)))
          (PP (P to)
              (NP-DAT (PRO$ his (CODE <Coref_Id="1486"_Ref="1487"_Type="Identity"_NdDist="5"_/>))
                      (N^D huse))))
  (ID coapollo,ApT:12.16.209))

CESAC output 4


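[Figure: tree diagram of a source node, shown without and with coreference annotation. Within (PP (P to) (NP-DAT ...)), the node PRO$ "his" gains a CODE child <Coref Id="20" Ref="21" Type="Identity">.]

<node> = one-or-more <node> OR <node> <lexeme> OR <lexeme>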

Inter-rater agreement 1

• Two features measured

- Coreference destination (node ID)

- Coreference type

• Adapted version of Cohen’s kappa: κ > .6

• Two important problems

- Identity vs. cross speech

- Omission of link

• Solutions

- Create new rule(s)

- Adapt/specify existing rule(s)


Inter-rater agreement 2

• Tool used to calculate inter-rater agreement concerning

- Coreference destination (feature 1): κ = .67

- Coreference type (feature 2): κ = .66
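The slides report an adapted version of Cohen's kappa without spelling out the adaptation. For orientation only, the standard, unadapted statistic is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from each rater's label frequencies. The sketch below computes that standard form; the function name and the toy label lists are illustrative and are not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Standard (unadapted) Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example with coreference-type labels (invented for illustration).
a = ["Identity", "Identity", "CrossSpeech", "Inferred", "Identity", "CrossSpeech"]
b = ["Identity", "CrossSpeech", "CrossSpeech", "Inferred", "Identity", "Identity"]
kappa = cohen_kappa(a, b)  # p_o = 4/6, p_e = 14/36, so kappa = 5/11
```

The same function could be applied to destination node IDs (feature 1) by passing the chosen IDs as labels, although how the adapted measure handles omitted links is not stated in the slides.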


Example 1


Example 2


• Clean text fragment with translation

Ant warshipe hire easkeđ. Hweonene cumest tu fearlac deađes munegunge. Ich cume he seiđ of helle.

And Worship him asked, ‘From where come you, Fearlac, death’s reminder?’ ‘I come’, he said, ‘from hell.’

• Text fragment in CESAC coreference file

170.64 [2031 ant[2033 warschipe][2035 hire] easkeđ. Hweonene[2042[2043 ] cumest [2045 tu][2047 fearlac[2049 deađes munegunge]]] .]
170.65 [2053[2054 Ich] cume[2057[2058 he] seiđ] of[2063 helle] .]

• Text fragment in Penn-Treebank file

( (IP-MAT (CONJ ant)
          (NP-SBJ (N warschipe))
          (NP-OB1 (PRO hire))
          (VBP easke+d)
          (, .)
          (CP-QUE-SPE (WADVP-1 (WADV Hweonene))
                      (IP-SUB-SPE (ADVP-DIR *T*-1)
                                  (VBP cumest)
                                  (NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>) (PRO tu))
                                  (NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>)
                                          (N fearlac)
                                          (NP-PRN (CODE <Coref_Id="49"_Ref="50"_Type="Identity"_NdDist="2"_/>)
                                                  (N$ dea+des)
                                                  (N munegunge)))))
          (E_S .))
  (ID CMSAWLES,170.64))

( (IP-MAT-SPE (NP-SBJ (CODE <Coref_Id="347"_Ref="346"_Type="CrossSpeech"_NdDist="9"_/>) (PRO Ich))
              (VBP cume)
              (IP-MAT-PRN (NP-SBJ (CODE <Coref_Id="348"_Ref="347"_Type="CrossSpeech"_NdDist="4"_/>) (PRO he))
                          (VBP sei+d))
              (PP (P of) (NP (NPR helle)))
              (E_S .))
  (ID CMSAWLES,170.65))
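Because every source node points to one destination, the links in this fragment can be followed step by step to recover a coreference chain. The sketch below is one possible way to do that, assuming the tag layout seen in the fragment with straight double quotes. `collect_links` and `chain_from` are hypothetical helper names, and the tag strings are copied from the Penn-Treebank fragment above.

```python
import re
from typing import Dict, List, Tuple

# Matches source tags only: destination-only tags have no Ref/Type and are skipped.
SOURCE_TAG = re.compile(r'<Coref_Id="(\d+)"_Ref="(\d+)"_Type="(\w+)"')


def collect_links(treebank_text: str) -> Dict[int, Tuple[int, str]]:
    """Map each source node Id to (destination Id, coreference type)."""
    return {
        int(node_id): (int(ref), ctype)
        for node_id, ref, ctype in SOURCE_TAG.findall(treebank_text)
    }


def chain_from(node_id: int, links: Dict[int, Tuple[int, str]]) -> List[int]:
    """Follow Ref pointers from `node_id` until a node without an outgoing link."""
    path = [node_id]
    seen = {node_id}
    while path[-1] in links:
        nxt = links[path[-1]][0]
        if nxt in seen:  # guard against accidental cycles in the annotation
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# The five source tags of sentences 170.64 and 170.65.
fragment = '''
(NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>) (PRO tu))
(NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>) (N fearlac)
(NP-PRN (CODE <Coref_Id="49"_Ref="50"_Type="Identity"_NdDist="2"_/>) (N$ dea+des)
(NP-SBJ (CODE <Coref_Id="347"_Ref="346"_Type="CrossSpeech"_NdDist="9"_/>) (PRO Ich))
(NP-SBJ (CODE <Coref_Id="348"_Ref="347"_Type="CrossSpeech"_NdDist="4"_/>) (PRO he))
'''
links = collect_links(fragment)
he_chain = chain_from(348, links)        # he -> Ich -> tu -> node 345
munegunge_chain = chain_from(49, links)  # deades munegunge -> fearlac -> tu -> node 345
```

Both chains end at node 345, which lies outside the fragment shown, so this sketch cannot say what that node is.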

Summary and conclusion

• Annotation program CESAC

- Input: standard Penn-Treebank

- Output: relatively easy to analyse

- Inter-rater agreement measured

• Operationalizing IS

- 4 coreference types

- As objective as possible: facts vs. interpretations

• Plans

- Fixed set of coreference types

- Larger corpus of coreferenced texts


Thank you for your attention!
