
Error Analysis for Learning-based Coreference Resolution

Olga Uryupina, 27.05.08

Outline

• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies

Coreference Resolution

"This deal means that Bernard Schwartz can focus most of his time on Globalstar and that is a key plus for Globalstar because Bernard Schwartz is brilliant," said Robert Kaimovitz, a satellite communications analyst at Unterberg Harris in New York.

..Globalstar still needs to raise $ 600 million,

and Schwartz said that the company would try..


Machine Learning Approaches

• Soon et al. (2000)
• Cardie & Wagstaff (1999)
• Strube et al. (2002)
• Ng & Cardie (2001-2004)
• ACE competition

Features: Soon et al. (2000)

1. Anaphor is a pronoun
2. Anaphor is a definite NP
3. Anaphor is an NP with a demonstrative pronoun ("this", ..)
4. Antecedent is a pronoun
5. Both markables are proper names
6. Number agreement
7. Gender agreement
8. Alias
9. Appositive
10. Same surface form
11. Semantic class agreement
12. Distance in sentences
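For concreteness, a minimal sketch of how such a mention-pair instance could be computed. The Mention fields and helper names are illustrative assumptions, not Soon et al.'s actual implementation; the alias, appositive and semantic-class-agreement features are omitted because they need extra resources.

# Sketch of a Soon et al.-style mention-pair feature vector.
# Mention fields and feature names are illustrative, not the original code.
from dataclasses import dataclass

@dataclass
class Mention:
    surface: str            # raw string of the markable
    is_pronoun: bool
    is_definite: bool       # e.g. starts with "the"
    is_demonstrative: bool  # e.g. starts with "this", "that"
    is_proper_name: bool
    number: str             # "sg" / "pl"
    gender: str             # "m" / "f" / "n" / "unknown"
    sentence_id: int

def pair_features(antecedent: Mention, anaphor: Mention) -> dict:
    """Boolean/numeric features for one (antecedent, anaphor) candidate pair."""
    return {
        "ana_pronoun": anaphor.is_pronoun,
        "ana_definite": anaphor.is_definite,
        "ana_demonstrative": anaphor.is_demonstrative,
        "ante_pronoun": antecedent.is_pronoun,
        "both_proper_names": antecedent.is_proper_name and anaphor.is_proper_name,
        "number_agreement": antecedent.number == anaphor.number,
        "gender_agreement": antecedent.gender == anaphor.gender
                            or "unknown" in (antecedent.gender, anaphor.gender),
        # same surface form (determiner stripping omitted for brevity)
        "string_match": antecedent.surface.lower() == anaphor.surface.lower(),
        "sentence_distance": anaphor.sentence_id - antecedent.sentence_id,
    }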

Features: other approaches

Cardie & Wagstaff: 11 features
Strube et al.: 17 features (the same standard features + approximate matching (MED))
Ng & Cardie: 53 features (no improvement on the extended feature set; better results (F=63.4) with manual feature selection)
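Strube et al.'s approximate matching is based on minimum edit distance (MED). Below is a sketch of such a matching feature using the standard Levenshtein distance; the normalization into [0, 1] is an assumption for illustration, not their exact formulation.

# Minimum-edit-distance (Levenshtein) matching feature, in the spirit of
# Strube et al.'s approximate matching. Normalization is an assumption.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def med_similarity(m1: str, m2: str) -> float:
    """1.0 = identical strings, 0.0 = completely different."""
    longest = max(len(m1), len(m2)) or 1
    return 1.0 - edit_distance(m1.lower(), m2.lower()) / longest

# e.g. med_similarity("Globalstar", "Globalstar Inc.") is about 0.67,
# while two unrelated names score close to 0.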

Performance: Soon et al.

                              R     P     F
Soon et al.'s system
  C5.0, optimized          56.1  65.5  60.4
Our reimplementation
  C4.5, not optimized      53.5  72.8  61.7
  Ripper                   44.6  74.8  55.9
  SVM                      50.9  68.8  58.5
  MaxEnt                   49.2  64.1  55.7

Performance: Soon et al.

[Figure: learning curve for C5.0 (y-axis 47-63, x-axis 10-30)]

Tricky and easy anaphors

Cristea et al. (2002): state-of-the-art coreference resolution systems have essentially the same performance level

Pronominal anaphora – 80%
Full-scale coreference – 60%

Hypothesis: tricky vs. easy anaphors

Our system

Goal: bridge the gap between theory and practice:

sophisticated linguistic knowledge + a data-driven coreference resolution algorithm

New Features

Different aspects of CR:
• Surface similarity (122 features)
• Syntax (64)
• Semantic compatibility (29)
• Salience (136)
• (Anaphoricity)

More or less sophisticated linguistic theories exist for all these phenomena

Evaluation

Methodology:
• Standard dataset (MUC-7)
• Standard learning set-up
• Compare to Soon et al. (2001)
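The F values reported below are MUC scores, the standard MUC-7 metric. A sketch of the link-based MUC scorer (Vilain et al., 1995) follows, assuming coreference chains are represented as sets of mention ids.

# Sketch of the MUC link-based scorer (Vilain et al., 1995).
def muc_recall(key_chains, response_chains):
    """Recall = sum(|S| - |p(S)|) / sum(|S| - 1) over key chains S,
    where p(S) is the partition of S induced by the response chains."""
    num = den = 0
    for key in key_chains:
        parts = [key & resp for resp in response_chains if key & resp]
        covered = set().union(*parts) if parts else set()
        n_parts = len(parts) + len(key - covered)   # unresolved mentions count as singletons
        num += len(key) - n_parts
        den += len(key) - 1
    return num / den if den else 0.0

def muc_scores(key_chains, response_chains):
    r = muc_recall(key_chains, response_chains)
    p = muc_recall(response_chains, key_chains)     # precision: swap key and response
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: key chain {1,2,3} vs. response {1,2},{3,4}: R = (3 - 2) / (3 - 1) = 0.5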

Performance (F)

                    Basic feature set   Extended feature set
Soon et al., C5.0        60.4                 N/A
C4.5                     61.7                 64.6
SVM                      58.5                 65.4
Ripper                   55.9                 57.5
MaxEnt                   55.7                 59.4

Performance

[Figure: learning curve for SVM (y-axis 50-66, x-axis 10-30)]

Error analysis

Different approaches – same performance:

• Same errors?
• "Tricky anaphors"? (Cristea et al., 2002)

Extensive error analysis needed!

Outline

• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies

Recall errors

                   Errors      %
MUC                    17    3.6
Markables             166   35.4
Propagated P           31    6.6
Pronouns               77   16.4
NE-matching            31    6.6
Syntax                 39    8.3
Nominal anaphora      104   22.2
Total                 469    100

Recall errors - markables

• Auxiliary document parts
• Tokenization
• Modifiers
• Bracketing/labeling

Recall errors - markables

.. there was no requirement for tether to be manufactured in a contaminant-free environment.

A mesmerizing set.

Recall errors - pronouns

1st pl – reconstructing the group:
The retiring Republican chairman of the House Committee on Science want U.S. Businesses to <..> "We need to make it easier for the private sector.." Walker said

3rd sg, 3rd pl – (non-)salience:
[The explanation] for the History Channel's success begin with its association with another channel owned by the same parent consortium.

Recall errors - nominal

Mostly common noun phrases with different heads; WordNet does not help much:

.. a report on the satellites' findings <..> the abilities of U.S. Reconnaissance technology <..> the use of advanced intelligence-gathering tools <..> Remote-sensing instruments..
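To make this concrete, a sketch of the kind of WordNet-based compatibility check implied here, using NLTK's WordNet interface; the path-similarity threshold is an arbitrary assumption. As the slide says, such lexical checks recover few of these corpus pairs.

# WordNet-based check for whether two nominal heads could be compatible
# (e.g. "report" vs. "findings"). The threshold is an illustrative assumption.
from itertools import product
from nltk.corpus import wordnet as wn

def heads_compatible(head1: str, head2: str, threshold: float = 0.2) -> bool:
    syns1 = wn.synsets(head1, pos=wn.NOUN)
    syns2 = wn.synsets(head2, pos=wn.NOUN)
    if not syns1 or not syns2:
        return False                      # unknown word: no evidence of compatibility
    best = max((s1.path_similarity(s2) or 0.0) for s1, s2 in product(syns1, syns2))
    return best >= threshold

# Contextual bridges like "findings" vs. "instruments" stay out of reach of
# this kind of lexical similarity, which is exactly the problem noted above.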

Precision errors

                   Errors      %
MUC                    30    7.4
Markables              76   18.6
Pronouns               78   19.1
NE-matching            20    4.9
Syntax                 22    5.4
Nominal anaphora      182   44.6
Total                 408    100

Precision errors - pronouns

• Incorrect parsing/tagging:
  Two key vice presidents, [Wei Yen] and Eric Carlson, are leaving to start their own Silicon Valley companies.
• (non-)salience
• matching (propagated R)

Precision errors - nominal

Mostly same-head descriptions. Possible solutions:
• modifiers?
• anaphoricity detectors?

P errors – nominal - modifiers

Idea: "red car" cannot corefer with "blue car"

Problem: list of mutually incompatible properties?

MUC-7 test data:
  incompatible modifiers      30
  "new" mod for anaphora      15
  compatible modifiers        58
  no modifiers                62
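A toy sketch of the modifier-incompatibility idea; the hand-built attribute classes below stand in for the "list of mutually incompatible properties" that is hard to obtain in practice.

# Toy modifier-incompatibility filter: block coreference between "red car"
# and "blue car". The attribute classes are illustrative assumptions.
INCOMPATIBLE_SETS = [
    {"red", "blue", "green", "black", "white"},       # colour
    {"first", "second", "third", "last"},             # ordinals
    {"american", "french", "japanese", "taiwanese"},  # nationality
]

def modifiers_incompatible(mods1, mods2) -> bool:
    """True if the two NPs carry different values of the same attribute."""
    m1, m2 = {m.lower() for m in mods1}, {m.lower() for m in mods2}
    for attr in INCOMPATIBLE_SETS:
        v1, v2 = m1 & attr, m2 & attr
        if v1 and v2 and v1 != v2:
            return True
    return False

# modifiers_incompatible(["red"], ["blue"])  -> True
# modifiers_incompatible(["red"], ["fast"])  -> False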

P errors – nominal – discourse-new (dnew)

Idea: identify and discard unlikely anaphors

Problem: even a very good detector does not help

Outline

• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies

Discussion – Errors

Problematic areas:
• Data
• Preprocessing modules
• Features
• Resolution strategy

Discussion - Data

• bigger corpus
• more uniform document selection, text only
• better definition of COREF
• better scoring

Discussion - Preprocessing

• local improvements (e.g. appositions)

• probabilistic architecture to neutralize errors

Discussion - Features

• feature selection
• ensemble learning
• more targeted learning for under-represented phenomena (abbreviations)

Discussion - Resolution

• less local: move to the chain level
• less uniform: specific treatment for different types of anaphors

Discussion – Conclusion

• ML approaches to coreference resolution yield similar performance values
• Some anaphors are indeed tricky (especially crucial for precision errors)
• But some errors can be eliminated within an ML framework:
  – improving the training material
  – elaborated integration of preprocessing modules
  – more global resolution strategies

Thank You!


Recall errors - MUC

Mainly incorrect bracketing

..said <COREF .. MIN="vice president">Jim Johannesen, <COREF .. MIN="vice president">vice president of site development for McDonald's</COREF></COREF>..

Only clear typos etc. are considered MUC errors

Recall errors – propagated P

The company also said the Marine Corps has begun testing two of [its radars] as part of a short-range ballistic missile defense program. That testing could lead to an order for the radars.

Crucial for pronouns and for indicators of intrasentential coreference

Recall errors - matching

Mostly ORGANIZATIONs. Problems:
• Abbreviations:
  Federal Communications Commission / FCC
• Hyphenated names:
  Ziff-Davis Publishing / Ziff
• Foreign names:
  Taiwan President Lee Teng-hui / President Lee
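Illustrative alias heuristics for the three ORGANIZATION problems above (acronyms, hyphenated names, shortened names); these rules are a sketch, not the system's actual matcher.

# Sketch of alias heuristics for ORGANIZATION matching.
def acronym(name: str) -> str:
    stop = {"of", "and", "the", "for"}
    return "".join(w[0].upper() for w in name.split() if w.lower() not in stop)

def org_alias(full: str, short: str) -> bool:
    full_l, short_l = full.lower(), short.lower()
    if short_l == full_l:
        return True
    if short.upper() == acronym(full):                  # acronyms: FCC
        return True
    if short_l in full_l.replace("-", " ").split():     # hyphenated/partial: Ziff
        return True
    return False

# org_alias("Federal Communications Commission", "FCC")  -> True
# org_alias("Ziff-Davis Publishing", "Ziff")             -> True
# Person names ("President Lee Teng-hui" vs. "President Lee") need separate
# name-structure handling.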

Recall errors - syntax

Apposition, copula. Problems:
• Parsing mistakes
• Missing constructions:
  ..the venture will become synonymous with JSkyB
• P/R trade-off:
  ..Kevlar, a synthetic fiber, and Nomex..
• Quantitative constructions:
  ..More than quadruple the three-month daily average of 88,700 shares


Precision errors - matching

Finer NE analysis could help, but mostly too difficult even for humans:
  Loral
  Loral Space and Communications Corp
  Loral Space
  Space Systems Loral

Anaphoricity

Some markables are not anaphors. We can tell that by looking at them, without any sophisticated coreference resolution.

Poesio & Vieira and Ng & Cardie try to identify discourse-new entities automatically

Not used for this talk
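A heuristic discourse-new filter in the spirit of Poesio & Vieira and Ng & Cardie might look as follows; the rules are illustrative assumptions, and (as noted) no such filter was used for the experiments in this talk.

# Sketch of a heuristic discourse-new (non-anaphoric) detector.
def likely_discourse_new(np_tokens, seen_heads) -> bool:
    """True if this NP probably introduces a new entity."""
    first = np_tokens[0].lower()
    head = np_tokens[-1].lower()          # naive head = last token
    if first in {"a", "an"}:              # indefinites are rarely anaphoric
        return True
    if head not in seen_heads and first not in {"the", "this", "that",
                                                "these", "those"}:
        return True                       # bare NP with an unseen head
    return False

# Used as a filter: markables flagged as discourse-new are not paired with
# antecedents. As the precision-error slide notes, even a good detector
# removes only part of the spurious links.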

