Entity-Centric Coreference Resolution with Model Stacking



Page 1: Entity-Centric Coreference Resolution with Model Stacking

Kevin Clark and Christopher D. Manning (ACL-IJCNLP 2015)
(Tables are taken from the above-mentioned paper)

Presented by Mamoru Komachi <[email protected]>
ACL 2015 Reading Group @ Tokyo Institute of Technology
August 26th, 2015

Page 2

Entity-level information allows early coreference decisions to inform later ones.

Entity-centric coreference systems build up coreference clusters incrementally (Raghunathan et al., 2010; Stoyanov and Eisner, 2012; Ma et al., 2014).

Hillary Clinton files for divorce from Bill Clinton ahead of her campaign for the 2016 presidency.
...
Clinton is confident that her poll numbers will skyrocket once the divorce is final.
?!?

Page 3

Problem: how to build up clusters effectively?

Model stacking
- Two mention-pair models: a classification model and a ranking model
- Generates cluster features for clusters of mentions

Imitation learning
- Assigns exact costs to actions based on coreference evaluation metrics
- Uses the scores of the pairwise models to reduce the search space

Page 4

Mention Pair Models: the previous approach, using local information

Page 5

Two models for predicting whether a given pair of mentions belongs to the same coreference cluster:
- Classification model: are the two mentions coreferent?
- Ranking model: which candidate antecedent best suits the mention?

Bill arrived, but nobody saw him. I talked to him on the phone.

Page 6

Logistic classifiers for the classification model
- M: set of all mentions in the training set
- T(m): set of true antecedents of a mention m
- F(m): set of false antecedents of m
- Considers each pair of mentions independently
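A minimal sketch of this objective under the definitions above (the raw pair scores and the flattening of T(m)/F(m) into lists are assumed inputs, not the paper's code): each mention pair contributes an independent logistic log-loss.

```python
import math

def sigmoid(z):
    # Logistic function mapping a raw pair score to a probability.
    return 1.0 / (1.0 + math.exp(-z))

def classification_loss(scores_true, scores_false):
    """Negative log-likelihood of independent pairwise decisions.

    scores_true:  raw scores s(a, m) for pairs where a is in T(m)
    scores_false: raw scores s(a, m) for pairs where a is in F(m)
    """
    loss = 0.0
    for z in scores_true:      # true antecedents should get probability -> 1
        loss -= math.log(sigmoid(z))
    for z in scores_false:     # false antecedents should get probability -> 0
        loss -= math.log(1.0 - sigmoid(z))
    return loss
```

Because each pair is scored in isolation, nothing in this loss ties together decisions about the same entity; that is the gap the entity-centric model addresses.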

Page 7

Logistic classifiers for the ranking model
- Considers all candidate antecedents simultaneously
- Max-margin training encourages the model to find the single best antecedent for a mention, but the resulting scores are not robust inputs for a downstream clustering model
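A simplified sketch of a max-margin ranking loss for one mention (the hinge form below is illustrative, not the paper's exact slack-rescaled objective; the score lists are assumed inputs):

```python
def ranking_loss(scores_true, scores_false, margin=1.0):
    # The best-scoring true antecedent should beat every false
    # antecedent by at least `margin`; otherwise we pay the violation.
    best_true = max(scores_true)
    best_false = max(scores_false)
    return max(0.0, margin + best_false - best_true)
```

Because only the single best true antecedent matters, the trained scores tend to be peaked rather than calibrated, which is one reason they are less robust as features for a downstream clustering model.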

Page 8

Features for the mention-pair models
- Distance features: distance between the two mentions, measured in sentences or in mentions
- Syntactic features: number of embedded NPs under a mention; POS tags of the first, last, and head words
- Semantic features: named entity type, speaker identification
- Rule-based features: exact and partial string matching
- Lexical features: the first, last, and head words of the current mention

Page 9

Entity-Centric Coreference Model: the proposed approach, using cluster features

Page 10

The entity-centric model can exhibit high coherency.

Best-first clustering (Ng and Cardie, 2002)
- Assigns as antecedent the most probable preceding mention classified as coreferent with the current mention
- Relies only on local information

Entity-centric model (this work)
- Operates on pairs of clusters instead of pairs of mentions
- Builds up coreference chains with agglomerative clustering, merging two clusters when it predicts they represent the same entity

Page 11

Inference
- Reduces the search space by thresholding the mention-pair model scores
- Sorts the candidate pairs P to perform easy-first clustering
- s is a scoring function that makes the binary decision for each merge action
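The inference steps above can be sketched as follows (a reading of the slide, with `pair_prob`, `merge_score`, and the threshold as assumed interfaces rather than the paper's code):

```python
def easy_first_clustering(mentions, pair_prob, threshold, merge_score):
    # Start from singleton clusters, one per mention.
    clusters = [{m} for m in mentions]
    # Keep only pairs whose mention-pair probability clears the threshold
    # (search-space reduction), then sort so the most confident
    # ("easiest") pairs are considered first.
    pairs = [(m1, m2)
             for i, m1 in enumerate(mentions)
             for m2 in mentions[i + 1:]
             if pair_prob(m1, m2) > threshold]
    pairs.sort(key=lambda p: pair_prob(*p), reverse=True)
    for m1, m2 in pairs:
        c1 = next(c for c in clusters if m1 in c)
        c2 = next(c for c in clusters if m2 in c)
        # s makes the binary merge decision between the two clusters.
        if c1 is not c2 and merge_score(c1, c2) > 0:
            clusters.remove(c2)
            c1 |= c2
    return clusters
```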

Page 12

Learning the entity-centric model with imitation learning
- Sequential prediction problem: future observations depend on previous actions
- Imitation learning (here, DAgger (Ross et al., 2011)) is well suited to this setting (Argall et al., 2009)
- Training the agent on the gold labels alone assumes all previous decisions were correct, which is problematic in coreference, where the error rate is quite high
- DAgger exposes the system at train time to states similar to the ones it will face at test time

Page 13

Learning the cluster-merging policy with DAgger (Ross et al., 2011)
- Iterative algorithm that aggregates a dataset D of states and the actions performed by the expert policy in those states
- b controls the probability of following the expert's policy rather than the current policy, and decays exponentially as the iteration number increases
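The DAgger loop on this slide can be sketched schematically (all helpers here, the episode runner, the expert, and the trainer, are assumed interfaces, not the paper's code):

```python
import random

def dagger(initial_policy, expert_policy, run_episode, train, n_iters, decay=0.5):
    D = []                       # aggregated dataset of (state, expert action)
    policy = initial_policy
    for i in range(n_iters):
        b = decay ** i           # prob. of following the expert; decays per iteration
        mixed = lambda s: expert_policy(s) if random.random() < b else policy(s)
        for state in run_episode(mixed):
            # Label every visited state with what the expert would do there.
            D.append((state, expert_policy(state)))
        policy = train(D)        # fit the next policy on everything seen so far
    return policy
```

Rolling out the mixed policy is what exposes training to the (possibly erroneous) states the learned policy will actually reach at test time.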

Page 14

Adding costs to actions: directly tune to optimize coreference metrics
- Merging clusters influences the final score, and the order of merge operations also matters
- How will a particular local decision affect the final score of the coreference system?
- Problem: standard coreference metrics do not decompose over clusters
- Answer: roll out the actions from the current state
- A(s): the set of actions that can be taken from state s
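One way to read "rolling out": score an action by completing the clustering from the resulting state and evaluating the final clustering (a sketch; every helper is an assumed interface, and the reference policy stands in for the expert):

```python
def action_cost(state, action, actions_from, apply_action, reference_policy, metric):
    # Take the candidate action, ...
    s = apply_action(state, action)
    # ... let the reference policy finish the remaining decisions, ...
    while actions_from(s):
        s = apply_action(s, reference_policy(s))
    # ... and charge the action the negated final metric (e.g. B-cubed),
    # so actions leading to worse final clusterings cost more.
    return -metric(s)
```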

Page 15

Cluster features for the classification and ranking models

Between-cluster features
- Minimum and maximum probability of coreference
- Average probability and average log probability of coreference
- Average probability and average log probability of coreference for each pair of mention grammar types (pronoun or not)
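A sketch of how such between-cluster features could be computed from a mention-pair model (`pair_prob` is an assumed probability function, not the paper's code):

```python
import math

def cluster_pair_features(c1, c2, pair_prob):
    # Coreference probabilities for every cross-cluster mention pair.
    probs = [pair_prob(m1, m2) for m1 in c1 for m2 in c2]
    return {
        "min": min(probs),
        "max": max(probs),
        "avg": sum(probs) / len(probs),
        "avg_log": sum(math.log(p) for p in probs) / len(probs),
    }
```

The grammar-type variants restrict the same statistics to cross-cluster pairs of a given type (e.g. pronoun vs. non-pronoun).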

Page 16

Only 56 features for the entity-centric model

State features
- Whether a preceding mention pair in the list of mention pairs has the same candidate anaphor as the current one
- The index of the current mention pair in the list divided by the size of the list (what percentage of the list have we seen so far?)

The entity-centric model does not rely on sparse lexical features. Instead, it employs model stacking to exploit strong features: the scores learned by the pairwise models.

Page 17

Results and discussion: CoNLL 2012 English coreference task

Page 18

Experimental setup: CoNLL 2012 Shared Task
- English portion of OntoNotes
- Training: 2,802 documents; development: 343; test: 345
- Uses the provided preprocessing (parse trees, named entities, etc.)
- Common evaluation metrics: MUC, B3, and CEAFe, plus CoNLL F1 (the average F1 of the three metrics), computed with the CoNLL scorer version 8.01
- Rule-based mention detection (Raghunathan et al., 2010)

Page 19

Results: the entity-centric model outperforms best-first clustering with both the classification and the ranking model.

Page 20

The entity-centric model beats other state-of-the-art coreference models.
- This work primarily optimizes for the B3 metric during training
- State-of-the-art systems use latent antecedents to learn scoring functions over mention pairs, but are trained to maximize global objective functions

Page 21

The entity-centric model directly learns a coreference model that maximizes an evaluation metric.

Post-processing of mention-pair and ranking models
- Closest-first clustering (Soon et al., 2001)
- Best-first clustering (Ng and Cardie, 2002)

Global inference models
- Integer linear programming (Denis and Baldridge, 2007; Finkel and Manning, 2008)
- Graph partitioning (McCallum and Wellner, 2005; Nicolae and Nicolae, 2006)
- Correlation clustering (McCallum and Wellner, 2003; Finley and Joachims, 2005)

Page 22

Previous approaches do not directly tune against coreference metrics.

Non-local entity-level information
- Cluster models (Luo et al., 2004; Yang et al., 2008; Rahman and Ng, 2011)
- Joint inference (McCallum and Wellner, 2003; Culotta et al., 2006; Poon and Domingos, 2008; Haghighi and Klein, 2010)

Learning over trajectories of decisions
- Imitation learning (Daumé et al., 2005; Ma et al., 2014)
- Structured perceptron (Stoyanov and Eisner, 2012; Fernandes et al., 2012; Björkelund and Kuhn, 2014)

Page 23

Summary
- Proposed an entity-centric coreference model that uses the scores produced by mention-pair models as features
- The cluster-merging policy is trained with costs derived from standard coreference metrics
- Imitation learning is used to learn how to build up coreference chains incrementally
- The proposed model outperforms the commonly used best-first method and the current state of the art