Unsupervised Models for Coreference Resolution
Vincent Ng
Human Language Technology Research Institute, University of Texas at Dallas


Page 1: Unsupervised Models for Coreference Resolution

Unsupervised Models for Coreference Resolution

Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas

Pages 2–3: Unsupervised Models for Coreference Resolution

Plan for the Talk

Supervised learning for coreference resolution
- how and when supervised coreference research started
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Pages 4–5: Unsupervised Models for Coreference Resolution

Machine Learning for Coreference Resolution

- started in the mid-1990s: Connolly et al. (1994), Aone and Bennett (1995), McCarthy and Lehnert (1995)
- propelled by the availability of annotated corpora produced by
  - the Message Understanding Conferences (MUC-6/7: 1995, 1998): English only
  - Automatic Content Extraction (ACE 2003, 2004, 2005, 2008): English, Chinese, Arabic
- identified as an important task for information extraction
- identity coreference only

Pages 6–12: Unsupervised Models for Coreference Resolution

Identity Coreference

Identify the noun phrases (or mentions) that refer to the same real-world entity:

Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment...

Lots of prior work on supervised coreference resolution.

Pages 13–14: Unsupervised Models for Coreference Resolution

Standard Supervised Learning Approach

Classification: a classifier is trained to determine whether two mentions are coreferent or not coreferent.

[Queen Elizabeth] set about transforming [her] [husband], ...

For each pair of mentions, e.g. (Queen Elizabeth, her), (Queen Elizabeth, husband), (her, husband), the classifier decides: coref or not coref?
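To make the classification setup concrete, here is a minimal sketch of how labeled mention-pair instances could be built from gold entity annotations; the data structures are illustrative placeholders, not the talk's implementation:

```python
from itertools import combinations

def make_pair_instances(mentions, entity_of):
    """One labeled instance per mention pair; entity_of maps a mention
    to its gold entity id (both arguments stand in for richer mention
    objects in a real system)."""
    return [
        ((m_i, m_j), "coref" if entity_of[m_i] == entity_of[m_j] else "not coref")
        for m_i, m_j in combinations(mentions, 2)
    ]

print(make_pair_instances(
    ["Queen Elizabeth", "her", "husband"],
    {"Queen Elizabeth": 1, "her": 1, "husband": 2},
))
# [(('Queen Elizabeth', 'her'), 'coref'),
#  (('Queen Elizabeth', 'husband'), 'not coref'),
#  (('her', 'husband'), 'not coref')]
```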

Pages 15–17: Unsupervised Models for Coreference Resolution

Standard Supervised Learning Approach

Clustering: coordinates possibly contradictory pairwise classification decisions.

[Diagram: pairwise decisions over "[Queen Elizabeth] set about transforming [her] [husband] ..." (Queen Elizabeth/her: coref; Queen Elizabeth/husband: not coref; her/husband: not coref) feed into a clustering algorithm, which outputs entity clusters such as {Queen Elizabeth, her}, {husband, King George VI, the King, his}, and {Logue, a renowned speech therapist}.]

Page 18: Unsupervised Models for Coreference Resolution

Standard Supervised Learning Approach

Typically relies on a large amount of labeled data. What if we only have a small amount of annotated data?

Pages 19–20: Unsupervised Models for Coreference Resolution

First Attempt: Supervised Learning

- train on whatever annotated data we have
- need to specify
  - a learning algorithm (Bayes)
  - a feature set
  - a clustering algorithm (Bell tree)

Pages 21–25: Unsupervised Models for Coreference Resolution

The Bayes Classifier

Finds the class value y (Coref or Not Coref) that is the most probable given the feature vector x_1, ..., x_n; that is, it finds y* such that

y^* = \arg\max_{y \in Y} P(y \mid x_1, x_2, \ldots, x_n) = \arg\max_{y \in Y} P(y)\, P(x_1, x_2, \ldots, x_n \mid y)

What features to use in the feature representation?

Pages 26–30: Unsupervised Models for Coreference Resolution

Linguistic Features

Use 7 linguistic features divided into 3 groups:

Strong Coreference Indicators
- String match
- Appositive
- Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
- Gender agreement
- Number agreement
- Semantic compatibility

Mention Pair Type
- (t_i, t_j), where t_i, t_j ∈ {Pronoun, Name, Nominal}
- e.g., for the mention pair (Barack Obama, president-elect), the feature value is (Name, Nominal)
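As a concrete illustration, here is a minimal sketch of what such a feature extractor might look like. The Mention representation and the alias, appositive, and semantic-compatibility predicates are hypothetical stand-ins, not the talk's implementation:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str       # surface string
    mtype: str      # 'Pronoun', 'Name', or 'Nominal'
    gender: str     # 'masc', 'fem', 'neut', or 'unknown'
    number: str     # 'sg', 'pl', or 'unknown'

def agree(a, b):
    # Treat 'unknown' as compatible with anything.
    return a == b or "unknown" in (a, b)

def is_alias(a, b):
    # Crude stand-in: acronym test only; real alias detectors do much more.
    acro = lambda s: "".join(w[0] for w in s.split() if w[:1].isupper())
    return acro(a.text) == b.text or acro(b.text) == a.text

def mention_pair_features(m_i, m_j):
    """The 7 feature values for a mention pair, in the 3 groups above."""
    return {
        "string_match": m_i.text.lower() == m_j.text.lower(),
        "appositive": False,             # placeholder: needs parse information
        "alias": is_alias(m_i, m_j),
        "gender_agreement": agree(m_i.gender, m_j.gender),
        "number_agreement": agree(m_i.number, m_j.number),
        "semantic_compatibility": True,  # placeholder: needs a semantic lexicon
        "mention_pair_type": (m_i.mtype, m_j.mtype),
    }
```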

Pages 31–41: Unsupervised Models for Coreference Resolution

The Bayes Classifier

With the 7 features, the classifier finds the class value y (Coref or Not Coref) that is the most probable given the feature vector x_1, ..., x_7; that is, it finds y* such that

y^* = \arg\max_{y \in Y} P(y \mid x_1, \ldots, x_7) = \arg\max_{y \in Y} P(y)\, P(x_1, x_2, \ldots, x_7 \mid y)

But we may have a data sparseness problem, so let's simplify the term P(x_1, ..., x_7 | y): assume that feature values from different groups are independent of each other given the class. Then

y^* = \arg\max_{y \in Y} P(y)\, P(x_1, x_2, x_3 \mid y)\, P(x_4, x_5, x_6 \mid y)\, P(x_7 \mid y)

These factors are the model parameters, to be estimated from annotated data using maximum likelihood estimation.

This is a generative model: it specifies how an instance is generated.
- Generate the class y with P(y).
- Given y, generate x_1, x_2, and x_3 with P(x_1, x_2, x_3 | y).
- Given y, generate x_4, x_5, and x_6 with P(x_4, x_5, x_6 | y).
- Given y, generate x_7 with P(x_7 | y).
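A minimal sketch of this group-factored Bayes classifier, assuming instances are dicts keyed 'x1' through 'x7'. The add-one smoothing is an addition so a toy run never multiplies by zero; the talk itself specifies plain maximum likelihood estimation:

```python
from collections import Counter, defaultdict

GROUPS = [("x1", "x2", "x3"), ("x4", "x5", "x6"), ("x7",)]

class GroupNaiveBayes:
    """y* = argmax_y P(y) P(x1,x2,x3|y) P(x4,x5,x6|y) P(x7|y),
    with the factors estimated from labeled mention-pair instances."""

    def fit(self, instances, labels):
        self.class_counts = Counter(labels)
        self.group_counts = defaultdict(Counter)  # (group, y) -> value counts
        for inst, y in zip(instances, labels):
            for g, names in enumerate(GROUPS):
                self.group_counts[(g, y)][tuple(inst[n] for n in names)] += 1
        return self

    def predict(self, inst):
        total = sum(self.class_counts.values())

        def score(y):
            p = self.class_counts[y] / total  # P(y)
            for g, names in enumerate(GROUPS):
                counts = self.group_counts[(g, y)]
                key = tuple(inst[n] for n in names)
                # Add-one smoothing keeps unseen value combinations from
                # zeroing out the whole product in a tiny toy corpus.
                p *= (counts[key] + 1) / (sum(counts.values()) + len(counts) + 1)
            return p

        return max(self.class_counts, key=score)
```

A classifier trained this way returns 'Coref' or 'Not Coref' (whatever labels were passed to fit) for a new 7-feature instance.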

Page 42: Unsupervised Models for Coreference Resolution

First Attempt: Supervised Learning

- train on whatever annotated data we have
- need to specify a learning algorithm, a feature set, and a clustering algorithm

Pages 43–52: Unsupervised Models for Coreference Resolution

Bell-Tree Clustering (Luo et al., 2004)

- searches for the most probable partition of a set of mentions
- structures the search space as a Bell tree; for three mentions, the root [1] expands to [12] and [1][2], which in turn expand to [123], [12][3], [13][2], [1][23], and [1][2][3]
- the leaves contain all the possible partitions of all of the mentions
- it is computationally infeasible to expand all nodes in the Bell tree, so the algorithm expands only the most promising nodes

How to determine which nodes are promising?

Pages 53–63: Unsupervised Models for Coreference Resolution

Determining the Most Promising Paths

Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier.

Suppose the classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7.

- The root [1] has score 1.
- Merging mention 2 into [1] gives [12] with score 1 * Pc(1, 2) = 1 * 0.6 = 0.6.
- Starting a new cluster gives [1][2] with score 1 * (1 - Pc(1, 2)) = 1 * (1 - 0.6) = 0.4.
- At the next level, [123] gets score 0.6 * max(Pc(1, 3), Pc(2, 3)) = 0.6 * max(0.2, 0.7) = 0.42, and the remaining leaves [12][3], [13][2], [1][23], and [1][2][3] get scores 0.58, 0.08, 0.28, and 0.12 respectively.

The algorithm expands only the N most probable nodes at each level.
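A compact sketch of this beam search over partial partitions. The linking rule follows the max-link scoring in the worked example above; it is an illustration rather than Luo et al.'s exact probability model, and under this rule it reproduces most but not all of the leaf scores listed on the slide:

```python
import heapq

def bell_tree_beam(n_mentions, pc, beam=50):
    """Beam search over the Bell tree of partitions.

    pc(i, j) is the pairwise coreference probability for mentions i < j
    (0-indexed here). Linking mention k to a cluster multiplies the score
    by max over the cluster's mentions of pc(i, k); starting a new cluster
    multiplies by 1 minus the max over all earlier mentions.
    """
    states = [(1.0, [[0]])]             # (score, partition as list of clusters)
    for k in range(1, n_mentions):
        candidates = []
        for score, part in states:
            for c in range(len(part)):  # put mention k into an existing cluster
                new = [list(cl) for cl in part]
                new[c].append(k)
                candidates.append((score * max(pc(i, k) for i in part[c]), new))
            best_link = max(pc(i, k) for cl in part for i in cl)
            candidates.append((score * (1 - best_link),
                               [list(cl) for cl in part] + [[k]]))
        # Keep only the `beam` most probable nodes at this level.
        states = heapq.nlargest(beam, candidates, key=lambda s: s[0])
    return states

# The talk's toy numbers, with mentions renumbered 0, 1, 2:
P = {(0, 1): 0.6, (0, 2): 0.2, (1, 2): 0.7}
for score, part in bell_tree_beam(3, lambda i, j: P[(i, j)]):
    print(round(score, 2), part)   # 0.42 [[0, 1, 2]], 0.28 [[0], [1, 2]], ...
```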

Pages 64–65: Unsupervised Models for Coreference Resolution

Where are we?

We have described
- a learning algorithm for training a coreference classifier
- a clustering algorithm for combining coreference probabilities

Goal: evaluate this coreference system in the presence of a small amount of labeled data.

Pages 66–67: Unsupervised Models for Coreference Resolution

Experimental Setup

The ACE 2003 coreference corpus: 3 data sets (Broadcast News, Newswire, Newspaper), each with a training set and a test set. We use one training text for training the coreference classifier and evaluate on the entire test set.

Mentions are extracted automatically using an NP chunker.

Scoring program: the CEAF scoring program (Luo, 2005), reporting recall (R), precision (P), and F-measure (F).

Pages 68–72: Unsupervised Models for Coreference Resolution

Evaluation Results (experiments on system mentions; CEAF R/P/F)

                                  Broadcast News      Newswire
                                  R     P     F       R     P     F
  Weakly Supervised Baseline      53.1  45.5  49.0    57.2  50.3  53.5
  Heuristic Baseline              54.3  43.7  48.4    58.9  50.2  54.2
  Our EM-based Model              57.0  54.6  55.7    62.9  56.5  59.6
  Duplicated Haghighi and Klein   53.2  39.3  45.2    54.5  44.2  48.8
    + Relaxed Head Generation     53.4  42.8  47.5    55.9  49.8  52.6
    + Agreement Constraints       57.8  46.3  51.4    57.9  51.5  54.5
    + Pronoun-only Salience       59.2  50.8  54.7    59.4  55.6  57.4
  Fully Supervised Model          63.4  60.3  61.8    65.8  63.2  64.5

Can we improve performance by combining a small amount of labeled data and a potentially large amount of unlabeled data?

Page 73: Unsupervised Models for Coreference Resolution

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Pages 74–77: Unsupervised Models for Coreference Resolution

Self-Training

[Diagram: a small labeled data set L and a large unlabeled data set U. A classifier h is trained on L and applied to U; the N most confidently labeled instances are then moved from U into L, and the process repeats.]
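A generic sketch of this loop, assuming a scikit-learn-style classifier exposing fit, predict_proba, and classes_; the slides give no code, so the other names are illustrative:

```python
def self_train(classifier, X_labeled, y_labeled, unlabeled, n_add=100, iters=10):
    """Self-training: repeatedly train on L, label U, and promote the
    N most confidently labeled instances from U into L."""
    X_l, y_l = list(X_labeled), list(y_labeled)
    pool = list(unlabeled)
    for _ in range(iters):
        if not pool:
            break
        classifier.fit(X_l, y_l)
        probs = classifier.predict_proba(pool)
        # Confidence of an instance = probability of its most likely class.
        ranked = sorted(range(len(pool)), key=lambda i: probs[i].max(), reverse=True)
        for i in ranked[:n_add]:       # move the N most confident into L
            X_l.append(pool[i])
            y_l.append(classifier.classes_[probs[i].argmax()])
        pool = [pool[i] for i in ranked[n_add:]]
    classifier.fit(X_l, y_l)
    return classifier
```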

Page 78: Unsupervised Models for Coreference Resolution

Results (F-measure for Self-Training)

[Two line charts, for Broadcast News and Newswire, plotting F-measure (y-axis, roughly 43 to 55) against the number of self-training iterations (x-axis, 0 to 9), for self-training without bagging. F-measure does not improve over the iterations.]

Page 79: Unsupervised Models for Coreference Resolution

Why doesn't Self-Training improve?

- only the most confidently labeled instances are added in each iteration
- the classifier already knows how to label these newly added instances
- not much new knowledge is gained by re-training a classifier on such newly added instances

Pages 80–82: Unsupervised Models for Coreference Resolution

Why does Self-Training hurt?

Also due to the bias towards confidently labeled instances: many confidently labeled instances are pairs of identical proper names, e.g.,

  (India, India): Coref
  (IBM, IBM): Coref
  (prince, prince): Coref
  (Clinton, Clinton): Coref

All of these pairs share the Mention Pair Type feature value (Name, Name), so the classifier gradually learns that two proper names are likely to be coreferent, regardless of whether the names are identical.

Page 83: Unsupervised Models for Coreference Resolution

Why does Self-Training hurt?

Since we hypothesize that the Mention Pair Type feature is causing the problem, we repeat the experiments without using this feature.

Page 84: Unsupervised Models for Coreference Resolution

Results (F-measure for Self-Training)

[Two line charts, for Broadcast News and Newswire, plotting F-measure (y-axis, roughly 43 to 55) against the number of self-training iterations (x-axis, 0 to 9), comparing self-training with and without the Mention Pair (MP) Type feature.]

Page 85: Unsupervised Models for Coreference Resolution

Some Lessons Learned

- when labeled data is scarce, feature design becomes an important issue
- when exploiting unlabeled data, it is crucial to learn from both confidently labeled and not-so-confidently labeled data

Page 86: Unsupervised Models for Coreference Resolution

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Pages 87–88: Unsupervised Models for Coreference Resolution

Unsupervised Coreference as EM Clustering

Exploits unlabeled data by inducing a clustering for an unlabeled document, not by labeling mention pairs: the EM-based model is forced to learn from all of the mention pairs when the model is retrained.

Pages 89–97: Unsupervised Models for Coreference Resolution

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where C_ij = 1 iff mentions i and j are coreferent.

[Diagram: a 5 x 5 matrix over mentions 1-5, with 1-entries marking coreferent pairs and 0-entries marking non-coreferent pairs.]

- We don't care about the diagonal entries, or about the entries below the diagonal (the matrix is symmetric).
- The coreference relation must be transitive; an upper-triangular assignment is a valid clustering only if it respects transitivity, and invalid otherwise.
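A small sketch of the validity check this representation implies; the triple loop is cubic in the number of mentions, which is fine at document scale:

```python
import numpy as np

def is_valid_clustering(C):
    """A clustering matrix is valid iff the relation it encodes is
    transitive; only entries above the diagonal are consulted."""
    n = C.shape[0]
    link = lambda i, j: bool(C[min(i, j), max(i, j)])
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                # Exactly two of the three links set => transitivity violated.
                if link(i, j) + link(j, k) + link(i, k) == 2:
                    return False
    return True

C = np.zeros((5, 5), dtype=bool)
C[0, 1] = C[0, 2] = C[1, 2] = True   # mentions 1, 2, 3 coreferent: valid
print(is_valid_clustering(C))        # True
C[1, 2] = False                      # 1~2 and 1~3 but not 2~3: invalid
print(is_valid_clustering(C))        # False
```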

Pages 98–114: Unsupervised Models for Coreference Resolution

The Generative Model

Given a document D, generate a clustering C according to P(C), then generate D given C:

P(D, C) = P(C)\, P(D \mid C)

How to generate D given C? Assume that D is represented by its mention pairs; to generate D, generate all pairs of mentions in D, e.g., (Queen Elizabeth, her), (Queen Elizabeth, husband), (Queen Elizabeth, King George VI), ...

P(D, C) = P(C)\, P(mp_{12}, mp_{13}, mp_{14}, \ldots \mid C)

where mp_{ij} is the pair formed from mention i and mention j. Let's simplify this term: assume that each mention pair mp_{ij} is generated conditionally independently given C_{ij}:

P(D, C) = P(C) \prod_{ij \in \text{Pairs}(D)} P(mp_{ij} \mid C_{ij})

How to represent a mention pair mp_{ij}? Use the same 7 linguistic features in 3 groups as before (Strong Coreference Indicators: string match, appositive, alias; Linguistic Constraints: gender agreement, number agreement, semantic compatibility; Mention Pair Type), so each mention pair is a vector of 7 feature values:

P(D, C) = P(C) \prod_{ij \in \text{Pairs}(D)} P(mp_{ij,1}, mp_{ij,2}, \ldots, mp_{ij,7} \mid C_{ij})

Let's simplify this term too: assume that feature values from different groups are conditionally independent of each other given C_{ij}:

P(D, C) = P(C) \prod_{ij \in \text{Pairs}(D)} P(mp_{ij,1}, mp_{ij,2}, mp_{ij,3} \mid C_{ij})\, P(mp_{ij,4}, mp_{ij,5}, mp_{ij,6} \mid C_{ij})\, P(mp_{ij,7} \mid C_{ij})
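For concreteness, a minimal sketch of the log-likelihood this factorization assigns to a document paired with a clustering; the data structures here are illustrative, not the talk's implementation:

```python
import math

GROUP_INDEX = [(0, 1, 2), (3, 4, 5), (6,)]  # feature positions of the 3 groups

def log_p_doc_and_clustering(pairs, C, params, log_p_C):
    """log P(D, C) under the factored model above.

    pairs maps (i, j) to a 7-tuple of feature values; C maps (i, j) to
    'Coref' or 'Not Coref'; params[(g, c)] maps a group's value tuple to
    its probability (assumed positive, e.g. smoothed)."""
    ll = log_p_C  # log P(C)
    for ij, feats in pairs.items():
        c = C[ij]
        for g, idx in enumerate(GROUP_INDEX):
            ll += math.log(params[(g, c)][tuple(feats[k] for k in idx)])
    return ll
```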

Pages 115–118: Unsupervised Models for Coreference Resolution

Model Parameters

P(mp_1, mp_2, mp_3 \mid c), \quad P(mp_4, mp_5, mp_6 \mid c), \quad P(mp_7 \mid c)

where the mp_i are the feature values and c ∈ {Coref, Not Coref}.

If we had labeled data, we could estimate the parameters. But we don't have labeled data. So ...

Page 119: Unsupervised Models for Coreference Resolution

Model Parameters

Use EM to iteratively
- estimate the model parameters
- probabilistically induce a clustering for a document

Pages 120–129: Unsupervised Models for Coreference Resolution

The Induction Algorithm

Given a set of unlabeled documents:
- guess a clustering for each document according to P(C); these initial labelings are presumably noisy
- estimate the model parameters based on the automatically labeled documents (M-step), using maximum likelihood estimation
- assign a probability to each possible clustering of the mentions for each document (E-step); e.g., for 3 mentions 1, 2, 3: [123] 0.23, [12][3] 0.21, [13][2] 0.11, [1][23] 0.29, [1][2][3] 0.05, plus the invalid clusterings
- iterate till convergence

How to cope with the computational complexity of the E-step?

Pages 130–131: Unsupervised Models for Coreference Resolution

Approximating the E-step

Search for the N most probable clusterings only, using the Bell tree algorithm.

Page 132: Unsupervised Models for Coreference Resolution

The Induction Algorithm

Given a set of unlabeled documents:
- guess a clustering for each document according to P(C)
- estimate the model parameters based on the automatically labeled documents (M-step), using maximum likelihood estimation
- assign a probability to each possible clustering of the mentions of each document (E-step), using the normalized scores of the 50-best clusterings
- iterate till convergence
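The loop, as a sketch. Both callables are assumed rather than implemented here: n_best_clusterings would be the Bell-tree search (falling back to the P(C) guess when params is None), and estimate_params would be the maximum likelihood estimator:

```python
def induce(documents, n_best_clusterings, estimate_params, iters=10):
    """EM-style induction: alternate an approximate E-step over the
    50-best clusterings and an MLE M-step over the weighted labelings."""
    params = None                      # initial guess comes from P(C) alone
    for _ in range(iters):             # "iterate till convergence"
        weighted = []
        for doc in documents:
            best = n_best_clusterings(doc, params)          # approximate E-step
            z = sum(s for _, s in best)
            weighted += [(doc, c, s / z) for c, s in best]  # normalized scores
        params = estimate_params(weighted)                  # M-step (MLE)
    return params
```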

Page 133: Unsupervised Models for Coreference Resolution

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Pages 134–140: Unsupervised Models for Coreference Resolution

Haghighi and Klein's Model

A cluster-level model: it assigns a cluster id to each mention, which ensures transitivity automatically.

Queen Elizabeth [1] set about transforming her [1] husband [2], King George VI [2], into a viable monarch [3]. Logue [4], a renowned speech therapist [4], was summoned to help the King [2] overcome his [2] speech impediment [5]...

Pages 141–146: Unsupervised Models for Coreference Resolution

Haghighi and Klein's Generative Story

For each mention encountered in a document:
- generate a cluster id for the mention (according to some cluster id distribution)
- generate the head noun of the mention (according to some cluster-specific head distribution)

Inference: Gibbs sampling.

Problem with the model: too simplistic! Mentions with the same head are likely to get the same cluster id; e.g., two occurrences of "she" will likely be posited as coreferent. This is particularly inappropriate for generating pronouns.

Extensions:
- use a separate "pronoun head model" to generate pronouns
- incorporate salience
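A toy rendering of this two-step story. The Chinese-restaurant-process prior and the 80/20 head distribution are illustrative stand-ins: Haghighi and Klein use a nonparametric Bayesian prior, and the exact distributions are not specified on these slides:

```python
import random
from collections import Counter

def generate_mentions(n, alpha=1.0, heads=("queen", "king", "logue", "she")):
    """For each mention: draw a cluster id, then draw a head noun from
    that cluster's (placeholder) head distribution."""
    sizes = Counter()          # cluster id -> number of mentions so far
    preferred = {}             # cluster id -> that cluster's preferred head
    doc = []
    for _ in range(n):
        # CRP prior: join an existing cluster with prob size/(total+alpha),
        # otherwise open a new cluster.
        r = random.uniform(0, sum(sizes.values()) + alpha)
        for cid, size in sizes.items():
            if r < size:
                break
            r -= size
        else:
            cid = len(sizes) + 1
            preferred[cid] = random.choice(heads)
        sizes[cid] += 1
        # Cluster-specific head distribution: usually the preferred head.
        head = preferred[cid] if random.random() < 0.8 else random.choice(heads)
        doc.append((cid, head))
    return doc

print(generate_mentions(10))   # e.g. [(1, 'queen'), (1, 'queen'), (2, 'she'), ...]
```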

Page 147: Unsupervised Models for Coreference Resolution

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training and its variant (Ng and Cardie, 2003)
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications: relaxed head generation, agreement constraints, pronoun-only salience

Pages 148–154: Unsupervised Models for Coreference Resolution

Modification 1: Relaxed Head Generation

Motivation: H&K's model is linguistically impoverished; it does not exploit useful knowledge sources such as alias and appositive information.

Goal: a simple method for incorporating such knowledge sources.

Pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
- they are the same string, or
- they are aliases, or
- they are in an appositive relation.

E.g., "International Business Corporation" and "IBM" both get head id 1, while "Barcelona" gets head id 2.

Then, instead of generating the head noun, generate the head id. The model now views "International Business Corporation" and "IBM" as two mentions having the same head, which encourages the model to put the two into the same cluster.
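A bare-bones sketch of this pre-processing pass. The same_head test below handles only exact string matches and a naive acronym check, so the usage example spells out the company name; a real alias and appositive detector would be far richer:

```python
def assign_head_ids(mentions):
    """Give two mentions the same head id iff they are the same string
    or pass a naive acronym test (stand-in for the full alias and
    appositive criteria on the slide)."""
    def acro(s):
        return "".join(w[0] for w in s.split() if w[:1].isupper())

    def same_head(a, b):
        return a.lower() == b.lower() or acro(a) == b or acro(b) == a

    reps, head_ids = [], []          # reps[k] represents head id k+1
    for m in mentions:
        for hid, rep in enumerate(reps, start=1):
            if same_head(m, rep):
                head_ids.append(hid)
                break
        else:
            reps.append(m)
            head_ids.append(len(reps))
    return head_ids

print(assign_head_ids(["International Business Machines", "IBM", "Barcelona"]))
# [1, 1, 2]
```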

Page 155: Unsupervised Models for                               Coreference Resolution

155

Modification 2: Agreement ConstraintsMotivation

gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model

Page 156: Unsupervised Models for                               Coreference Resolution

156

Modification 2: Agreement ConstraintsMotivation

gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model

while the model favours the assignment of a pronoun to a gender- and number-compatible cluster

it also favours the assignment of a pronoun to a large cluster

Page 157: Unsupervised Models for                               Coreference Resolution

157

Modification 2: Agreement Constraints

Motivation

gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model

while the model favours the assignment of a pronoun to a gender- and number-compatible cluster

it also favours the assignment of a pronoun to a large cluster

if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible

Page 158: Unsupervised Models for                               Coreference Resolution

158

Modification 2: Agreement Constraints

Motivation

gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model

while the model favours the assignment of a pronoun to a gender- and number-compatible cluster

it also favours the assignment of a pronoun to a large cluster

if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible

Goal: implement gender and number agreement as a constraint

Page 159: Unsupervised Models for                               Coreference Resolution

159

disallow the generation of a mention by any cluster where the two are incompatible in number or gender

Modification 2: Agreement Constraints
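A minimal sketch of enforcing agreement as a hard filter during cluster assignment (not from the talk); the attribute names and the 'unknown' convention are assumptions:

```python
# Sketch of agreement as a hard constraint (not from the talk). The attribute
# names and values are illustrative assumptions.
def compatible(m1, m2):
    """A mention and a cluster member agree unless a known value conflicts."""
    for attr in ("gender", "number"):
        v1, v2 = m1[attr], m2[attr]
        if "unknown" not in (v1, v2) and v1 != v2:
            return False
    return True

def candidate_clusters(mention, clusters):
    """Disallow generation of the mention by any incompatible cluster."""
    return [c for c in clusters if all(compatible(mention, m) for m in c)]

she = {"gender": "fem", "number": "sing"}
clusters = [[{"gender": "masc", "number": "sing"}],   # e.g., {King George VI}
            [{"gender": "fem", "number": "sing"}]]    # e.g., {Queen Elizabeth}
print(len(candidate_clusters(she, clusters)))  # -> 1: only the second survives
```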

Page 160: Unsupervised Models for                               Coreference Resolution

160

Modification 3: Pronoun-Only Salience

In H&K’s model, salience is applied to all types of mentions (pronouns, names and nominals) during cluster assignment

Our hypothesis: since names and nominals are less sensitive to salience, the net benefit of applying salience to names and nominals could be negative as a result of inaccurate modeling of salience

We restrict the application of salience to pronouns only

Page 161: Unsupervised Models for                               Coreference Resolution

161

Improving Haghighi and Klein’s Model

3 modifications: relaxed head generation, agreement constraints, pronoun-only salience

Page 162: Unsupervised Models for                               Coreference Resolution

162

Evaluation

EM-based model
Haghighi and Klein’s model, with and without the 3 modifications

Page 163: Unsupervised Models for                               Coreference Resolution

163

Experimental Setup

The ACE 2003 coreference corpus: 3 data sets (Broadcast News, Newswire, Newspaper)

For each data set
  use one training text for initializing model parameters
  evaluate on the entire test set

Mentions extracted automatically using an NP chunker

Scoring program: CEAF scoring program (Luo, 2005)

Page 164: Unsupervised Models for                               Coreference Resolution

164

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Weakly Supervised Baseline)

Train the Bayes classifier on one (labeled) document

Use the Bell Tree clustering algorithm to impose a partition for each test document using the pairwise probabilities

Page 165: Unsupervised Models for                               Coreference Resolution

165

Heuristic Baseline

Simple rule-based system

Posits two mentions as coreferent if and only if they are
  the same string
  or aliases
  or in an appositive relation

Page 166: Unsupervised Models for                               Coreference Resolution

166

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Heuristic Baseline)

Page 167: Unsupervised Models for                               Coreference Resolution

167

EM-Based Model

Initialize the parameters using one (labeled) document rather than using randomly guessed clusterings

Page 168: Unsupervised Models for                               Coreference Resolution

168

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (EM-Based Model)

Page 169: Unsupervised Models for                               Coreference Resolution

169

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (EM-Based Model)

gains in both recall and precision; F-measure increases by 5-7%

Page 170: Unsupervised Models for                               Coreference Resolution

170

Duplicated Haghighi and Klein’s Model

Use the same labeled document as in the EM-based model to learn the value of the concentration parameter α in the Dirichlet Process

Page 171: Unsupervised Models for                               Coreference Resolution

171

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Duplicated H&K’s Model)

Page 172: Unsupervised Models for                               Coreference Resolution

172

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Duplicated H&K’s Model)

In comparison to the EM-based model: precision drops substantially; F-measure decreases by 10-11%

Page 173: Unsupervised Models for                               Coreference Resolution

173

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Adding 3 Modifications)

Page 174: Unsupervised Models for                               Coreference Resolution

174

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Adding 3 Modifications)

In comparison to Duplicated Haghighi and Klein: F-measure improves after the addition of each modification

Page 175: Unsupervised Models for                               Coreference Resolution

175

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Adding 3 Modifications)

In comparison to Duplicated Haghighi and Klein: F-measure improves after the addition of each modification; modest gain in recall and substantial gain in precision when all modifications are applied (9-10% gain in F-measure)

Page 176: Unsupervised Models for                               Coreference Resolution

176

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5

Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2

Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6

Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8

+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6

+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5

+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (Fully-Supervised Resolver)

Trained using C4.5, the entire ACE training set, and 34 features
Outperforms the unsupervised models by 7%

Page 177: Unsupervised Models for                               Coreference Resolution

177

Using a Knowledge-Based Feature

Add a feature to the EM-based model that encodes the output of a knowledge-based coreference system, which implements heuristics used by different MUC-7 resolvers

Resulting model not so “unsupervised”

Page 178: Unsupervised Models for                               Coreference Resolution

178

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

EM-based Model (w/ KB feature) 65.4 53.3 58.8 68.1 58.2 62.8

EM-based Model (w/o KB feature) 57.0 54.6 55.7 62.9 56.5 59.6

Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5

Results (EM-Based Model w/ KB Feature)

Page 179: Unsupervised Models for                               Coreference Resolution

179

Summary

Examined unsupervised models for coreference resolution: self-training, EM, Haghighi and Klein’s model
  these require little labeled data, which facilitates their application to resource-scarce languages

EM-based model and modified H&K’s model outperform self-training and H&K’s original model

Not as competitive as fully-supervised model, but …

Page 180: Unsupervised Models for                               Coreference Resolution

180

Summary (Cont’d)

… they can potentially be improved by
  incorporating additional linguistic features (feature engineering remains a challenging issue)
  combining a large amount of labeled data with a large amount of unlabeled data

generative modeling is interesting in itself

Page 181: Unsupervised Models for                               Coreference Resolution

181

Summary

Examined unsupervised models for coreference resolution: self-training, EM, Haghighi and Klein’s model
  these require little labeled data, which facilitates their application to resource-scarce languages

Self-training with and without bagging
  doesn’t improve (and sometimes even hurts) performance
  augments labeled data with only confidently-labeled instances, so little knowledge is gained by the classifier
  careful feature design is an especially important issue
  need to label both confident and not-so-confident instances

Page 182: Unsupervised Models for                               Coreference Resolution

182

Summary (Cont’d)

EM-based generative model
  induces a clustering on an unlabeled document
  outperforms Haghighi and Klein’s coreference model

Three extensions to Haghighi and Klein’s generative model
  each modification improves F-measure

Not as competitive as fully-supervised model, but …
  generative modeling is interesting in itself
  feature engineering remains a crucial yet challenging issue

Page 183: Unsupervised Models for                               Coreference Resolution

183

Weakly Supervised Baseline

Train the Naïve Bayes classifier on one (labeled) document

Use the Bell Tree clustering algorithm to impose a partition on each test document using the pairwise probabilities

Page 184: Unsupervised Models for                               Coreference Resolution

184

Experimental Setup

The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
  each has a training set and a test set
  use one training text for training the Bayes coreference classifier; evaluate on the entire test set

Mentions extracted automatically using an NP chunker

Scoring program: MUC scoring program (Vilain et al., 1995) ????

Page 185: Unsupervised Models for                               Coreference Resolution

185

Experimental Setup

The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
  each has a training set and a test set
  use one training text for training the Bayes coreference classifier; evaluate on the entire test set

Mentions extracted automatically using an NP chunker

Scoring program: MUC scoring program (Vilain et al., 1995) ????
  2 problems
    under-penalizes partitions where mentions are over-clustered
    does not reward successful identification of singleton clusters

Page 186: Unsupervised Models for                               Coreference Resolution

186

The Bayes Classifier

finds the class value y that is the most probable given the feature vector x1, ..., xn: finds y* such that

y* = argmax_{y ∈ Y} P(y | x1, x2, ..., xn)
   = argmax_{y ∈ Y} P(y) P(x1, x2, ..., x7 | y)
   = argmax_{y ∈ Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)

where y ranges over Y = { COREF, NOT COREF }

The group terms P(x1, x2, x3 | y), P(x4, x5, x6 | y), and P(x7 | y) are the model parameters (to be estimated from annotated data using maximum likelihood estimation)

Not as naïve as Naïve Bayes … independence is assumed only across the three feature groups, not across individual features
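A small sketch of this grouped factorization (not from the talk's code), assuming categorical feature values and add-one smoothing on top of the maximum likelihood counts:

```python
# Sketch of the group-factored Bayes classifier (not from the talk's code);
# the grouping follows the factorization above.
from collections import Counter

GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]  # indicators / constraints / type pair

def train(pairs):
    """pairs: list of (7-tuple of feature values, label)."""
    prior = Counter(y for _, y in pairs)
    group_counts = {g: Counter() for g in GROUPS}
    for x, y in pairs:
        for g in GROUPS:
            group_counts[g][(tuple(x[i] for i in g), y)] += 1
    return prior, group_counts

def classify(x, prior, group_counts, labels=("COREF", "NOT COREF")):
    def score(y):
        p = prior[y] / sum(prior.values())
        for g in GROUPS:
            num = group_counts[g][(tuple(x[i] for i in g), y)] + 1
            p *= num / (prior[y] + 1)  # smoothed estimate of P(group | y)
        return p
    return max(labels, key=score)

pairs = [((1, 0, 0, 1, 1, 1, "NN"), "COREF"),
         ((0, 0, 0, 0, 1, 1, "PN"), "NOT COREF")]
model = train(pairs)
print(classify((1, 0, 0, 1, 1, 1, "NN"), *model))  # -> COREF
```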

Page 187: Unsupervised Models for                               Coreference Resolution

187

Results (Self-Training w/ and w/o Bagging)

[two line charts, Broadcast News and Newswire: F-measure (37-55) against number of self-training iterations (0-9), with curves for w/ bagging (5 bags) and w/o bagging]

Page 188: Unsupervised Models for                               Coreference Resolution

188

Self-Training with Bagging

[diagram: a small labeled data set L and a large unlabeled data set U]

Page 189: Unsupervised Models for                               Coreference Resolution

189

Self-Training with Bagging

Create k training sets, each of size |L|, by sampling from L with replacement

Train k classifiers

[diagram: the labeled data L resampled into k training sets; the unlabeled data U]

Page 190: Unsupervised Models for                               Coreference Resolution

190

Self-Training with Bagging

[diagram: bagged classifiers h1, h2, …, hk trained on the resampled sets; the unlabeled data U]

Page 191: Unsupervised Models for                               Coreference Resolution

191

Self-Training with Bagging

[diagram repeated: bagged classifiers h1, h2, …, hk and the unlabeled data U]

Page 192: Unsupervised Models for                               Coreference Resolution

192

Self-Training with Bagging

[diagram: bagged classifiers h1, h2, …, hk label the unlabeled data U]

N labeled instances with the highest average confidence

Page 193: Unsupervised Models for                               Coreference Resolution

193

Why doesn’t Self-Training improve?

only the most confidently labeled instances are added in each iteration
the classifier already knows how to label these newly added instances
not much new knowledge is gained by re-training a classifier on such newly added instances

Need to learn from both the confidently and not-so-confidently labeled instances

Page 194: Unsupervised Models for                               Coreference Resolution

194

Haghighi and Klein’s Model

Nonparametric Bayesian model

Page 195: Unsupervised Models for                               Coreference Resolution

195

Haghighi and Klein’s Model

Nonparametric Bayesian model

Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely

Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)

Page 196: Unsupervised Models for                               Coreference Resolution

196

Haghighi and Klein’s Model

Nonparametric Bayesian model

Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely

Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)

Page 197: Unsupervised Models for                               Coreference Resolution

197

Haghighi and Klein’s Model

Nonparametric Bayesian model

Given a set of mentions X, find the most likely partition Z. Find the Z that maximizes

Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely

Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)

P(Z | X) = ∫ P(Z | X, θ) P(θ | X) dθ

Page 198: Unsupervised Models for                               Coreference Resolution

198

Haghighi and Klein’s Model

Nonparametric Bayesian model

Given a set of mentions X, find the most likely partition Z. Find the Z that maximizes

Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely

Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)

P(Z | X) = ∫ P(Z | X, θ) P(θ | X) dθ

Integrate out the parameters: the integral over θ
Encode prior knowledge on hypotheses: the term P(θ | X)

Page 199: Unsupervised Models for                               Coreference Resolution

199

Bell-Tree Clustering (Luo et al., 2004)

searches for the most probable partition of a set of mentions

structures the search space as a Bell tree

[diagram: Bell tree over 3 mentions; root [1]; children of [1]: [12], [1][2]; children of [12]: [123], [12][3]; children of [1][2]: [13][2], [1][23], [1][2][3]]

Page 200: Unsupervised Models for                               Coreference Resolution

200

Bell-Tree Clustering (Luo et al., 2004)

searches for the most probable partition of a set of mentions

structures the search space as a Bell tree

[diagram: Bell tree over 3 mentions; root [1]; children of [1]: [12], [1][2]; children of [12]: [123], [12][3]; children of [1][2]: [13][2], [1][23], [1][2][3]]

expands only the most promising paths

Page 201: Unsupervised Models for                               Coreference Resolution

201

Bell-Tree Clustering (Luo et al., 2004)

searches for the most probable partition of a set of mentions

structures the search space as a Bell tree

[diagram: Bell tree over 3 mentions; root [1]; children of [1]: [12], [1][2]; children of [12]: [123], [12][3]; children of [1][2]: [13][2], [1][23], [1][2][3]]

expands only the most promising paths

How to determine which paths are promising?

Page 202: Unsupervised Models for                               Coreference Resolution

202

Determining the Most Promising Paths

Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier

Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

[diagram: Bell tree with node scores]
score([1]) = 1
score([12]) = Pc(1, 2) = 0.6;  score([1][2]) = 1 - Pc(1, 2) = 0.4
score([123]) = 0.6 * max(Pc(1, 3), Pc(2, 3)) = 0.6 * max(0.2, 0.7) = 0.42
score([12][3]) = 0.6 * (1 - max(Pc(1, 3), Pc(2, 3))) = 0.6 * (1 - max(0.2, 0.7)) = 0.18
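A minimal sketch of expanding one Bell-tree node with these scores (not from the talk's code); the start-new-cluster rule follows the formula on the slide, and the link rule takes the maximum pairwise probability, which reproduces the 0.42 / 0.18 values:

```python
# Sketch of expanding one Bell-tree node (not from the talk's code).
Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}  # pairwise coreference probabilities

def link_prob(cluster, m):
    return max(Pc[tuple(sorted((i, m)))] for i in cluster)

def children(partition, score, m):
    """Expand a node (partition, score) with the next mention m."""
    kids = []
    for k, cluster in enumerate(partition):            # link m into cluster k
        new = [c | {m} if i == k else c for i, c in enumerate(partition)]
        kids.append((new, score * link_prob(cluster, m)))
    best = max(link_prob(c, m) for c in partition)     # or start a new cluster
    kids.append((partition + [{m}], score * (1 - best)))
    return kids

for p, s in children([{1, 2}], 0.6, 3):
    print(p, round(s, 2))   # [{1, 2, 3}] 0.42  and  [{1, 2}, {3}] 0.18
```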

Page 203: Unsupervised Models for                               Coreference Resolution

203

Determining the Most Promising Paths

Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier

Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

[diagram: Bell tree with node scores: [1] = 1; [12] = 0.6, [1][2] = 0.4; [123] = 0.42, [12][3] = 0.18]

Page 204: Unsupervised Models for                               Coreference Resolution

204

Plan for the Talk

Supervised learning for coreference resolution
  brief history
  standard machine learning approach

Unsupervised learning for coreference resolution
  self-training and its variant (Ng and Cardie, 2003)
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications

Page 205: Unsupervised Models for                               Coreference Resolution

205

Standard Supervised Learning Approach

Classification: given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent

create one training instance for each pair of mentions from texts annotated with coreference information
  feature vector: describes the two mentions

train a classifier using a machine learning algorithm
  decision tree learner (C5), maximum entropy, SVMs

[Queen Elizabeth] set about transforming [her] [husband], ...

coref ?

not coref ?

coref ?
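A minimal sketch of this instance creation (not from the talk's code); extract_features is a placeholder for the 7-feature description used later:

```python
# Sketch of pairwise training instance creation (not from the talk's code).
from itertools import combinations

def extract_features(mi, mj):
    return (mi["text"] == mj["text"],)  # placeholder: string match only

def make_instances(mentions, chain_of):
    """chain_of: mention index -> gold coreference chain id."""
    instances = []
    for i, j in combinations(range(len(mentions)), 2):
        label = "COREF" if chain_of[i] == chain_of[j] else "NOT COREF"
        instances.append((extract_features(mentions[i], mentions[j]), label))
    return instances

mentions = [{"text": "Queen Elizabeth"}, {"text": "her"}, {"text": "husband"}]
print(make_instances(mentions, {0: 1, 1: 1, 2: 2}))  # 3 labeled pairs
```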

Page 206: Unsupervised Models for                               Coreference Resolution

206

Related Work

Apply a weakly supervised or unsupervised learning algorithm to pronoun resolution

co-training (Müller et al., 2002)

self-training (Kehler et al., 2004)

Page 207: Unsupervised Models for                               Coreference Resolution

207

Linguistic Features

Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators
  String match
  Appositive
  Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
  Gender agreement
  Number agreement
  Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

Heuristics

Page 208: Unsupervised Models for                               Coreference Resolution

208

Linguistic Features

Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators

String match Appositive Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

How to compute the semantic class of a mention?

Page 209: Unsupervised Models for                               Coreference Resolution

209

Linguistic Features

Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators

String match Appositive Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

How to compute the semantic class of a mention?
  Proper names: use a named entity recognizer
  Nominals: induced from an unannotated corpus

Page 210: Unsupervised Models for                               Coreference Resolution

210

Inducing Semantic Classes

Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS

Page 211: Unsupervised Models for                               Coreference Resolution

211

Inducing Semantic Classes

Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS

Given a large, unannotated corpus

Use a parser to extract appositive relations <Eastern Airlines, carrier>, <George Bush, president>, …

Use a named entity recognizer to find the semantic classes of the proper names

Infer the semantic class of a nominal from the associated proper name

Page 212: Unsupervised Models for                               Coreference Resolution

212

Potential Problems

Named entity recognizer is not perfect: mislabels proper names

Parser is not perfect: extracts mention pairs that are not in apposition

Page 213: Unsupervised Models for                               Coreference Resolution

213

Potential Problems

Named entity recognizer is not perfect: mislabels proper names

Parser is not perfect: extracts mention pairs that are not in apposition

To improve robustness:
1. Compute the probability that the nominal co-occurs with each of the named entity types
2. If the most likely NE type has a probability above 0.7, label the nominal with the most likely NE type
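A minimal sketch of these two robustness steps (not from the talk's code); the fallback to OTHERS below the threshold is an assumption:

```python
# Sketch of the semantic-class induction robustness steps (not from the talk).
from collections import Counter, defaultdict

def induce_semantic_classes(appositive_pairs, ne_type_of, threshold=0.7):
    """appositive_pairs: (proper name, nominal); ne_type_of: name -> NE type."""
    counts = defaultdict(Counter)
    for name, nominal in appositive_pairs:
        if name in ne_type_of:
            counts[nominal][ne_type_of[name]] += 1
    classes = {}
    for nominal, c in counts.items():
        ne_type, n = c.most_common(1)[0]          # step 1: most likely NE type
        if n / sum(c.values()) > threshold:       # step 2: threshold at 0.7
            classes[nominal] = ne_type
        else:
            classes[nominal] = "OTHERS"           # assumed fallback
    return classes

pairs = [("Eastern Airlines", "carrier"), ("George Bush", "president")]
print(induce_semantic_classes(pairs, {"Eastern Airlines": "ORGANIZATION",
                                      "George Bush": "PERSON"}))
```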

Page 214: Unsupervised Models for                               Coreference Resolution

214

Experiments on System Mentions: Broadcast News (MUC F, CEAF F) | Newswire (MUC F, CEAF F)

Weakly Supervised Baseline 38.0 49.0 42.8 53.5

Heuristic Baseline 36.4 48.4 43.2 54.2

Our EM-based Model 51.6 55.7 57.8 59.6

Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8

+ Relaxed Head Generation 47.0 47.5 45.0 52.6

+ Agreement Constraints 48.9 51.4 46.0 54.5

+ Pronoun-only Salience 52.6 54.7 50.0 57.4

Fully Supervised Model 60.4 61.8 60.6 64.5

MUC and CEAF F-Scores

Page 215: Unsupervised Models for                               Coreference Resolution

215

Experiments on System Mentions: Broadcast News (MUC F, CEAF F) | Newswire (MUC F, CEAF F)

Weakly Supervised Baseline 38.0 49.0 42.8 53.5

Heuristic Baseline 36.4 48.4 43.2 54.2

Our EM-based Model 51.6 55.7 57.8 59.6

Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8

+ Relaxed Head Generation 47.0 47.5 45.0 52.6

+ Agreement Constraints 48.9 51.4 46.0 54.5

+ Pronoun-only Salience 52.6 54.7 50.0 57.4

Fully Supervised Model 60.4 61.8 60.6 64.5

MUC and CEAF F-Scores

Page 216: Unsupervised Models for                               Coreference Resolution

216

Experiments on System Mentions: Broadcast News (MUC F, CEAF F) | Newswire (MUC F, CEAF F)

Weakly Supervised Baseline 38.0 49.0 42.8 53.5

Heuristic Baseline 36.4 48.4 43.2 54.2

Our EM-based Model 51.6 55.7 57.8 59.6

Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8

+ Relaxed Head Generation 47.0 47.5 45.0 52.6

+ Agreement Constraints 48.9 51.4 46.0 54.5

+ Pronoun-only Salience 52.6 54.7 50.0 57.4

Fully Supervised Model 60.4 61.8 60.6 64.5

MUC and CEAF F-Scores

Page 217: Unsupervised Models for                               Coreference Resolution

217

Experiments on System Mentions: Broadcast News (MUC F, CEAF F) | Newswire (MUC F, CEAF F)

Weakly Supervised Baseline 38.0 49.0 42.8 53.5

Heuristic Baseline 36.4 48.4 43.2 54.2

Our EM-based Model 51.6 55.7 57.8 59.6

Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8

+ Relaxed Head Generation 47.0 47.5 45.0 52.6

+ Agreement Constraints 48.9 51.4 46.0 54.5

+ Pronoun-only Salience 52.6 54.7 50.0 57.4

Fully Supervised Model 60.4 61.8 60.6 64.5

MUC and CEAF F-Scores

Similar performance trends across the 2 scoring programs

Page 218: Unsupervised Models for                               Coreference Resolution

218

Experiments using Perfect Mentions

Perfect mentions are NPs marked up in the answer key; using them makes the coreference task somewhat easier

Similar performance trends observed, except that the unsupervised models perform comparably to the fully-supervised resolver

Conclusions drawn from system mentions are not always generalizable to perfect mentions and vice versa

Page 219: Unsupervised Models for                               Coreference Resolution

219

Summary

Presented an EM-based model for unsupervised coreference resolution that outperforms Haghighi and Klein’s coreference model

compares favourably to a modified version of their model

Page 220: Unsupervised Models for                               Coreference Resolution

220

H&K’s Model: Salience Modeling

Each entity/cluster is initially assigned a salience value of 0
As we process the discourse, the salience value of each entity will change
When we encounter a mention, we update the salience scores (multiply each entity’s salience by 0.5, then add 1 to the current entity)
Then discretize the salience values into 5 buckets: TOP, HIGH, MID, LOW, NONE

Using a separate corpus, estimate the probability P(mention type | salience), where mention type can be pronoun, name, or nominal. E.g.,
  P(pronoun | TOP) is a large value
  P(nominal | TOP) is a small value

The model is sensitive to these estimated values
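A minimal sketch of the update and discretization (not from the talk's code); the bucket boundaries are hypothetical, since the talk does not give them:

```python
# Sketch of the salience update and bucketing (not from the talk's code).
def update_salience(salience, current_entity):
    for e in salience:
        salience[e] *= 0.5                       # decay every entity
    salience[current_entity] = salience.get(current_entity, 0.0) + 1.0
    return salience

def bucket(s):  # hypothetical boundaries
    if s >= 1.0:  return "TOP"
    if s >= 0.5:  return "HIGH"
    if s >= 0.25: return "MID"
    if s > 0.0:   return "LOW"
    return "NONE"

s = {}
s = update_salience(s, "QUEEN_ELIZABETH")        # "Queen Elizabeth"
s = update_salience(s, "QUEEN_ELIZABETH")        # "her"
print(bucket(s["QUEEN_ELIZABETH"]))              # -> TOP
```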

Page 221: Unsupervised Models for                               Coreference Resolution

221

Why Salience Modeling?

Important for pronouns

For H&K, since they don’t use features like apposition, modeling salience may allow mentions in an appositive to be assigned the same cluster id.

Page 222: Unsupervised Models for                               Coreference Resolution

222

Parameter Initialization

0.4 (true mentions) and 0.7 (system mentions)
concentration parameter: e^-4

Page 223: Unsupervised Models for                               Coreference Resolution

223

Parameter Initialization

Uses one (labeled) document taken from the training set to
  initialize the parameters of our EM-based model
  determine the concentration parameter, α, in H&K’s model

Page 224: Unsupervised Models for                               Coreference Resolution

224

Experiments with Perfect Mentions

Similar performance trends observed, except that the unsupervised models perform comparably to the fully-supervised resolver

Conclusions drawn from perfect mentions are not always generalizable to system mentions and vice versa

Results obtained using perfect mentions should not be compared against those obtained using system mentions

Page 225: Unsupervised Models for                               Coreference Resolution

225

Degenerate EM Baseline

Model obtained after one iteration of EM

No parameter re-estimation on the unlabeled data

Page 226: Unsupervised Models for                               Coreference Resolution

226

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Heuristic Baseline 30.9 44.3 36.4 36.3 53.4 43.2

Degenerate EM Baseline 70.8 36.3 48.0 69.0 25.1 36.8

Our EM-based Model 42.4 66.0 51.6 55.2 60.6 57.8

Haghighi and Klein Baseline 50.8 40.7 45.2 43.0 40.9 41.9

+ Relaxed Head Generation 48.3 45.7 47.0 40.9 50.0 45.0

+ Agreement Constraints 50.4 47.5 48.9 41.7 51.2 46.0

+ Pronoun-only Salience 52.2 53.0 52.6 44.3 57.3 50.0

Fully Supervised Model 53.0 70.3 60.4 53.1 70.5 60.6

Degenerate EM Baseline: MUC Results

Page 227: Unsupervised Models for                               Coreference Resolution

227

Degenerate EM Baseline: MUC Results

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Heuristic Baseline 30.9 44.3 36.4 36.3 53.4 43.2

Degenerate EM Baseline 70.8 36.3 48.0 69.0 25.1 36.8

Our EM-based Model 42.4 66.0 51.6 55.2 60.6 57.8

Haghighi and Klein Baseline 50.8 40.7 45.2 43.0 40.9 41.9

+ Relaxed Head Generation 48.3 45.7 47.0 40.9 50.0 45.0

+ Agreement Constraints 50.4 47.5 48.9 41.7 51.2 46.0

+ Pronoun-only Salience 52.2 53.0 52.6 44.3 57.3 50.0

Fully Supervised Model 53.0 70.3 60.4 53.1 70.5 60.6

large gain in recall and large drop in precision (over-clustering)

F-score increases for one data set and drops for the other

Page 228: Unsupervised Models for                               Coreference Resolution

228

EM-Based Model: MUC Results

In comparison to Degenerate EM:
  large drop in recall, but larger gain in precision
  F-score increases by 4-21%
  gains attributed to exploitation of unlabeled data

Experiments on System Mentions: Broadcast News (R P F) | Newswire (R P F)

Heuristic Baseline 30.9 44.3 36.4 36.3 53.4 43.2

Our EM-based Model 42.4 66.0 51.6 55.2 60.6 57.8

Haghighi and Klein Baseline 50.8 40.7 45.2 43.0 40.9 41.9

+ Relaxed Head Generation 48.3 45.7 47.0 40.9 50.0 45.0

+ Agreement Constraints 50.4 47.5 48.9 41.7 51.2 46.0

+ Pronoun-only Salience 52.2 53.0 52.6 44.3 57.3 50.0

Fully Supervised Model 53.0 70.3 60.4 53.1 70.5 60.6

Page 229: Unsupervised Models for                               Coreference Resolution

229

Experiments on System Mentions: Broadcast News (MUC F, CEAF F, CEAFV F) | Newswire (MUC F, CEAF F, CEAFV F)

Heuristic Baseline 36.4 48.4 46.3 43.2 54.2 50.3

Degenerate EM Baseline 48.0 39.4 35.8 36.8 27.9 26.3

Our EM-based Model 51.6 55.7 52.9 57.8 59.6 52.8

Haghighi and Klein Baseline 45.2 45.2 39.0 41.9 48.8 41.7

+ Relaxed Head Generation 47.0 47.5 42.3 45.0 52.6 46.3

+ Agreement Constraints 48.9 51.4 47.0 46.0 54.5 48.4

+ Pronoun-only Salience 52.6 54.7 51.1 50.0 57.4 51.2

Fully Supervised Model 60.4 61.8 59.9 60.6 64.5 60.6

MUC, CEAF, CEAF-Variant F-Scores

Page 230: Unsupervised Models for                               Coreference Resolution

230

Experiments on System Mentions: Broadcast News (MUC F, CEAF F, CEAFV F) | Newswire (MUC F, CEAF F, CEAFV F)

Heuristic Baseline 36.4 48.4 46.3 43.2 54.2 50.3

Degenerate EM Baseline 48.0 39.4 35.8 36.8 27.9 26.3

Our EM-based Model 51.6 55.7 52.9 57.8 59.6 52.8

Haghighi and Klein Baseline 45.2 45.2 39.0 41.9 48.8 41.7

+ Relaxed Head Generation 47.0 47.5 42.3 45.0 52.6 46.3

+ Agreement Constraints 48.9 51.4 47.0 46.0 54.5 48.4

+ Pronoun-only Salience 52.6 54.7 51.1 50.0 57.4 51.2

Fully Supervised Model 60.4 61.8 59.9 60.6 64.5 60.6

MUC, CEAF, CEAF-Variant F-Scores

Degenerate EM Baseline performs the worst

Page 231: Unsupervised Models for                               Coreference Resolution

231

Experiments on System Mentions: Broadcast News (MUC F, CEAF F, CEAFV F) | Newswire (MUC F, CEAF F, CEAFV F)

Heuristic Baseline 36.4 48.4 46.3 43.2 54.2 50.3

Degenerate EM Baseline 48.0 39.4 35.8 36.8 27.9 26.3

Our EM-based Model 51.6 55.7 52.9 57.8 59.6 52.8

Haghighi and Klein Baseline 45.2 45.2 39.0 41.9 48.8 41.7

+ Relaxed Head Generation 47.0 47.5 42.3 45.0 52.6 46.3

+ Agreement Constraints 48.9 51.4 47.0 46.0 54.5 48.4

+ Pronoun-only Salience 52.6 54.7 51.1 50.0 57.4 51.2

Fully Supervised Model 60.4 61.8 59.9 60.6 64.5 60.6

MUC, CEAF, CEAF-Variant F-Scores

EM-based Model outperforms Heuristic Baseline

Page 232: Unsupervised Models for                               Coreference Resolution

232

Experiments on System Mentions: Broadcast News (MUC F, CEAF F, CEAFV F) | Newswire (MUC F, CEAF F, CEAFV F)

Heuristic Baseline 36.4 48.4 46.3 43.2 54.2 50.3

Degenerate EM Baseline 48.0 39.4 35.8 36.8 27.9 26.3

Our EM-based Model 51.6 55.7 52.9 57.8 59.6 52.8

Haghighi and Klein Baseline 45.2 45.2 39.0 41.9 48.8 41.7

+ Relaxed Head Generation 47.0 47.5 42.3 45.0 52.6 46.3

+ Agreement Constraints 48.9 51.4 47.0 46.0 54.5 48.4

+ Pronoun-only Salience 52.6 54.7 51.1 50.0 57.4 51.2

Fully Supervised Model 60.4 61.8 59.9 60.6 64.5 60.6

MUC, CEAF, CEAF-Variant F-Scores

Addition of each extension yields improvements in F-score

Page 233: Unsupervised Models for                               Coreference Resolution

233

Experiments on System Mentions: Broadcast News (MUC F, CEAF F, CEAFV F) | Newswire (MUC F, CEAF F, CEAFV F)

Heuristic Baseline 36.4 48.4 46.3 43.2 54.2 50.3

Degenerate EM Baseline 48.0 39.4 35.8 36.8 27.9 26.3

Our EM-based Model 51.6 55.7 52.9 57.8 59.6 52.8

Haghighi and Klein Baseline 45.2 45.2 39.0 41.9 48.8 41.7

+ Relaxed Head Generation 47.0 47.5 42.3 45.0 52.6 46.3

+ Agreement Constraints 48.9 51.4 47.0 46.0 54.5 48.4

+ Pronoun-only Salience 52.6 54.7 51.1 50.0 57.4 51.2

Fully Supervised Model 60.4 61.8 59.9 60.6 64.5 60.6

MUC, CEAF, CEAF-Variant F-Scores

Extended H&K system performs comparably with EM-based model

Page 234: Unsupervised Models for                               Coreference Resolution

234

Experiments on System Mentions: Broadcast News (MUC F, CEAF F, CEAFV F) | Newswire (MUC F, CEAF F, CEAFV F)

Heuristic Baseline 36.4 48.4 46.3 43.2 54.2 50.3

Degenerate EM Baseline 48.0 39.4 35.8 36.8 27.9 26.3

Our EM-based Model 51.6 55.7 52.9 57.8 59.6 52.8

Haghighi and Klein Baseline 45.2 45.2 39.0 41.9 48.8 41.7

+ Relaxed Head Generation 47.0 47.5 42.3 45.0 52.6 46.3

+ Agreement Constraints 48.9 51.4 47.0 46.0 54.5 48.4

+ Pronoun-only Salience 52.6 54.7 51.1 50.0 57.4 51.2

Fully Supervised Model 60.4 61.8 59.9 60.6 64.5 60.6

MUC, CEAF, CEAF-Variant F-Scores

Unsupervised models lag performance of the supervised model

Page 235: Unsupervised Models for                               Coreference Resolution

235

Unsupervised Coreference as EM Clustering

Design a generative model that can be used to induce a clustering of the mentions in a given document

Exploit pairwise linguistic constraints: gender and number agreement, semantic compatibility, …

Page 236: Unsupervised Models for                               Coreference Resolution

236

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent

Facilitates the incorporation of pairwise linguistic constraints

[diagram: two 5 x 5 Boolean matrices over mentions 1-5, one a valid (transitive) clustering and one invalid]
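A minimal sketch of the matrix encoding and the validity check behind the Valid/Invalid contrast (not from the talk's code):

```python
# Sketch: a valid clustering matrix is reflexive, symmetric, and transitive.
def is_valid_clustering(C):
    n = len(C)
    for i in range(n):
        if not C[i][i]:                                   # reflexivity
            return False
        for j in range(n):
            if C[i][j] != C[j][i]:                        # symmetry
                return False
            for k in range(n):
                if C[i][j] and C[j][k] and not C[i][k]:   # transitivity
                    return False
    return True

valid = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]        # clusters {1, 2} and {3}
invalid = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]      # 1~2 and 2~3 but not 1~3
print(is_valid_clustering(valid), is_valid_clustering(invalid))  # True False
```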

Page 237: Unsupervised Models for                               Coreference Resolution

237

Strong Coreference Indicators

String match Alias (one is an acronym or abbreviation of the other) Appositive

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Proper, Common }

Use 7 linguistic features

Features

Page 238: Unsupervised Models for                               Coreference Resolution

238

Computing the E-step

Goal: assign a probability to each possible clustering of the mentions in a document

Page 239: Unsupervised Models for                               Coreference Resolution

239

Computing the E-step

Goal: assign a probability to each possible clustering of the mentions in a document

Computationally intractable: number of clusterings is exponential in the number of mentions

Page 240: Unsupervised Models for                               Coreference Resolution

240

Computing the E-step

Goal: assign a probability to each possible clustering of the mentions in a document

Computationally intractable: number of clusterings is exponential in the number of mentions

Search for the N most probable clusterings only

Page 241: Unsupervised Models for                               Coreference Resolution

241

Computing the E-step

Goal: assign a probability to each possible clustering of the mentions in a document

Computationally intractable: number of clusterings is exponential in the number of mentions

Search for the N most probable clusterings only, using Luo et al.’s (2004) search algorithm

structure the search space as a Bell tree

Page 242: Unsupervised Models for                               Coreference Resolution

242

A Bell Tree

[diagram: Bell tree over 3 mentions; root [1]; children of [1]: [12], [1][2]; children of [12]: [123], [12][3]; children of [1][2]: [13][2], [1][23], [1][2][3]]

Page 243: Unsupervised Models for                               Coreference Resolution

243

The Bell-Tree Search Algorithm

Finds the N most probable paths from the root to a leaf using a beam search

The probability of a clustering (or partition) is the probability assigned to the corresponding path

Page 244: Unsupervised Models for                               Coreference Resolution

244

Degenerate EM Baseline

model that is obtained after one iteration of EM
  initializes model parameters based on the labeled document
  applies the model (and Bell tree search) to obtain the most probable coreference partition

no parameter re-estimation on the unlabeled data

Page 245: Unsupervised Models for                               Coreference Resolution

245

Noun Phrase Coreference

Identify the noun phrases (or mentions) that refer to the same real-world entity

Partition the set of mentions into coreference equivalence classes

Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...

Page 246: Unsupervised Models for                               Coreference Resolution

246

Supervised Coreference Resolution

Lots of prior work on supervised coreference resolution: Soon et al. (2001), Strube et al. (2002), Yang et al. (2003), Luo et al. (2004), Denis and Baldridge (2007), …

Page 247: Unsupervised Models for                               Coreference Resolution

247

[diagram: a 5 x 5 Boolean clustering matrix over mentions 1-5, with 1s on the diagonal]

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent

Reflexivity: Cii = 1 for every mention i

Page 248: Unsupervised Models for                               Coreference Resolution

248

Approximating the E-step

Search for the N most probable clusterings only, using Luo et al.’s (2004) search algorithm, which
  structures the search space as a Bell tree
  takes as input the pairwise coreference probabilities
  scores a clustering based on these probabilities

Page 249: Unsupervised Models for                               Coreference Resolution

249

Haghighi and Klein’s Model

Cluster-level model
  assigns a cluster id to each mention
  ensures transitivity automatically

Nonparametric Bayesian model
  does not commit to a particular set of parameters

Page 250: Unsupervised Models for                               Coreference Resolution

250

Model Parameters

P(mp1, mp2, mp3 | c)
P(mp4, mp5, mp6 | c)
P(mp7 | c)

where the mp_i are the feature values and c ∈ { COREF, NOT COREF }

Page 251: Unsupervised Models for                               Coreference Resolution

251

Experimental Setup

The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
  each has a training set and a test set; evaluate on test set only

Mentions
  system mentions (mentions extracted by an NP chunker)
  perfect mentions (mentions extracted from answer key)

Scoring programs: recall, precision, F-measure
  MUC scoring program (Vilain et al., 1995)
  CEAF scoring program (Luo, 2005)
  CEAF variant: same as CEAF, but ignores singleton clusters

Page 252: Unsupervised Models for                               Coreference Resolution

252

Experimental Setup

The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
  each has a training set and a test set; evaluate on test set only

Mentions
  system mentions (mentions extracted by an NP chunker)
  perfect mentions (mentions extracted from answer key)

Page 253: Unsupervised Models for                               Coreference Resolution

253

Features

Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators

String match Appositive Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

Page 254: Unsupervised Models for                               Coreference Resolution

254

Features

Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators

String match Appositive Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

Page 255: Unsupervised Models for                               Coreference Resolution

255

Features

Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators

String match Appositive Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

Page 256: Unsupervised Models for                               Coreference Resolution

256

The Generative Model

Given a document D,
  generate a clustering C according to P(C)
  generate D given C

P(D, C) = P(C) P(D | C)
        = P(C) P(mp12, mp13, mp14, ... | C)
        = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij | Cij)
        = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij1, mpij2, ..., mpij7 | Cij)
        = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij1, mpij2, mpij3 | Cij) P(mpij4, mpij5, mpij6 | Cij) P(mpij7 | Cij)

where mpij is the feature vector describing mention pair (i, j) and Cij indicates whether mentions i and j are coreferent
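A minimal sketch of scoring one clustering under this factorization, working in log space (not from the talk's code); treating the P(C) factor as a per-pair Bernoulli with parameter p_coref is a simplifying assumption:

```python
# Sketch of scoring one clustering under the factored model (not from the talk).
import math

GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]

def clustering_log_prob(pair_features, C, p_coref, group_probs):
    """pair_features[(i, j)] -> 7-tuple of feature values
    C[(i, j)] -> 1 if mentions i and j are coreferent in this clustering
    group_probs[(g, values, c)] -> estimate of P(values | c) for group g"""
    total = 0.0
    for (i, j), x in pair_features.items():
        c = C[(i, j)]
        total += math.log(p_coref if c else 1.0 - p_coref)  # assumed P(C) factor
        for g in GROUPS:
            vals = tuple(x[k] for k in g)
            total += math.log(group_probs.get((g, vals, c), 1e-12))
    return total
```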

Page 257: Unsupervised Models for                               Coreference Resolution

257

The Induction Algorithm

Given a set of unlabeled documents
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)

3 mentions: 1, 2, 3

[123]

Page 258: Unsupervised Models for                               Coreference Resolution

258

The Induction Algorithm

Given a set of unlabeled documents
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)

3 mentions: 1, 2, 3

[123] [1][2][3]

Page 259: Unsupervised Models for                               Coreference Resolution

259

The Induction Algorithm

Given a set of unlabeled documents
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)

3 mentions: 1, 2, 3

[123]   [12][3]   [13][2]   [1][23]   [1][2][3]

Page 260: Unsupervised Models for                               Coreference Resolution

260

The Induction Algorithm

Given a set of unlabeled documents
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)

3 mentions: 1, 2, 3

[123] 0.23   [12][3] 0.32   [13][2] 0.11   [1][23] 0.29   [1][2][3] 0.05

Page 261: Unsupervised Models for                               Coreference Resolution

261

3 mentions: 1, 2, 3

The Induction Algorithm

Given a set of unlabeled documents
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)

[123] 0.23   [12][3] 0.32   [13][2] 0.11   [1][23] 0.29   [1][2][3] 0.05

Iterate till convergence

Page 262: Unsupervised Models for                               Coreference Resolution

262

The Induction Algorithm

Given a set of unlabeled documents
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)

3 mentions: 1, 2, 3

Iterate till convergence

How to cope with the computational complexity

of the E-step?

[123] 0.23   [12][3] 0.32   [13][2] 0.11   [1][23] 0.29   [1][2][3] 0.05
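A minimal sketch of the overall loop, with the E-step approximated by the N best clusterings (not from the talk's code); search_n_best and estimate_params are placeholders for the Bell-tree search and the maximum likelihood re-estimation:

```python
# Sketch of the EM induction loop with an N-best E-step (not from the talk).
def em_induction(documents, params, search_n_best, estimate_params,
                 iterations=10, N=50):
    for _ in range(iterations):
        posteriors = []
        for doc in documents:
            # E-step (approximate): the N most probable clusterings and
            # their normalized scores stand in for the full distribution.
            clusterings = search_n_best(doc, params, N)   # [(C, score), ...]
            z = sum(score for _, score in clusterings)
            posteriors.append([(C, score / z) for C, score in clusterings])
        # M-step: re-estimate parameters from the weighted clusterings.
        params = estimate_params(documents, posteriors)
    return params
```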

Page 263: Unsupervised Models for                               Coreference Resolution

263

Goals

Design a new model for unsupervised coreference resolution

Improve Haghighi and Klein’s model with three modifications

Page 264: Unsupervised Models for                               Coreference Resolution

264

Evaluation Results

Broadcast News: Recall: 53.1, Precision: 45.5, F-measure: 49.0

Newswire: Recall: 57.2, Precision: 50.3, F-measure: 53.5

Page 265: Unsupervised Models for                               Coreference Resolution

265

Evaluation Results

Broadcast News: Recall: 53.1, Precision: 45.5, F-measure: 49.0

Newswire: Recall: 57.2, Precision: 50.3, F-measure: 53.5

Can we improve performance by combining labeled and unlabeled data?

Page 266: Unsupervised Models for                               Coreference Resolution

266

Haghighi and Klein’s Generative Story

For each mention encountered in a document,
  generate a cluster id for the mention (according to some cluster id distribution)
  generate the head noun of the mention (according to some cluster-specific head distribution)

EM-based Generative Model vs. H&K’s Generative Model
  H&K’s model: for each mention, guess the cluster id according to P(cluster id), then generate the feature values
  EM-based model: create mention pairs; for each pair, guess whether it is COREF or NOT COREF according to P(COREF), then generate the feature values

Page 267: Unsupervised Models for                               Coreference Resolution

267

Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is, for some constant α:

P(id = i) = n_i / (Σ_j n_j + α)

where n_i is the number of mentions already in cluster i: higher probability for larger clusters

Page 268: Unsupervised Models for                               Coreference Resolution

268

Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is, for some constant α:

P(id = i) = n_i / (Σ_j n_j + α)

Probability of generating some new cluster id is:

P(new id) = α / (Σ_j n_j + α)

where n_i is the number of mentions already in cluster i: higher probability for larger clusters
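A minimal sketch of drawing a cluster id with these probabilities, in the Chinese-restaurant-process view of the Dirichlet process (not from the talk's code):

```python
# Sketch of CRP-style cluster-id sampling (not from the talk's code):
# existing cluster i has weight n_i, a brand-new cluster has weight alpha.
import random

def draw_cluster_id(cluster_sizes, alpha):
    """cluster_sizes: dict of cluster id -> mentions already in the cluster."""
    total = sum(cluster_sizes.values()) + alpha
    r = random.uniform(0, total)
    for cid, n in cluster_sizes.items():
        if r < n:
            return cid                            # existing cluster, weight n_i
        r -= n
    return max(cluster_sizes, default=0) + 1      # new cluster, weight alpha

random.seed(0)
print(draw_cluster_id({1: 2, 2: 2}, alpha=0.5))   # usually 1 or 2, rarely 3
```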

Page 269: Unsupervised Models for                               Coreference Resolution

269

The CEAF Scoring Program

Input: correct partition, system partition

Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching

Page 270: Unsupervised Models for                               Coreference Resolution

270

The CEAF Scoring Program

Input: correct partition, system partition

Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching

Find the best matching using the Hungarian Algorithm

Page 271: Unsupervised Models for                               Coreference Resolution

271

The CEAF Scoring Program

Input: correct partition, system partition

Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching

Find the best matching using the Hungarian Algorithm
  {1, 4, 9} ↔ {1, 9}: overlap 2
  {2, 7, 8} ↔ {2, 5, 8}: overlap 2
  {6, 11, 12} ↔ {6}: overlap 1
  {3, 5, 10} ↔ {3}: overlap 1

Page 272: Unsupervised Models for                               Coreference Resolution

272

The CEAF Scoring Program

Input: correct partition, system partition

Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching

Find the best matching using the Hungarian Algorithm
  {1, 4, 9} ↔ {1, 9}: overlap 2
  {2, 7, 8} ↔ {2, 5, 8}: overlap 2
  {6, 11, 12} ↔ {6}: overlap 1
  {3, 5, 10} ↔ {3}: overlap 1

Matching score = 2 + 2 + 1 + 1 = 6

Page 273: Unsupervised Models for                               Coreference Resolution

273

The CEAF Scoring Program

Input: correct partition, system partition

Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching

Find the best matching using the Hungarian Algorithm
  {1, 4, 9} ↔ {1, 9}: overlap 2
  {2, 7, 8} ↔ {2, 5, 8}: overlap 2
  {6, 11, 12} ↔ {6}: overlap 1
  {3, 5, 10} ↔ {3}: overlap 1

Matching score = 2 + 2 + 1 + 1 = 6

Recall = 6 / 9 = 0.66
Prec = 6 / 12 = 0.5
F-measure = 0.57
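A minimal sketch of entity-based CEAF on this example (not from the talk's code), using scipy's Hungarian algorithm implementation; it reproduces the matching score of 6:

```python
# Sketch of entity-based CEAF via the Hungarian algorithm (not from the talk).
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf(correct, system):
    overlap = np.array([[len(c & s) for s in system] for c in correct])
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    score = overlap[rows, cols].sum()
    recall = score / sum(len(c) for c in correct)
    precision = score / sum(len(s) for s in system)
    f = 2 * recall * precision / (recall + precision)
    return score, recall, precision, f

correct = [{3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}]
system = [{6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}]
print(ceaf(correct, system))   # score 6, R = 6/9, P = 6/12, F ≈ 0.57
```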

Page 274: Unsupervised Models for                               Coreference Resolution

274

Standard Supervised Learning Approach

Classification: given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent

create one training instance for each pair of mentions from a training text
  feature vector: describes the two mentions

Page 275: Unsupervised Models for                               Coreference Resolution

275

Standard Supervised Learning Approach

Classification: given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent

create one training instance for each pair of mentions from a training text
  feature vector: describes the two mentions

[Queen Elizabeth] set about transforming [her] [husband], ...

coref ?

not coref ?

coref ?

Page 276: Unsupervised Models for                               Coreference Resolution

276

Haghighi and Klein’s Generative Story

For each mention encountered in a document,

generate a cluster id for the mention (according to some cluster id distribution)

generate the head noun of the mention (according to some cluster-specific head distribution)

Page 277: Unsupervised Models for                               Coreference Resolution

277

Haghighi and Klein’s Generative Story

For each mention encountered in a document,

generate a cluster id for the mention (according to some cluster id distribution)

generate the head noun of the mention (according to some cluster-specific head distribution)

The probability of generating a particular cluster id is based on some distribution that specifies P(id=1), P(id=2), P(id=3), …
  but we don’t know the number of clusters a priori
  so we don’t know how many probabilities to specify: we need a distribution over an unknown number of clusters

Page 278: Unsupervised Models for                               Coreference Resolution

278

Dirichlet Process

Generate new cluster ids as needed

Page 279: Unsupervised Models for                               Coreference Resolution

279

Dirichlet Process

Generate new cluster ids as needed

Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment...

[cluster ids so far: Queen Elizabeth → 1, her → 1, husband → 2, King George VI → 2, Logue → ?]

Page 280: Unsupervised Models for                               Coreference Resolution

280

Dirichlet Process

Generate new cluster ids as needed

Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment...

[cluster ids so far: Queen Elizabeth → 1, her → 1, husband → 2, King George VI → 2, Logue → ?]

Should we generate id 1 or 2, or should we generate a new id 3?

Page 281: Unsupervised Models for                               Coreference Resolution

281

Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i

Page 282: Unsupervised Models for                               Coreference Resolution

282

Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i

higher probability for larger clusters

Page 283: Unsupervised Models for                               Coreference Resolution

283

Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i

Probability of generating some new cluster id is proportional to some constant α

higher probability for larger clusters