Coreference Resolution Seminar by Satpreet Arora (07D05003)


TRANSCRIPT

Page 1: Coreference Resolution Seminar by Satpreet Arora (07D05003)

Coreference Resolution

Seminar by

Satpreet Arora (07D05003)

Page 2:

What is Coreference?

In linguistics, coreference occurs when multiple expressions in a sentence or document refer to the same entity.

Example: Aditya went to Videorec to buy a DVD for himself. He had frequented the store for many years now.

Here, Aditya, himself and He are coreferent. Also, Videorec and the store are coreferent.

Page 3:

Coreference Resolution

Determine which entities in a document/discourse have the same referent.

In NLP, we usually deal with coreference resolution of NPs.

The coreference system has to form equivalence classes of NPs that have the same real-world entity as their referent.

The coreference relation is both transitive and symmetric.

Page 4:

A Coreference System

Input of a coreference system:

John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-service company’s president.

Output of a coreference system:

[JS John Simon], [JS Chief Financial Officer] of [PC Prime Corp.] since 1986, saw [JS his] pay jump 20%, to $1.3 million, as [JS the 37-year-old] also became the [PC financial-service company]’s [JS president]

Equivalence classes:
JS: {John Simon, Chief Financial Officer, his, the 37-year-old, president}
PC: {Prime Corp., financial-service company}

Page 5:

Anaphora Resolution

Anaphora refers to the linguistic phenomenon of having a noun phrase refer to a previously mentioned entity in a text for its semantic interpretation. In other words, a pair of NPs <npi, npj> constitutes an anaphoric relationship if i < j and npj depends on npi for its interpretation, where npk denotes the kth NP in a document. For instance, the NP pair <Queen Elizabeth, her> forms an anaphoric relationship in our example:

Queen Elizabeth set about transforming her husband

Page 6:

Coreference Resolution - an Important Problem

Applications include:

Question answering systems and Information Retrieval: Consider the query "Where was Mozart born?". A question answering system may first retrieve the sentence "He was born in Salzburg" from a document talking about Mozart. In this case, the system will return the correct answer if it can determine that the pronoun "He" is coreferent with Mozart.

Machine Translation: Anaphora resolution comes into play when discrepancies exist between the two languages with respect to anaphor selection. For example, a pronoun in the Malay language is often translated directly by its antecedent (Mitkov, 1999).

Page 7:

Coreference Resolution - an Important Problem (cont’d)

Text Summarization: Text summarization tools using coreference resolution not only include in the summary those sentences that contain a term appearing in the query; they also incorporate sentences containing a noun phrase that is coreferent with a term occurring in a sentence already selected by the system.

Cross-document coreference is particularly useful for text summarization systems that need to identify and merge the same piece of information about an entity mentioned in different documents in order to avoid repetition.

Page 8:

Coreference Resolution - a Hard Problem

The difficulty of the problem lies in its dependence on sophisticated semantic and world knowledge.

The policemen refused the women a permit for the demonstration because they feared violence.

The policemen refused the women a permit for the demonstration because they advocated violence.

Observe how they refers to two different entities in the two sentences depending on the context. It’s easy for humans but difficult for machines.

Page 9:

Coreference Resolution - a Hard Problem (cont’d)

Many sources of information play a role:

Lexical information such as head noun matches (as in Lord Spencer and Mr. Spencer) is an indicator of coreference, although it is not an absolute indicator (e.g. Lord Spencer and Diana Spencer are not coreferent).

Knowledge sources such as gender and number, semantic class, discourse focus, and world knowledge also play a role in determining whether two discourse entities are coreferent.

Page 10:

Coreference Resolution - a Hard Problem (cont’d)

No single source of knowledge is a completely reliable indicator:

For example, two semantically compatible NPs are potentially coreferent (e.g. Diana Spencer and the princess), but whether the NPs are actually coreferent depends on other factors (such as contextual information).

Linguistic constraints indicating (non-)coreference, such as number (dis)agreement, are not absolutely hard (e.g. the singular NP assassination (of her bodyguards) can be coreferent with the plural NP these murders).

Page 11:

Coreference Resolution - a Hard Problem (cont’d)

Coreference strategies differ depending on the type of NP:

Definite NPs are more likely to be anaphoric than their non-definite counterparts (e.g. the article immediately preceding photographer in the sentence “Diana saw the/a photographer following her secretly” determines whether the NP has an existential or definite reading).

Pronoun resolution is difficult because resolution strategies differ for each type of pronoun (e.g. reflexives versus possessives) and also because some pronouns, such as pleonastic pronouns, are semantically empty (e.g. the pronoun it in the sentence “Camilla went outside and it was raining” is pleonastic).

Page 12:

The Algorithm

A lot of different algorithms use different approaches to solve the problem. However, the non-machine-learning-based algorithms all share some basic components:
Step 1: Identification of discourse entities (NPs)
Step 2: Representation of NPs (as a set of features)
Step 3: Calculating distances between NPs using a distance metric
Step 4: Creating equivalence classes using a clustering algorithm or other classification tools
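This four-step pipeline can be sketched as a skeleton in Python; the stage implementations (`identify`, `featurize`, `dist`, `cluster`) are hypothetical placeholder callables, not part of any specific system:

```python
def resolve_coreference(text, identify, featurize, dist, cluster):
    """Skeleton of the four-step coreference pipeline.

    identify  -- Step 1: extract the NPs (discourse entities)
    featurize -- Step 2: map each NP to a feature representation
    dist      -- Step 3: distance metric between two representations
    cluster   -- Step 4: partition the NPs into equivalence classes
    """
    nps = identify(text)
    vectors = [featurize(np) for np in nps]
    # Pairwise distances over all NP pairs (i < j).
    distances = {(i, j): dist(vectors[i], vectors[j])
                 for i in range(len(nps))
                 for j in range(i + 1, len(nps))}
    return cluster(nps, distances)
```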

Page 13:

Identification of Discourse Entities
• For coreference resolution algorithms, the task is to identify all of the noun phrases in the text.
• These textual elements can be definite noun phrases, demonstrative noun phrases, proper names, appositives, sub-noun phrases that act as modifiers, pronouns and so on.
• The basic structure of the identification is as follows:

Page 14:

Identification of Discourse Entities (cont’d)

Tokenization and Morphological Processing: Splitting the text into sentences and stemming words to their root form.

POS Tagging: A Hidden Markov Model based statistical POS tagger.

Noun Phrase Identification: Determines noun phrase boundaries based on the POS tags.

Named Entity Recognition: May also be HMM based; learns from a tagged corpus of named entities. If there are overlaps, boundaries are adjusted.

Nested Noun Phrase Extraction: Accepts noun phrases and determines the nested phrases (if any).

Page 15:

Representation of NPs

A set of features is used to construct the feature vector:

individual words in the NP;
head noun: last word of the NP;
position in the document;
pronoun type: nominative, accusative, possessive, ambiguous;
article: indefinite, definite, none;
appositive: based on heuristics (commas, etc.);
number: plural, singular;
proper name: based on heuristics (capitalization, etc.);
semantic class: based on WordNet;
gender: masculine, feminine, either, neuter;
animacy: based on semantic class.
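This feature inventory maps naturally onto a record type; a sketch in Python (field names and value sets are an illustrative encoding, not the exact one used by any cited system):

```python
from dataclasses import dataclass

@dataclass
class NPFeatures:
    """Feature vector for one NP, following the inventory above."""
    words: tuple                 # individual words in the NP
    head_noun: str               # last word of the NP
    position: int                # position in the document
    pronoun_type: str = "none"   # nominative/accusative/possessive/ambiguous
    article: str = "none"        # indefinite/definite/none
    appositive: bool = False     # comma heuristics
    number: str = "singular"     # plural/singular
    proper_name: bool = False    # capitalization heuristics
    semantic_class: str = "unknown"  # e.g. from WordNet
    gender: str = "either"       # masculine/feminine/either/neuter
    animacy: bool = False        # derived from semantic class

def featurize(np_words, position):
    """Fill in the surface-level features that can be read off directly."""
    return NPFeatures(
        words=tuple(np_words),
        head_noun=np_words[-1],          # head noun = last word of the NP
        position=position,
        proper_name=all(w[0].isupper() for w in np_words),
    )
```

The deeper features (semantic class, gender, animacy) would be filled in from lexical resources such as WordNet rather than from the surface string.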

Page 16:

Representation of NPs (cont’d)

Page 17:

Calculating Distances Between NPs

Different algorithms use different distance metrics. We present here the one from Cardie and Wagstaff (1999), ‘Noun Phrase Coreference as Clustering’, and the corresponding clustering algorithm.

The distance between noun phrases NP1 and NP2 is defined as:

dist(NP1, NP2) = Σ_{f ∈ F} w_f × incompatibility_f(NP1, NP2)

where
F: the set of features;
w_f: the weight of feature f;
incompatibility_f: the degree of incompatibility between NP1 and NP2 with respect to f.
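A minimal sketch of this weighted sum in Python, assuming the caller supplies the weights and per-feature incompatibility functions (the toy feature set below is illustrative, not from the paper):

```python
def np_distance(np1, np2, weights, incompat):
    """dist(NP1, NP2) = sum over f in F of w_f * incompatibility_f(NP1, NP2).

    weights  -- maps each feature name f to its weight w_f
    incompat -- maps f to a function giving the degree of
                incompatibility between the two NPs for feature f
    """
    return sum(w * incompat[f](np1, np2) for f, w in weights.items())

# Toy example: head nouns match, genders clash.
weights = {"head": 1.0, "gender": 2.0}
incompat = {
    "head": lambda a, b: 0.0 if a["head"] == b["head"] else 1.0,
    "gender": lambda a, b: 0.0 if a["gender"] == b["gender"] else 1.0,
}
d = np_distance({"head": "Spencer", "gender": "masc"},
                {"head": "Spencer", "gender": "fem"},
                weights, incompat)  # 0.0 + 2.0 = 2.0
```

In the actual metric some weights act as hard constraints: an infinite distance marks two NPs as incompatible, which the clustering step below relies on.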

Page 18:

Calculating distances between NPs (cont’d)

Page 19:

Clustering

Properties of the clustering algorithm:

start from the end of the document and work backwards;

if the distance between two NPs is less than r, then their equivalence classes are considered for merging;

classes can be merged unless they contain incompatible NPs;

the algorithm automatically computes the transitive closure of the coreference relation;

two NPs can be coreferent even if dist(NP1, NP2) > r, as long as dist(NP1, NP2) ≠ ∞;

r is a free parameter of the algorithm.
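The backward clustering pass could be sketched as follows; this is a simplified reading of Cardie and Wagstaff's algorithm, not their exact implementation (classes merge when some cross pair is within radius r and no cross pair is incompatible, i.e. at infinite distance):

```python
def cluster_nps(nps, dist, r):
    """Greedy backward clustering of NPs into equivalence classes.

    nps  -- the NPs in document order
    dist -- distance metric; float("inf") marks incompatible NPs
    r    -- clustering radius (the free parameter)
    """
    INF = float("inf")
    classes = [{i} for i in range(len(nps))]   # one singleton class per NP

    def class_of(i):
        return next(c for c in classes if i in c)

    for j in range(len(nps) - 1, -1, -1):      # start from the end, work back
        for i in range(j - 1, -1, -1):         # candidate antecedents of j
            if dist(nps[i], nps[j]) < r:
                ci, cj = class_of(i), class_of(j)
                if ci is cj:
                    continue
                # Merge only if no cross pair is incompatible.
                if all(dist(nps[a], nps[b]) != INF
                       for a in ci for b in cj):
                    classes.remove(ci)
                    classes.remove(cj)
                    classes.append(ci | cj)
    return classes
```

Because merging unions whole classes, two NPs can end up coreferent without ever being directly within distance r of each other, which is exactly the transitive-closure property noted above.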

Page 20:

Clustering

Page 21:

Machine Learning Algorithms

In the algorithm we just saw, the weights of each feature are fixed manually.

Unlike manual approaches, machine learning approaches to coreference resolution automatically induce from annotated data, via learning algorithms, a model that determines the probability that two NPs are coreferent.

They can be characterized in terms of the knowledge sources employed, the method of training data creation, and the learning and clustering algorithms chosen.

Page 22:

Machine Learning Algorithms - Training Data Creation

Positive instances: Two different methods, from Aone et al. (1995), are used to create positive training instances:
1) Transitive: an instance is formed between an NP and each of its preceding NPs in the same anaphoric chain.
2) Non-transitive: an instance is formed between an NP and its closest preceding NP in the same anaphoric chain.

Negative instances:
1) Negative instances are generated by pairing an NP with each preceding NP that does not have an anaphoric relationship with it. (Aone et al., 1995)
2) To reduce the ratio of negative instances to positive instances, a negative instance is created by pairing an anaphoric NP, npj, with each NP appearing between npj and its closest preceding antecedent. (Soon et al., 2001)
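The Soon et al. scheme (non-transitive positives plus negatives drawn from the intervening mentions) can be sketched over mention indices; `chains` is a hypothetical input format listing each coreference chain as sorted mention positions:

```python
def soon_training_instances(chains):
    """Create training pairs in the style of Soon et al. (2001) (sketch).

    chains -- list of coreference chains, each a sorted list of
              mention indices in document order.
    Positive: each anaphoric mention with its closest preceding
    antecedent.  Negative: that anaphor paired with every mention
    strictly between the antecedent and the anaphor (none of which
    can belong to the same chain, by closeness of the antecedent).
    """
    positives, negatives = [], []
    for chain in chains:
        for k in range(1, len(chain)):
            antecedent, anaphor = chain[k - 1], chain[k]
            positives.append((antecedent, anaphor))
            for between in range(antecedent + 1, anaphor):
                negatives.append((between, anaphor))
    return positives, negatives
```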

Page 23:

Learning Algorithm

A lot of recent research involving machine learning techniques uses decision trees for classifying NP pairs. Soon et al. used the C5 tree classifier.

Page 24:

Learning Algorithm (cont’d)

Each pair of NPs is presented to the classifier, and the classifier returns a probability or certainty value of the pair being coreferent.

All pairs which receive a probability value greater than a threshold value are considered coreferent.

The algorithm then constructs the transitive closure of all the pairs and thus achieves partitioning of all the NPs into coreferent classes.
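A minimal sketch of this threshold-and-closure step, assuming the classifier's pairwise probabilities have already been computed (union-find yields the transitive closure of the above-threshold links):

```python
def link_and_close(n, pair_probs, threshold=0.5):
    """Partition n mentions given classifier scores on mention pairs.

    pair_probs -- maps index pairs (i, j) to the classifier's
                  probability that the pair is coreferent.
    Pairs above the threshold are linked; union-find then closes
    the links transitively into coreference classes.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for (i, j), p in pair_probs.items():
        if p > threshold:
            parent[find(i)] = find(j)      # link the two classes

    partitions = {}
    for i in range(n):
        partitions.setdefault(find(i), []).append(i)
    return sorted(partitions.values())
```

Note that the closure can link two mentions whose direct pair scored below the threshold, via a chain of above-threshold links.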

Page 25:

Clustering algorithms are also used in machine learning approaches.

The relative importance of all the factors discussed previously is learnt from the training corpus instead of being fixed manually. This allows for consideration of a larger number of factors.

In principle, learning-based systems are more robust than knowledge-based systems in the face of noisy input (where there are exceptions to the rules). Machine learning algorithms also adapt more easily to different topics.

Page 26:

Conclusion

Machine learning approaches to coreference resolution have been shown to be a promising way to build robust coreference resolution systems.

Despite the successful application of machine learning techniques to coreference resolution, the problem is far from being solved.

Linguistics combined with machine learning techniques can prove effective in solving the coreference problem.

Coreference resolution is one of the most difficult problems in language understanding. Given that NLP is often said to be “AI-complete” we might be able to conclude that coreference resolution is among the hardest of the hardest problems in artificial intelligence.

Page 27:

References

Vincent Ng (2002). Machine Learning for Coreference Resolution: Recent Successes and Future Challenges.

Cardie, Claire and Kiri Wagstaff (1999). Noun Phrase Coreference as Clustering.

Byron, D. and J. Tetreault (1999). A Flexible Architecture for Reference Resolution. In Proc. of the 9th EACL.

Vincent Ng (2008). Unsupervised Models for Coreference Resolution.

Wee Meng Soon, Daniel Chung Yong Lim, Hwee Tou Ng (2001), DSO National Laboratories. A Machine Learning Approach to Coreference Resolution of Noun Phrases.

http://www.cs.tau.ac.il/~nachumd/NLP/2010/Anaphora.pdf

http://www.inf.ed.ac.uk/teaching/courses/nlu/lectures/nlu_l16.pdf

http://www.dfki.de/~loeckelt/ss2010/presentations/coreference_resolution.pdf

Wikipedia

Page 28:

Questions