
Anaphora and coreference resolution for information extraction from molecular biology texts

BIOI 7791

April 7, 2005

© Kevin Cohen

Examples not otherwise cited are from:

• Mitkov, Ruslan (2002) Anaphora resolution, pp. 117-118. Pearson Education, Ltd.
• Kehler, Andrew (2002) Coherence, reference, and the theory of grammar.
• Kehler, Andrew (2000) Discourse. In Jurafsky and Martin (2000), Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition.
• Jackson, Peter; and Isabelle Moulinier (2002) Natural language processing for online applications: text retrieval, extraction and categorization, pp. 189-190.
• Manning, Christopher; and Heinrich Schuetze (1999) Foundations of statistical natural language processing.

How do you know what it means?

it

Pronouns

What people usually mean: personal, possessive, and reflexive

I, me, my, mine, myself
We, us, our, ours, ourselves
You, you, your, yours, yourself
You, you, your, yours, yourselves
He, him, his, his, himself
She, her, her, hers, herself
It, it, its, ---, itself
They, them, their, theirs, themselves

Pronouns

Others, too:
• Reciprocal (each other, one another)
• Demonstrative (this, that, these, those)
• Indefinites (every(body|one|thing), some(body|one|thing), any(body|one|thing), no(body|one|thing))
• One (one/ones)
• Wh-pronouns, quantifiers, others, the latter, etc.

Anaphor

• Has no meaning
• Meaning comes from an earlier string (the antecedent)

Reference vs. antecedence

• Referent: the entity that it “means”
• Antecedent: the string that “points to” that entity
• In practice…

Coreference

• Referring to the same entity
• 7037:12406888|TfR and TfR2 have similar cellular localizations in K562 cells and coimmunoprecipitate to only a very limited extent. Western analysis of the receptors under nonreducing conditions reveals that they can form heterodimers.

Coreference

• TfR and TfR2 have similar cellular localizations in K562 cells and Ø coimmunoprecipitate to only a very limited extent. Western analysis of the receptors under nonreducing conditions reveals that they can form heterodimers.

Coreferent anaphora (SUBJ antecedent, SUBJ pro)

• 328:12230304|APE/REF1 is increased & competent in the brain & spinal cord of individuals with amyotrophic lateral sclerosis. It is upregulated in spinal cord astrocytes & white matter pathways in familial ALS.

• 412:15009711|SSase is concentrated in lamellar bodies (LB), and secreted into the SC interstices, along with other LB-derived lipid hydrolases. There, it degrades CSO4, generating some cholesterol for the barrier.

Coreferent anaphora (SUBJ antecedent, SUBJ pro)

• 641:12826610|Human BLM interacts with both scDna2 and scFEN1. It may participate in the same steps of DNA replication or repair as scFEN1 & scDna2, acting as a molecular matchmaker at a crossroad between replication & repair.

• 650:12819188|expression of the mature BMP-2 protein is disregulated in the majority of NSCLC. BMP-2 enhancement of tumor cell migration and invasion, as well as stimulating tumor growth in vivo, suggests it has important biological activity in lung carcinomas.

• 840:12145703|Pro-CASP7 was detected in mitochondria, cytosol, nucleus, and microsomes of U937 cells. During TPA-induced differentiation, it moved to the mitochondria.

• 1230:14595653|neuronal CCR1 is not a generalized marker of neurodegeneration. Rather, it appears to be part of the neuroimmune response to Abeta42-positive neuritic plaques

Coreferent anaphora (SUBJ antecedent, SUBJ pro)

• 1641:15057976|DCX maps at Xq22.3 and is caused by a homozygous mutation. It acts during corticogenesis on radial migratory pathways.


Coreferent anaphora

• 19:11950847|Helical apolipoproteins stabilize ATP-binding cassette transporter A1 by protecting it from thiol protease-mediated degradation.

• 319:12907677|LTIP tailors CETP-mediated remodeling of HDL3 and HDL2 particles in subclass-specific ways, strongly implicating it as a regulator of HDL metabolism

Coreferent anaphora

• 321:12780348|hXB51 isoforms regulate Abeta generation differently, either enhancing it by modifying the association of X11L with APP or suppressing it in an X11L-independent manner

• 328:12569263|Repression of renin expression by intracellular calcium may be mediated by the calcium-induced translocation of Ref-1 to the nucleus, where it binds to the renin promoter nCaRE, to repress the transcription of the renin gene.

Coreferent anaphora

• 382:12902467|Localization of the GTP-bound form of ARF6 at the plasma membrane makes it a candidate marker for the identification of anergic T cells; T cells with distinct membrane localization of ARF6 are detected in peripheral blood of healthy individuals.

Noncoreferent anaphor

• Pleonastic
• Set/group membership
• Bound anaphora
• Antecedent is not a “thing”

My neighbor played the trumpet all night long. It drove me crazy.

816:11710563|Four distinct isoforms of CAMKII were isolated. Two of them were characterized as CaMKII alpha and beta subunits. Expression is developmentally regulated in both human fetal and adult brain to different degrees.

• Empty subject/object: 9:12692115|It is unlikely that the NAT1*10 or NAT2 rapid/intermediate genotypes are related to stomach cancer risk.

• 347:14596852|In a study comparing brains from Alzheimer's patients and controls, it was found that hippocampal apolipoprotein D level depends on Braak stage and APOE genotype.

• Cleft: It is sustained cyclin D1 expression at high cell density which correlates with v-Ras-induced focus-forming activity.
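One common way to screen these out before attempting resolution is a small set of surface patterns over the sentence (Castaño et al., discussed later, use regular expressions for exactly this step). A minimal sketch in Python, with illustrative patterns of my own rather than any system's actual rules:

import re

# Illustrative patterns for pleonastic "it"; these are assumptions of the
# sketch, not the rule set of any system discussed in this lecture.
PLEONASTIC_PATTERNS = [
    # Empty subject: "It is unlikely that ...", "it was found that ..."
    re.compile(r"\bit\s+(?:is|was)\s+\w+\s+that\b", re.IGNORECASE),
    # Extraposition with a participle: "it has been shown/reported that ..."
    re.compile(r"\bit\s+has\s+been\s+\w+\s+that\b", re.IGNORECASE),
    # Cleft: "It is sustained cyclin D1 expression ... which correlates ..."
    re.compile(r"\bit\s+(?:is|was)\s+.+\s+(?:which|that)\b", re.IGNORECASE),
]

def looks_pleonastic(sentence: str) -> bool:
    """True if the sentence matches one of the pleonastic-it patterns."""
    return any(p.search(sentence) for p in PLEONASTIC_PATTERNS)

print(looks_pleonastic("It is unlikely that the NAT1*10 genotype is related to risk."))      # True
print(looks_pleonastic("During TPA-induced differentiation, it moved to the mitochondria.")) # False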

Coreferent and pleonastic it in GeneRIFs

Nonanaphoric coreference

7037:12406888|TfR and TfR2 have similar cellular localizations in K562 cells and coimmunoprecipitate to only a very limited extent. Western analysis of the receptors under nonreducing conditions reveals that they can form heterodimers.

Nonanaphoric

Anaphor, cataphor, exophor

• Takes meaning from something else, and…
• Anaphor: its antecedent precedes it
• Cataphor: its antecedent follows it
• Exophor: its referent is “outside” of the discourse (you, I)

Cataphoric reference

• 116:11713978|Besides their presence and functions in the gut and the brain VIP and PACAP have distinct physiological roles in the genital tract.

Exophoric reference

• 580:11943588|It is structurally homologous to BRCA1 (it shares the conserved RING finger and BRCT domains); may be involved in tumor suppression because BARD1-BRCA1 complexes in ubiqutination of RNA Pol II and BARD1 interacts with CstF-50 (inhibiting mRNA processing).

Exophoric reference

• 3685:11872628|They analyzed the expression of these cell surface receptors in nine ovarian cancer cell lines and also in the primary human ovarian surface epithelial cell line (HOSE).

Exophoric reference

• 811:11891802|location was observed on or near the cell surface suggesting it might participate in surface membrane transport of iron

Exophoric reference

• 1102:12115502|decreased expression in prostate cancer associated with the difference in frequency of variant isoforms between normal and neoplastic prostate tissues places it in a pivotal role or possibly adjacent to a gene with that role in prostate cancer evolution

Exophoric reference in GeneRIFs

Anaphor, anaphora, anaphors...

• Anaphor: see earlier
• Anaphora: the phenomenon; plural?
• Anaphors: plural?

Finally, we know what the title means (almost)

• “resolving” an anaphor: finding its referent/antecedent

Why people care

• Classic: "text understanding"
• Information extraction, information retrieval, summarization…

…and now it’s time for some controversy

• Linguistic school of thought: syntax is hugely important
  – Chomsky’s “GB” (Government and Binding)

• AI school of thought: semantics/world knowledge is hugely important

Why people think syntax matters

• John kicked Bill. Mary told him to go home. (him = John)
• Bill was kicked by John. Mary told him to go home. (him = Bill)
• John kicked Bill. Mary punched him. (him = Bill)

• Grammatical role hierarchy: in the first two examples, him prefers the subject of the preceding sentence (John in the active version, Bill in the passive version).
• Grammatical role parallelism: in the third example, him is an object and prefers the entity in the parallel object position (Bill).

Why people think knowledge matters

The city council denied the demonstrators a permit because they {feared|advocated} violence.

Syntax can’t be all there is

• John hit Bill. He was severely injured.

• Margaret Thatcher admires Hillary Clinton, and George W. Bush absolutely worships her.

Since there’s no phonological distinction between O and IO pronouns, you sometimes have to wait and see whether another argument shows up before you know how to interpret the PRO.

• Krusty shot him.
• Krusty shot him a glance.

“With only one post-verbal argument, it’s a felony; with two, it’s non-verbal communication.”
--Erin Shay

Why I care

• As a developer: Information extraction

• As a researcher: Novel in this domain

• As a member of CCP:
  – Knowledge required
  – Could probably do it with DMAP

Motivation from information extraction

Src directly phosphorylates integrins and can also modulate R-Ras activity. Moreover, it stimulates the E-cadherin regulator Hakai, interacts with and phosphorylates the novel podosome-linked adaptor protein Fish, and progressively phosphorylates the gap junction component connexin 43. (Frame 2004, PMID 14996930)


How much, exactly, would you need to know?

Knowledge

The city council refused the marchers a permit because they {feared|advocated} violence.

If V = feared, then they = the city council.
If V = advocated, then they = the marchers.

Knowledge

• 890:11907280|found that cyclin A and cyclin E are able to regulate both nuclear and cytoplasmic events because they both shuttle between the nucleus and the cytoplasm

• Candidate antecedents:
  – cyclin A and cyclin E
  – nuclear and cytoplasmic events
  – cytoplasmic events

Knowledge

• 7015:12839932|Peptide nucleic acids are taken up into human tumor cells when they are components of a peptide fragment of this enzyme.

• Candidate antecedents:
  – peptide nucleic acids
  – human tumor cells

• 652:11867524|Bmp-4 only activates Dkk-1 when it concomitantly induces apoptosis. Implanted recombinant human Bmp-4 beads abolish Dkk-1 transcription in chick limb buds and mouse embryo cells.

• Different from the previous—not class-based, but instance-based…

• A protein complex has a structural organization

• A gene can have an exon
• A gene can have a polymorphism
• An important role cannot have an exon or a polymorphism
• A gene can have an interaction with a gene
• A mutation cannot have an interaction with a gene

So, how do people actually do this?

Uncontroversial part

• Three things you need to be able to do:
  – POS tagging
  – Shallow parsing
  – Number agreement

• If there’s only one potential antecedent, it’s easy…

• …but if there’s more than one, then you have to make a choice.
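A minimal sketch of that uncontroversial skeleton, assuming POS tagging and shallow parsing have already produced candidate noun phrases marked for number (all names here are hypothetical, not taken from any of the systems discussed below):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CandidateNP:
    text: str
    number: str  # "sg" or "pl", as produced by upstream tagging/chunking

PRONOUN_NUMBER = {"it": "sg", "he": "sg", "she": "sg", "they": "pl"}

def agrees(pronoun: str, np: CandidateNP) -> bool:
    """Number agreement; real systems also check gender and animacy."""
    return PRONOUN_NUMBER.get(pronoun.lower()) == np.number

def resolve(pronoun: str, candidates: List[CandidateNP]) -> Optional[CandidateNP]:
    """If exactly one agreeing candidate survives, the choice is easy;
    otherwise hand off to a ranking strategy (search, heuristics, ML)."""
    survivors = [np for np in candidates if agrees(pronoun, np)]
    return survivors[0] if len(survivors) == 1 else None

cands = [CandidateNP("TfR and TfR2", "pl"), CandidateNP("Western analysis", "sg")]
print(resolve("they", cands).text)   # only one plural candidate -> "TfR and TfR2"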

Two or three broad categories of solutions

• Search
• Heuristic
  – symbolic
  – numerical
• Machine learning

Rule-based

Explicit discourse model

How good people are at caring

• 70-80%
• …with caveats

What actually gets tackled in this paper?

• Evaluation set
• Exclusions

A search-based solution

• Hobbs 19??: Resolving pronoun references

Hobbs 19what??

• 1976: tech report
• 1978: published in Lingua
• 1986: reprinted in Grosz, Sparck Jones, and Webber, Readings in Natural Language Processing

Hobbs 1978

• Assessment of the difficulty of the problem
• Incidence of the phenomenon
• A simple algorithm that has become a baseline (sorta)
• Hobbs distance: the ith candidate NP considered by the algorithm is at a Hobbs distance of i

Caveat about Hobbs 1978

• It never happened.

Hobbs’s point

…the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for.

Hobbs’s point

Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.

(p. 345)

A parse tree

The NP vs. N’ distinction

A student of linguistics
A student of little promise
A student of linguistics of little promise
* A student of little promise of linguistics

The NP vs. N’ distinction

[A student [of linguistics]]
[A student] [of little promise]
[A student [of linguistics]] [of little promise]
* A student of little promise of linguistics

The NP vs. N’ distinction

[A student [of linguistics]N’]

[A studentN’] [of little promise]

[A student [of linguistics]N’] [of little promise]

* A student of little promise of linguistics

The NP vs. N’ distinction

[[A student [of linguistics]N’]NP]

[[A studentN’] [of little promise]NP]

[[A student [of linguistics]N’] [of little promise]NP]

* [A student of little promise of linguisticsNP]

The NP vs. N’ distinction

• Step (6) refers to NP (N”) and N’.
• Resolution of “Localization of the GTP-bound form of ARF6 at the plasma membrane makes it…”

Selectional restrictions

• Buildings don’t move.
• Dates don’t move.
• Cities don’t move.

The algorithm (my best guess)

• (1-3): conditional on intervening S or NP keeps you from positing driver for Fig. 1(b), but allows it for Fig. 1(a)

• (4): If you’re going to next sentence, you’re picking out subjects preferentially

• (5-6): conditional on path keeps you from positing driver for Fig. 1(b), but allows it for Fig. 1(a)

• (7): Prefer subject within the current sentence
• (8): If this is a cataphor, you need to look to the right, but don’t go into the next clause

The algorithm (was I right?)

See pp. 341-342

The algorithm: practice

• Kevin bought a cookie and ate it.
• Kevin bought a cookie. He ate it.
• Kevin bought a cookie. It was yummy.
• The dog found a biscuit and ate it.
• The dog found a biscuit. It ate it.
• The dog found a biscuit. It was yummy.

The algorithm: evaluation

• What actually gets tackled in this paper?
  – Evaluation set: he, she, it, they
  – Exclusions:

The algorithm: evaluation

• Corpus:
  – Early Civilization in China (book, non-fiction)
  – Wheels (book, fiction)
  – Newsweek (magazine, non-fiction)

• First 100 consecutive pronouns from each

The algorithm: results

• Overall, no selectional constraints: 88.3%

• Overall, with selectional constraints: 91.7%

• ????

The algorithm: results

• This is somewhat deceptive since in over half the cases there was only one nearby plausible antecedent. (p. 344)

• 132/300 times, there was a conflict
• 12/132 resolved by selectional constraints, 96/120 by the algorithm
• Thus (12 + 96)/132 = 81.8% of the conflicts were resolved by a combination of the algorithm and selection.


Hobbs’s point

…the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for.

Hobbs’s point

Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.

(p. 345)

Adaptation for shallow parse (Kehler et al.)

Shallow parse: lowest-level constituents only; for coref, “base NP’s” (noun and all modifiers to the left)

a good student of linguistics with long hair

The castle in Camelot remained the residence of the king until 536 when he moved it to London.

Adaptation for shallow parse

…noun groups are searched in the following order:

1. In current sentence, R->L, starting from L of PRO

2. In previous sentence, L->R
3. In S-2, L->R
4. In current sentence, L->R, starting from R of PRO

(Kehler et al. 2004:293)

Adaptation for shallow parse

1. In current sentence, R->L, starting from L of PRO

1. he: no AGR
2. 536: dates can’t move
3. the king: no AGR
4. the residence: OK!
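A minimal sketch of that search order over base noun groups, assuming the shallow parser supplies noun-group offsets per sentence; the acceptable filter is a stand-in for the agreement and selectional checks illustrated just above (no AGR, dates can't move):

from typing import Callable, List, Optional, Tuple

NounGroup = Tuple[int, str]  # (start offset within its sentence, text)

def candidate_order(sentences: List[List[NounGroup]],
                    pro_sent: int, pro_offset: int) -> List[str]:
    """Return candidates in the Kehler et al. search order:
    (1) current sentence, right-to-left, starting left of the pronoun;
    (2) previous sentence, left-to-right;
    (3) the sentence before that, left-to-right;
    (4) current sentence, left-to-right, starting right of the pronoun."""
    current = sentences[pro_sent]
    order: List[str] = []
    order += [t for off, t in sorted(current, reverse=True) if pro_offset > off]
    if pro_sent >= 1:
        order += [t for _, t in sorted(sentences[pro_sent - 1])]
    if pro_sent >= 2:
        order += [t for _, t in sorted(sentences[pro_sent - 2])]
    order += [t for off, t in sorted(current) if off > pro_offset]
    return order

def resolve(sentences, pro_sent, pro_offset,
            acceptable: Callable[[str], bool]) -> Optional[str]:
    """Take the first candidate that passes the agreement/selection filter."""
    for cand in candidate_order(sentences, pro_sent, pro_offset):
        if acceptable(cand):
            return cand
    return None

# "The castle in Camelot remained the residence of the king until 536
#  when he moved it to London."  Resolving "it":
sent = [(0, "The castle"), (14, "Camelot"), (31, "the residence"),
        (48, "the king"), (64, "536"), (74, "he")]
blocked = {"he", "the king", "536"}   # no AGR / dates can't move (illustrative)
print(resolve([sent], 0, 85, acceptable=lambda c: c not in blocked))   # "the residence"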

My modification for GeneRIFs

• Always start L->R, even in current sentence (unless reflexive, or even if reflexive??)

• Data:

Ge, Hale, and Charniak 1998

A statistical approach to anaphora resolution. Proceedings of the sixth workshop on very large corpora, pp. 161-170.

What can we take from this?

• Implementation details
  – modularity aids in analysis
  – a baseline: closest NP as referent (ignoring even AGR)
• Analysis
  – incrementalism, or at least modularity
• Reference
  – This seems to be the origin of the notion of the "Hobbs distance"

Approaches to the coref problem

• Search
• Rules
  – symbolic
  – numerical
• Machine learning

Hobbs 1978 (last time)

Ge, Hale, and Charniak (this time)

Focus on features

• What they are
• Why you might suspect they would be helpful (constraints, restrictions, preferences)
• Things to think about
  – How would the feature set differ if you were looking at non-pronominal coreference?
  – Are features getting at syntactic/semantic/discourse?

Syntactic/semantic/discourse-related features

• Syntactic
  – +/- PP
  – subj/obj
  – distance/path
  – gender/number AGR
• Semantic
  – verb ID
  – semantic class/entity identification
  – thematic role
  – MED
• Discourse
  – repeated mention

Not a partition...

John has an Integra. Bill has a Legend. Mary likes to drive it.

The distance feature

• The recency preference: "…entities introduced in recent utterances are more salient than those introduced from utterances further back."
  – John has an Integra. Bill has a Legend. Mary likes to drive it.

• Build syntactic parse tree

• Run HNA (the Hobbs naïve algorithm), but don’t stop until you’ve found 15 candidates

• ith candidate is at dH = i
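A minimal sketch of how that distance becomes a factor in the model: the probability that the correct antecedent sits at Hobbs distance d is just its relative frequency in training data annotated with correct antecedents. The counts below are invented for illustration:

from collections import Counter

def distance_model(training_distances):
    """P(true antecedent is at Hobbs distance d) = count(d) / total,
    estimated from a corpus annotated with correct antecedents."""
    counts = Counter(training_distances)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

# Hypothetical training data: the Hobbs distance of the true antecedent
# for each pronoun in an annotated corpus (made-up numbers).
train = [1] * 60 + [2] * 20 + [3] * 10 + [4] * 6 + [5] * 4
p_dist = distance_model(train)

# At test time the ith candidate returned by the Hobbs search is at dH = i,
# so candidate i contributes p_dist.get(i, small_default) as its distance factor.
print(p_dist[1])            # 0.6
print(p_dist.get(7, 1e-4))  # unseen distance: back off to a small default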

The agreement feature

• Constraint, but approximated probabilistically
• "The probability that a referent is in a particular gender class is just the relative frequency with which that referent is referred to by a pronoun p that is part of that gender class"
• Noisy:
  – my favorite aunt → she
  – my favorite book → it
  – P(she|aunt) = 1.0
  – P(she|favorite) = 0.5
  – Dunning test to find the most informative word

773:14534930|We report on a family with ataxia type 6 (SCA6) showing peculiar oculomotor symptoms. They carried the identical mutation (the number of expanded CAG repeat, 24) in the CACNA1A gene.
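A minimal sketch of that relative-frequency estimate; the tiny count table is invented purely to reproduce the aunt/book noise example above:

from collections import defaultdict

# counts[word][pronoun_class]: how often an NP containing `word` was referred
# to by a pronoun of that gender class in training data (invented numbers).
counts = defaultdict(lambda: defaultdict(int))
counts["aunt"]["she"] = 4          # my favorite aunt -> she
counts["book"]["it"] = 3           # my favorite book -> it
counts["favorite"]["she"] = 4      # noisy: "favorite" shows up in both
counts["favorite"]["it"] = 4

def p_gender(word: str, pronoun_class: str) -> float:
    """Relative frequency with which `word` is referred to by that class."""
    total = sum(counts[word].values())
    return counts[word][pronoun_class] / total if total else 0.0

print(p_gender("aunt", "she"))       # 1.0
print(p_gender("favorite", "she"))   # 0.5 -- the noise the Dunning test addresses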

Head word feature

• I picked up the book and sat in a chair. It broke.

• If verb is eat then referent is probably food, if verb is drink then referent is probably liquid, etc.

• John needed a car to get to his new job.

• He decided that he wanted something sporty.

• Bill went to the Acura dealership with him.

• He bought an Integra. (He = John)

The repeated mention feature

• "…entities that have been focused on in the prior discourse are more likely to continue to be focused on in subsequent discourse…"
  – John needed a car to get to his new job. He decided that he wanted something sporty. Bill went to the Acura dealership with him. He bought an Integra.

Evaluation

• Third-person singular pronouns (he, she, it) in various forms

• “Newswire” text, pre-parsed

Results

• 82.9% overall

• Hobbs distance: 65.3%
• …plus gender and animacy: 75.7%
• …plus head feature: 77.9%
• …plus repeated mention: 82.9%

This is knowledge!

So is this—but, is/has…

Carping

• Level of detail on inputs is not good
• Level of detail on results is not good
  – What if the "increments" were ordered differently? Were applied singly? Seems pretty important…
• Implementation detail is sometimes vague

What can we take from this?

• Implementation details
  – modularity aids in analysis
  – a baseline: closest NP as referent (ignoring even AGR)
• Analysis
  – incrementalism, or at least modularity
• Reference
  – This seems to be the origin of the notion of the "Hobbs distance"

What other people have said about this paper

• Jackson and Moulinier (2002)
• Mitkov (2002)

Jackson & Moulinier (2002)

An alternate approach to hand-written heuristic rules is to have a program learn preferences among antecedents from sample data. Researchers at Brown University used a corpus of Wall Street Journal articles marked with coreference information to build a probabilistic model for this problem. This model then informs an algorithm for finding antecedents for pronouns in unseen documents. The model considers the following factors when assigning probabilities to pronoun-antecedent pairings.

• Distance between the pronoun and the proposed antecedent, with greater distance lowering the probability. Hobbs's Naïve Algorithm is used to gather candidate antecedents, which are then rank ordered by distance. The probability that the correct antecedent lies at distance rank d from the pronoun is then computed from corpus statistics as (the number of correct antecedents at distance d) / (the total number of correct antecedents).
• Mention count. Noun phrases that are mentioned repeatedly are preferred as antecedents. As well as counting mentions of referents, the authors make an adjustment for the position of the pronoun in the document. The later in the document a pronoun occurs, the more likely it is that its referent will have been mentioned multiple times.
• Syntactic analysis of the context surrounding the pronoun, especially where reflexive pronouns are concerned. Preferences for antecedents in the subject position and special treatment of reflexive pronouns are supplied by the Hobbs algorithm.
• Semantic distinctions, such as number, gender, and animate/inanimate, which make certain pairings unlikely or impossible. Given a training corpus of correct antecedents, counts can be obtained for such semantic features.

The probability that a pronoun corefers with a given antecedent is then computed as a function of these four factors, and the winning pair is the one that maximizes the probability of assignment. The authors performed an experiment to test the accuracy of the model on the singular pronouns ('he,' 'she,' and 'it'), and their various possessive and reflexive forms ('his,' 'hers,' 'its,' 'himself,' 'herself,' 'itself'). They implemented their model in an incremental fashion, enabling the contribution of the various factors to be analyzed. The results were quite interesting, and can be summarized as follows. Ignoring Hobbs's algorithm, and simply choosing the closest noun phrase as the referent, had a success rate of only 43%. Using the syntactic analysis afforded by the Naïve Algorithm increased accuracy to 65%. Adding semantic information, such as gender, raised the success rate to 76%. Adding additional information, such as mention counts, obtained a final increment to 83%. In restricting themselves to singular pronouns with concrete referents, the authors set out to solve a simpler problem than that addressed by CogNIAC [a rule-based system discussed earlier in the chapter], but the results are still impressive. These are very common usages, and there is considerable utility for text mining in being able to analyze them accurately. (pp. 189-191)

Jackson & Moulinier (2002)

• They never mention the head feature

• They overstate the whole "syntactic analysis" thing—it's not a feature per se, except to the extent that it's included in the Hobbs distance and Hobbs distance is actually helpful

Mitkov (2002)

Ge, Hale and Charniak (1998) propose a statistical framework for resolution of third person anaphoric pronouns. They combine various anaphora resolution factors into a single probability which is used to track down the antecedent. The program does not rely on hand-crafted rules but instead uses the Penn Wall Street Journal Treebank to train the probabilistic model.

The first factor the authors make use of is the distance between the pronoun and the candidate for an antecedent. The greater this distance, the lower the probability for a candidate NP to be the antecedent. The so-called 'Hobbs's distance' measure is used in the following way. Hobbs's algorithm is run for each pronoun until it has proposed N (in this case N = 15) candidates. The Kth candidate is regarded as occurring at 'Hobbs's distance' = K. Ge and co-workers rely on features such as gender, number and animacy of the proposed antecedent. Given the words contained in an NP, they compute the probability that this NP is the antecedent of the pronoun under consideration based on probabilities computed over the training data, which are marked with coreferential links. The authors also make use of co-occurrence patterns by computing the probability that a specific candidate occurs in the same syntactic function (e.g. object) as the anaphor. The last factor employed is the mention count of the candidate. Noun phrases that are mentioned more frequently have a higher probability of being the antecedent; the training corpus is marked with the number of times an NP is mentioned up to each specific point.

The four probabilities discussed above are multiplied together for each candidate NP. The procedure is repeated for each NP and the one with the highest probability is selected as the antecedent. For more on the probabilistic model and the formulae used, see Ge et al. (1998).

The authors investigated the relative importance of each of the above four probabilities (factors employed) in pronoun resolution. To this end, they ran the program 'incrementally,' each time incorporating one more probability. Using only Hobbs's distance yielded an accuracy of 65.3%, whereas the lexical information about gender and animacy brought the accuracy up to 75.7%, highlighting the latter factor as quite significant. The reason the accuracy using Hobbs's algorithm was lower than expected was that the Penn Treebank did not feature perfect representations of Hobbs's trees. Contrary to initial expectations, knowledge about the governing constituent (co-occurrence patterns) did not make a significant contribution, only raising the accuracy to 77.9%. One possible explanation could be that selectional restrictions are not clear-cut in many cases; in addition, some of the verbs in the corpus such as is and has were not 'selective' enough. Finally, counting each candidate proved to be very helpful, increasing the accuracy to 82.9%.
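A minimal sketch of the multiplicative combination both reviews describe: each candidate NP gets the product of the four factor probabilities, and the candidate with the highest product wins. The stub models and numbers below are placeholders standing in for the trained distance, agreement, head-word, and mention-count models, not the authors' actual formulae:

from math import prod  # Python 3.8+

def score(candidate, pronoun, context,
          p_distance, p_agreement, p_head, p_mention):
    """Ge/Hale/Charniak-style score: the product of the four factor
    probabilities.  Each p_* argument is a trained model (any callable
    returning a probability will do for this sketch)."""
    return prod([
        p_distance(candidate, pronoun, context),
        p_agreement(candidate, pronoun),
        p_head(candidate, pronoun, context),
        p_mention(candidate, context),
    ])

def resolve(candidates, pronoun, context, **models):
    """Pick the candidate NP with the highest combined probability."""
    return max(candidates, key=lambda c: score(c, pronoun, context, **models))

# Stub models with made-up numbers, just to show the plumbing.
winner = resolve(
    ["cyclin A and cyclin E", "nuclear and cytoplasmic events"],
    pronoun="they", context=None,
    p_distance=lambda c, p, ctx: 0.3 if "events" in c else 0.2,
    p_agreement=lambda c, p: 1.0,                            # both candidates are plural
    p_head=lambda c, p, ctx: 0.6 if "cyclin" in c else 0.1,  # "shuttle" prefers proteins
    p_mention=lambda c, ctx: 0.5,
)
print(winner)   # "cyclin A and cyclin E"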

Mitkov

• He says that they looked for parallelism—I don't see it.

Two or three broad categories of solutions

• Search
• Heuristic
  – symbolic
  – numerical
• Machine learning

Rule-based

Explicit discourse model

CogNIAC

• 328:12230304|APE/REF1 is increased & competent in the brain & spinal cord of individuals with amyotrophic lateral sclerosis. It is upregulated in spinal cord astrocytes & white matter pathways in familial ALS.

Candidate antecedent/pronoun pairs:
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it

The rules, applied in order:
• “Unique in discourse?” No.
• “If pronoun is reflexive (itself), pick most recent…” Vacuous.
• “Unique in current and prior sentences?” No.
• “Possessive?” No.
• “Unique in current sentence?” No.
• “Is there a possible match in subject position of the preceding sentence, and is the pronoun the subject of its sentence?” Yes!
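A minimal sketch of a CogNIAC-style cascade: high-precision rules tried in order, answering as soon as one fires and abstaining otherwise. The rules below paraphrase the ones walked through above (the possessive rule is omitted), and the data structures are assumptions of this sketch, not the original CogNIAC implementation:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mention:
    text: str
    sent: int          # sentence index; mentions are listed in document order
    is_subject: bool
    agrees: bool       # number/gender/animacy agreement with the pronoun

def cogniac_style(pronoun_sent: int, pronoun_is_subject: bool,
                  pronoun_is_reflexive: bool,
                  candidates: List[Mention]) -> Optional[str]:
    """High-precision rules in order; answer at the first one that fires,
    abstain (return None) if none does."""
    ok = [c for c in candidates if c.agrees]
    if len(ok) == 1:                                     # unique in discourse
        return ok[0].text
    if pronoun_is_reflexive and ok:                      # reflexive: most recent
        return ok[-1].text
    nearby = [c for c in ok if c.sent >= pronoun_sent - 1]
    if len(nearby) == 1:                                 # unique in current + prior
        return nearby[0].text
    current = [c for c in ok if c.sent == pronoun_sent]
    if len(current) == 1:                                # unique in current sentence
        return current[0].text
    prior_subjects = [c for c in ok
                      if c.sent == pronoun_sent - 1 and c.is_subject]
    if pronoun_is_subject and len(prior_subjects) == 1:  # subject-to-subject
        return prior_subjects[0].text
    return None

# The APE/REF1 GeneRIF: "it" is the subject of sentence 1; APE/REF1 is the
# subject of sentence 0, and several other singular NPs also agree.
cands = [Mention("APE/REF1", 0, True, True),
         Mention("the brain", 0, False, True),
         Mention("spinal cord", 0, False, True),
         Mention("amyotrophic lateral sclerosis", 0, False, True)]
print(cogniac_style(1, True, False, cands))   # "APE/REF1"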

Two or three broad categories of solutions

• Search
• Heuristic
  – symbolic
  – numerical
• Machine learning

Rule-based

Explicit discourse model

A numerical/heuristic solution

• Castaño, Zhang, and Pustejovsky: Anaphora resolution in biomedical literature

• (Pre)processing and resources:
  – POS tagging
  – shallow parsing
  – acronym definition
  – semantic information from UMLS
  – pleonastic detection with regex

• Find the set of possible antecedents
• If only one is possible, then pick it
• If more than one, then assign each a score (sketched below):
  – Start with zero
  – Negative numbers are penalties
  – Highest number wins
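A minimal sketch of that zero-start, bonus-and-penalty scoring; the features and weights below are illustrative stand-ins, not Castaño et al.'s actual factors or values:

from typing import Dict, List

def score_candidate(features: Dict[str, bool]) -> int:
    """Start at zero and apply bonuses and penalties per feature.
    The features and weights are invented for illustration only."""
    score = 0
    if features.get("number_agreement"):
        score += 2
    if features.get("same_umls_type"):          # semantic match from UMLS
        score += 2
    if features.get("is_subject"):
        score += 1
    if not features.get("in_current_or_prev_sentence", True):
        score -= 2                              # distance penalty
    if features.get("inside_prepositional_phrase"):
        score -= 1
    return score

def pick(candidates: List[Dict]) -> Dict:
    """Highest score wins (ties broken by list order)."""
    return max(candidates, key=lambda c: score_candidate(c["features"]))

cands = [
    {"text": "peptide nucleic acids",
     "features": {"number_agreement": True, "same_umls_type": True, "is_subject": True}},
    {"text": "human tumor cells",
     "features": {"number_agreement": True, "inside_prepositional_phrase": True}},
]
print(pick(cands)["text"])   # "peptide nucleic acids"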

A numerical/heuristic sol’n

Evaluation

• Test set: 46 abstracts, 116 examples

• P = .77, R = .71, F-measure = .738

Numerical/heuristic results

Andy Kehler’s big question about anaphora

If anaphora is hard to resolve, why do we use it so much??

References

• Mitkov, Ruslan (2002) Anaphora resolution, pp. 117-118. Pearson Education, Ltd.
• Kehler, Andrew (2002) Coherence, reference, and the theory of grammar.
• Kehler, Andrew (2000) Discourse. In Jurafsky and Martin (2000), Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition.
• Jackson, Peter; and Isabelle Moulinier (2002) Natural language processing for online applications: text retrieval, extraction and categorization, pp. 189-190.
• Manning, Christopher; and Heinrich Schuetze (1999) Foundations of statistical natural language processing.

Oddities of GRIFs with respect to coreference resolution

• Impose a very small scope on the search space
• Exophoric 3rd-person pronouns
• Weird notion of “sentence” in written language
• All NP’s are pretty recent
• 255 chars don’t let you repeat much

Next step…

• We’ve seen how to “ground” coreferential nouns to the text… How about grounding them to the world?

• Morgan (2005)