Anaphora and coreference resolution for information extraction from molecular biology texts
BIOI 7791
April 7, 2005 © Kevin Cohen
Examples not otherwise cited are from:
• Mitkov, Ruslan (2002) Anaphora resolution pp. 117-118 Pearson Education, Ltd.
• Kehler, Andrew (2002) Coherence, reference, and the theory of grammar
• Kehler, Andrew (2000) Discourse. In Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition (Jurafsky and Martin 2000)
• Jackson, Peter; and Isabelle Moulinier (2002) Natural language processing for online applications: text retrieval, extraction and categorization pp. 189-190
• Manning, Christopher; and Heinrich Schuetze (1999) Foundations of statistical natural language processing
Pronouns
What people usually mean: personal, possessive, and reflexive
• I, me, my, mine, myself
• We, us, our, ours, ourselves
• You, you, your, yours, yourself
• You, you, your, yours, yourselves
• He, him, his, his, himself
• She, her, her, hers, herself
• It, it, its, ---, itself
• They, them, their, theirs, themselves
Pronouns
Others, too
• Reciprocal (each other, one another)
• Demonstrative (this, that, these, those)
• Indefinites (every(body|one|thing), some(body|one|thing), any(body|one|thing), no(body|one|thing))
• One (one/ones)
• Wh-pronouns, quantifiers, others, the latter, etc.
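The pipe notation above already reads as a regular expression; a minimal sketch of matching the indefinites that way (the names and pattern are illustrative, not from the slides):

```python
import re

# The indefinite pronouns written out as the slide's
# (every|some|any|no)(body|one|thing) pattern.
INDEFINITE = re.compile(r"\b(?:every|some|any|no)(?:body|one|thing)\b")

def find_indefinites(text):
    """Return all indefinite pronouns found in the (lowercased) text."""
    return INDEFINITE.findall(text.lower())
```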
Reference vs. antecedence
• Referent: entity that it “means”
• Antecedent: string that “points to” that entity
• In practice…
Coreference
• Referring to the same entity
• 7037:12406888|TfR and TfR2 have similar cellular localizations in K562 cells and coimmunoprecipitate to only a very limited extent. Western analysis of the receptors under nonreducing conditions reveals that they can form heterodimers.
Coreference
• TfR and TfR2 have similar cellular localizations in K562 cells and Ø coimmunoprecipitate to only a very limited extent. Western analysis of the receptors under nonreducing conditions reveals that they can form heterodimers.
Coreferent anaphora (SUBJ antecedent, SUBJ pro)
• 328:12230304|APE/REF1 is increased & competent in the brain & spinal cord of individuals with amyotrophic lateral sclerosis. It is upregulated in spinal cord astrocytes & white matter pathways in familial ALS.
• 412:15009711|SSase is concentrated in lamellar bodies (LB), and secreted into the SC interstices, along with other LB-derived lipid hydrolases. There, it degrades CSO4, generating some cholesterol for the barrier
Coreferent anaphora (SUBJ antecedent, SUBJ pro)
• 641:12826610|Human BLM interacts with both scDna2 and scFEN1. It may participate in the same steps of DNA replication or repair as scFEN1 & scDna2, acting as a molecular matchmaker at a crossroad between replication & repair.
• 650:12819188|expression of the mature BMP-2 protein is disregulated in the majority of NSCLC. BMP-2 enhancement of tumor cell migration and invasion, as well as stimulating tumor growth in vivo, suggests it has important biological activity in lung carcinomas.
• 840:12145703|Pro-CASP7 was detected in mitochondria, cytosol, nucleus, and microsomes of U937 cells. During TPA-induced differentiation, it moved to the mitochondria.
• 1230:14595653|neuronal CCR1 is not a generalized marker of neurodegeneration. Rather, it appears to be part of the neuroimmune response to Abeta42-positive neuritic plaques
Coreferent anaphora (SUBJ antecedent, SUBJ pro)
• 1641:15057976|DCX maps at Xq22.3 and is caused by a homozygous mutation. It acts during corticogenesis on radial migratory pathways.
Coreferent anaphora
• 19:11950847|Helical apolipoproteins stabilize ATP-binding cassette transporter A1 by protecting it from thiol protease-mediated degradation.
• 319:12907677|LTIP tailors CETP-mediated remodeling of HDL3 and HDL2 particles in subclass-specific ways, strongly implicating it as a regulator of HDL metabolism
Coreferent anaphora
• 321:12780348|hXB51 isoforms regulate Abeta generation differently, either enhancing it by modifying the association of X11L with APP or suppressing it in an X11L-independent manner
• 328:12569263|Repression of renin expression by intracellular calcium may be mediated by the calcium-induced translocation of Ref-1 to the nucleus, where it binds to the renin promoter nCaRE, to repress the transcription of the renin gene.
Coreferent anaphora
• 382:12902467|Localization of the GTP-bound form of ARF6 at the plasma membrane makes it a candidate marker for the identification of anergic T cells; T cells with distinct membrane localization of ARF6 are detected in peripheral blood of healthy individuals.
Noncoreferent anaphor
• Pleonastic
• Set/group membership
• Bound anaphora
• Antecedent is not a “thing”
My neighbor played the trumpet all night long. It drove me crazy.
816:11710563|Four distinct isoforms of CAMKII were isolated. Two of them were characterized as CaMKII alpha and beta subunits. Expression is developmentally regulated in both human fetal and adult brain to different degrees
• Empty subject/object: 9:12692115|It is unlikely that the NAT1*10 or NAT2 rapid/intermediate genotypes are related to stomach cancer risk.
• 347:14596852|In a study comparing brains from Alzheimer's patients and controls, it was found that hippocampal apolipoprotein D level depends on Braak stage and APOE genotype.
• Cleft: It is sustained cyclin D1 expression at high cell density which correlates with v-Ras-induced focus-forming activity.
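Pleonastic and cleft uses like these are what the regex-based filters mentioned later (under Castaño et al.) try to catch before resolution starts. A sketch with illustrative patterns; real systems use much larger pattern sets:

```python
import re

# Illustrative patterns for pleonastic "it": empty subjects
# ("it is unlikely that...", "it was found that...").
PLEONASTIC = re.compile(
    r"\bit\s+(?:is|was)\s+"
    r"(?:\w+ly\s+)?"  # optional adverb, e.g. "it is highly unlikely"
    r"(?:unlikely|likely|possible|found|shown|known|clear)\b"
    r"(?:\s+that\b)?",
    re.IGNORECASE)

def is_pleonastic(clause):
    """True if the clause matches an empty-subject 'it' pattern."""
    return bool(PLEONASTIC.search(clause))
```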
Nonanaphoric coreference
7037:12406888|TfR and TfR2 have similar cellular localizations in K562 cells and coimmunoprecipitate to only a very limited extent. Western analysis of the receptors under nonreducing conditions reveals that they can form heterodimers.
Nonanaphoric
Anaphor, cataphor, exophor
• Takes meaning from something else, and…
• Anaphor: referent precedes it (antecedent)
• Cataphor: referent follows it
• Exophor: referent "outside" of discourse (you, I)
Cataphoric reference
• 116:11713978|Besides their presence and functions in the gut and the brain VIP and PACAP have distinct physiological roles in the genital tract.
Exophoric reference
• 580:11943588|It is structurally homologous to BRCA1 (it shares the conserved RING finger and BRCT domains); may be involved in tumor suppression because BARD1-BRCA1 complexes in ubiqutination of RNA Pol II and BARD1 interacts with CstF-50 (inhibiting mRNA processing).
Exophoric reference
• 3685:11872628|They analyzed the expression of these cell surface receptors in nine ovarian cancer cell lines and also in the primary human ovarian surface epithelial cell line (HOSE).
Exophoric reference
• 811:11891802|location was observed on or near the cell surface suggesting it might participate in surface membrane transport of iron
Exophoric reference
• 1102:12115502|decreased expression in prostate cancer associated with the difference in frequency of variant isoforms between normal and neoplastic prostate tissues places it in a pivotal role or possibly adjacent to a gene with that role in prostate cancer evolution
Anaphor, anaphora, anaphors...
• Anaphor: see earlier
• Anaphora: the phenomenon; plural?
• Anaphors: plural?
Finally, we know what the title means (almost)
• “resolving” an anaphor: finding its referent/antecedent
Why people care
• Classic: "text understanding"
• Information extraction, information retrieval, summarization…
…and now it’s time for some controversy
• Linguistic school of thought: syntax is hugely important
– Chomsky’s “GB” (Government and Binding)
• AI school of thought: semantics/world knowledge is hugely important
Why people think syntax matters
• John kicked Bill. Mary told him to go home.
• Bill was kicked by John. Mary told him to go home.
• John kicked Bill. Mary punched him.
Why people think syntax matters
• John kicked Bill. Mary told him to go home.
• Bill was kicked by John. Mary told him to go home.
• John kicked Bill. Mary punched him.
John
Why people think syntax matters
• John kicked Bill. Mary told him to go home.
• Bill was kicked by John. Mary told him to go home.
• John kicked Bill. Mary punched him.
Bill
Why people think syntax matters
• John kicked Bill. Mary told him to go home.
• Bill was kicked by John. Mary told him to go home.
• John kicked Bill. Mary punched him.
Bill
Why people think syntax matters
• John kicked Bill. Mary told him to go home.
• Bill was kicked by John. Mary told him to go home.
• John kicked Bill. Mary punched him.
Grammatical role hierarchy
Why people think syntax matters
• John kicked Bill. Mary told him to go home.
• Bill was kicked by John. Mary told him to go home.
• John kicked Bill. Mary punched him.
Grammatical role parallelism
Why people think knowledge matters
The city council denied the demonstrators a permit because they {feared|advocated} violence.
Why people think knowledge matters
The city council denied the demonstrators a permit because they {feared|advocated} violence.
Why people think knowledge matters
The city council denied the demonstrators a permit because they {feared|advocated} violence.
Since there’s no phonological distinction between O and IO pronouns, you sometimes have to wait and see whether another argument shows up before you know how to interpret the PRO
• Krusty shot him.
• Krusty shot him a glance.
“With only one post-verbal argument, it’s a felony; with two, it’s non-verbal communication.”
--Erin Shay
Why I care
• As a developer: Information extraction
• As a researcher: Novel in this domain
• As a member of CCP:
– Knowledge required
– Could probably do it with DMAP
Motivation from information extraction
Src directly phosphorylates integrins and can also modulate R-Ras activity. Moreover, it stimulates the E-cadherin regulator Hakai, interacts with and phosphorylates the novel podosome-linked adaptor protein Fish, and progressively phosphorylates the gap junction component connexin 43. (Frame 2004, PMID 14996930)
Knowledge
The city council refused the marchers a permit because they {feared|advocated} violence.
If V = feared, then they = the city council
If V = advocated, then they = the marchers
Knowledge
• 890:11907280|found that cyclin A and cyclin E are able to regulate both nuclear and cytoplasmic events because they both shuttle between the nucleus and the cytoplasm
• Candidate antecedents:
– cyclin A and cyclin E
– nuclear and cytoplasmic events
– cytoplasmic events
Knowledge
• 7015:12839932|Peptide nucleic acids are taken up into human tumor cells when they are components of a peptide fragment of this enzyme.
• Candidate antecedents:
– peptide nucleic acids
– human tumor cells
• 652:11867524|Bmp-4 only activates Dkk-1 when it concomitantly induces apoptosis. Implanted recombinant human Bmp-4 beads abolish Dkk-1 transcription in chick limb buds and mouse embryo cells.
• Different from the previous—not class-based, but instance-based…
• A protein complex has a structural organization
• A gene can have an exon
• A gene can have a polymorphism
• An important role cannot have an exon or a polymorphism
• A gene can have an interaction with a gene
• A mutation cannot have an interaction with a gene
Uncontroversial part
• Three things you need to be able to do:
– POS tagging
– Shallow parsing
– Number agreement
• If there’s only one potential antecedent, it’s easy…
• …but if there’s more than one, then you have to make a choice.
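A minimal sketch of this "uncontroversial part": filter candidate antecedents by number agreement, and resolve trivially when only one candidate survives. The crude `number()` heuristic and the names are illustrative, not from any cited system:

```python
PLURAL_PRONOUNS = {"they", "them", "their", "theirs", "themselves"}

def number(np):
    """Crude number guess for a base NP: plural if the head noun ends in 's'."""
    head = np.split()[-1].lower()
    return "pl" if head.endswith("s") else "sg"

def resolve(pronoun, candidates):
    """Return the antecedent only if agreement leaves exactly one candidate."""
    pro_num = "pl" if pronoun.lower() in PLURAL_PRONOUNS else "sg"
    survivors = [np for np in candidates if number(np) == pro_num]
    if len(survivors) == 1:
        return survivors[0]
    return None  # more than one survivor: you have to make a choice
```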
Two or three broad categories of solutions
• Search
• Heuristic
– symbolic
– numerical
• Machine learning
Rule-based
Explicit discourse model
Hobbs 19what??
• 1976: tech report
• 1978: published in Lingua
• 1986: reprinted in Grosz, Sparck-Jones, and Webber, Readings in Natural Language Processing
Hobbs 1978
• Assessment of difficulty of problem
• Incidence of the phenomenon
• A simple algorithm that has become a baseline (sorta)
• Hobbs distance: ith candidate NP considered by the algorithm is at a Hobbs distance of i
Hobbs’s point
…the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for.
Hobbs’s point
Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.
(p. 345)
The NP vs. N’ distinction
A student of linguistics
A student of little promise
A student of linguistics of little promise
* A student of little promise of linguistics
The NP vs. N’ distinction
[A student [of linguistics]]
[A student] [of little promise]
[A student [of linguistics] [of little promise]]
* A student of little promise of linguistics
The NP vs. N’ distinction
[A student [of linguistics]N’]
[A studentN’] [of little promise]
[A student [of linguistics]N’] [of little promise]
* A student of little promise of linguistics
The NP vs. N’ distinction
[[A student [of linguistics]N’]NP]
[[A studentN’] [of little promise]NP]
[[A student [of linguistics]N’] [of little promise]NP]
* [A student of little promise of linguisticsNP]
The NP vs. N’ distinction
• Step (6) refers to NP (N”) and N’.
• Resolution of “Localization of the GTP-bound form of ARF6 at the plasma membrane makes it…”
The algorithm (my best guess)
• (1-3): conditional on intervening S or NP keeps you from positing driver for Fig. 1(b), but allows it for Fig. 1(a)
• (4): If you’re going to the next sentence, you’re picking out subjects preferentially
• (5-6): conditional on path keeps you from positing driver for Fig. 1(b), but allows it for Fig. 1(a)
• (7): Prefer subject within the current sentence
• (8): If this is a cataphor, you need to look to the right, but don’t go into the next clause
The algorithm: practice
• Kevin bought a cookie and ate it.
• Kevin bought a cookie. He ate it.
• Kevin bought a cookie. It was yummy.
• The dog found a biscuit and ate it.
• The dog found a biscuit. It ate it.
• The dog found a biscuit. It was yummy.
The algorithm: evaluation
• What actually gets tackled in this paper?
– Evaluation set: he, she, it, they
– Exclusions:
The algorithm: evaluation
• Corpus:
– Early civilization in China (book, non-fiction)
– Wheels (book, fiction)
– Newsweek (magazine, non-fiction)
• First 100 consecutive pronouns from each
The algorithm: results
• Overall, no selectional constraints: 88.3%
• Overall, with selectional constraints: 91.7%
• ????
The algorithm: results
• This is somewhat deceptive since in over half the cases there was only one nearby plausible antecedent. (p. 344)
• 132/300 times, there was a conflict
• 12/132 resolved by selectional constraints, 96/120 by algorithm
• Thus, 81.8% of the conflicts were resolved by a combination of the algorithm and selection.
Hobbs’s point
…the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for.
Hobbs’s point
Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.
(p. 345)
Adaptation for shallow parse (Kehler et al.)
Shallow parse: lowest-level constituents only; for coref, “base NP’s” (noun and all modifiers to the left)
a good student of linguistics with long hair
The castle in Camelot remained the residence of the king until 536 when he moved it to London.
Adaptation for shallow parse
…noun groups are searched in the following order:
1. In current sentence, R->L, starting from L of PRO
2. In previous sentence, L->R
3. In S-2, L->R
4. In current sentence, L->R, starting from R of PRO
(Kehler et al. 2004:293)
Adaptation for shallow parse
1. In current sentence, R->L, starting from L of PRO:
1. he: no AGR
2. 536: dates can’t move
3. the king: no AGR
4. the residence: OK!
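The search order and agreement filtering above can be sketched as follows; the indexing scheme and the `agrees` callback (which stands in for the real AGR and sortal checks) are illustrative:

```python
# Sketch of the shallow-parse adaptation: scan base NPs in the four
# search regions in order, returning the first candidate that passes
# the agreement filter.

def resolve_shallow(pronoun, pro_idx, sentences, cur, agrees):
    """sentences: list of lists of base NPs; cur: index of the
    pronoun's sentence; pro_idx: the pronoun's NP position in it."""
    order = list(reversed(sentences[cur][:pro_idx]))  # 1. current, R->L
    if cur >= 1:
        order += sentences[cur - 1]                   # 2. previous, L->R
    if cur >= 2:
        order += sentences[cur - 2]                   # 3. S-2, L->R
    order += sentences[cur][pro_idx + 1:]             # 4. current, L->R
    for np in order:
        if agrees(pronoun, np):
            return np
    return None
```

On the castle example, a filter that rejects "he" (no AGR), "536" (dates can't move), and "the king" (no AGR) leaves "the residence" as the first survivor scanning right-to-left.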
My modification for GeneRIFs
• Always start L->R, even in current sentence (unless reflexive, or even if reflexive??)
• Data:
Ge, Hale, and Charniak 1998
A statistical approach to anaphora resolution. Proceedings of the sixth workshop on very large corpora, pp. 161-170.
What can we take from this?
• Implementation details
– modularity aids in analysis
– a baseline: closest NP as referent (ignoring even AGR)
• Analysis
– incrementalism, or at least modularity
• Reference
– This seems to be the origin of the notion of the "Hobbs distance"
Approaches to the coref problem
• Search
• Rules
– symbolic
– numerical
• Machine learning
Hobbs 1978 (last time)
Ge, Hale, and Charniak (this time)
Focus on features
• What they are
• Why you might suspect they would be helpful (constraints, restrictions, preferences)
• Things to think about
– How would the feature set differ if you were looking at non-pronominal coreference?
– Are features getting at syntactic/semantic/discourse?
Syntactic/semantic/discourse-related features
• Syntactic
– +/- PP
– subj/obj
– distance/path
– gender/number AGR
• Semantic
– verb ID
– semantic class/entity identification
– thematic role
– MED
• Discourse
– repeated mention
Not a partition...
The distance feature
• The recency preference: "…entities introduced in recent utterances are more salient than those introduced from utterances further back."
– John has an Integra. Bill has a Legend. Mary likes to drive it.
• Build syntactic parse tree
• Run HNA, but don't stop until you've found 15 candidates
• ith candidate is at dH = i
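As a probability, Hobbs distance is just a relative frequency over the training data. A sketch of that estimate (my rendering of the description, not Ge et al.'s code):

```python
from collections import Counter

# Estimate P(correct antecedent at Hobbs distance d) as the relative
# frequency of correct antecedents at rank d in the training corpus,
# with candidates collected out to rank 15 as in the slide.

def distance_probabilities(training_ranks, max_rank=15):
    """training_ranks: Hobbs rank of the correct antecedent for each
    training pronoun. Returns {rank: probability}."""
    counts = Counter(training_ranks)
    total = len(training_ranks)
    return {d: counts.get(d, 0) / total for d in range(1, max_rank + 1)}
```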
The agreement feature
• Constraint, but approximated probabilistically
• "The probability that a referent is in a particular gender class is just the relative frequency with which that referent is referred to by a pronoun p that is part of that gender class"
• Noisy:
– my favorite aunt: she
– my favorite book: it
– P(she|aunt) = 1.0
– P(she|favorite) = 0.5
– Dunning test to find most informative word
773:14534930|We report on a family with ataxia type 6 (SCA6) showing peculiar oculomotor symptoms. They carried the identical mutation (the number of expanded CAG repeat, 24) in the CACNA1A gene.
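The relative-frequency estimate behind the agreement feature can be sketched directly; observations here are hypothetical (word, pronoun) pairs from coreference-marked text:

```python
from collections import Counter

# The probability that a word belongs to a gender class is estimated
# as the relative frequency with which it is referred to by pronouns
# of that class in annotated training data.

def gender_probability(observations, word, pronoun):
    """P(pronoun class | word) as a relative frequency over
    (word, pronoun) observation pairs."""
    with_word = [p for (w, p) in observations if w == word]
    if not with_word:
        return 0.0
    return Counter(with_word)[pronoun] / len(with_word)
```

This reproduces the slide's noisy behavior: "aunt" is always referred to by "she," but "favorite" (which occurs with both "aunt" and "book") is split between classes.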
Head word feature
• I picked up the book and sat in a chair. It broke.
• If verb is eat then referent is probably food, if verb is drink then referent is probably liquid, etc.
• John needed a car to get to his new job.
• He decided that he wanted something sporty.
• Bill went to the Acura dealership with him.
• He bought an Integra.
John
The repeated mention feature
• "…entities that have been focused on in the prior discourse are more likely to continue to be focused on in subsequent discourse…"
– John needed a car to get to his new job. He decided that he wanted something sporty. Bill went to the Acura dealership with him. He bought an Integra.
Evaluation
• Third-person singular pronouns (he, she, it) in various forms
• “Newswire” text, pre-parsed
Results
• 82.9% overall
• Hobbs distance: 65.3%
• …plus gender and animacy: 75.7%
• …plus head feature: 77.9%
• …plus repeated mention: 82.9%
This is knowledge!
So is this—but, is/has…
Carping
• Level of detail on inputs is not good
• Level of detail on results is not good
– What if the "increments" were ordered differently? Were applied singly? Seems pretty important….
• Implementation detail sometimes vague
What can we take from this?
• Implementation details
– modularity aids in analysis
– a baseline: closest NP as referent (ignoring even AGR)
• Analysis
– incrementalism, or at least modularity
• Reference
– This seems to be the origin of the notion of the "Hobbs distance"
Jackson & Moulinier (2002)
An alternate approach to hand-written heuristic rules is to have a program learn preferences among antecedents from sample data. Researchers at Brown University used a corpus of Wall Street Journal articles marked with coreference information to build a probabilistic model for this problem. This model then informs an algorithm for finding antecedents for pronouns in unseen documents. The model considers the following factors when assigning probabilities to pronoun-antecedent pairings.
• Distance between the pronoun and the proposed antecedent, with greater distance lowering the probability. Hobbs's Naïve Algorithm is used to gather candidate antecedents, which are then rank ordered by distance. The probability that the correct antecedent lies at distance rank d from the pronoun is then computed from corpus statistics as (the number of correct antecedents at distance d) / (the total number of correct antecedents).
• Mention count. Noun phrases that are mentioned repeatedly are preferred as antecedents. As well as counting mentions of referents, the authors make an adjustment for the position of the pronoun in the document. The later in the document a pronoun occurs, the more likely it is that its referent will have been mentioned multiple times.
• Syntactic analysis of the context surrounding the pronoun, especially where reflexive pronouns are concerned. Preferences for antecedents in the subject position and special treatment of reflexive pronouns are supplied by the Hobbs algorithm.
• Semantic distinctions, such as number, gender, and animate/inanimate, which make certain pairings unlikely or impossible. Given a training corpus of correct antecedents, counts can be obtained for such semantic features.
The probability that a pronoun corefers with a given antecedent is then computed as a function of these four factors, and the winning pair is the one that maximizes the probability of assignment. The authors performed an experiment to test the accuracy of the model on the singular pronouns ('he,' 'she,' and 'it'), and their various possessive and reflexive forms ('his,' 'hers,' 'its,' 'himself,' 'herself,' 'itself'). They implemented their model in an incremental fashion, enabling the contribution of the various factors to be analyzed. The results were quite interesting, and can be summarized as follows. Ignoring Hobbs's algorithm, and simply choosing the closest noun phrase as the referent, had a success rate of only 43%. Using the syntactic analysis afforded by the Naïve Algorithm increased accuracy to 65%. Adding semantic information, such as gender, raised the success rate to 76%. Adding additional information, such as mention counts, obtained a final increment to 83%. In restricting themselves to singular pronouns with concrete referents, the authors set out to solve a simpler problem than that addressed by CogNIAC [a rule-based system discussed earlier in the chapter], but the results are still impressive. These are very common usages, and there is considerable utility for text mining in being able to analyze them accurately. (pp. 189-191)
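The winning-pair computation described above can be sketched as a product of the four factor probabilities, maximized over candidates. The factor functions here are placeholders for the corpus-trained estimates, not the authors' actual model:

```python
# Each candidate's score is the product of the four factor
# probabilities; the candidate with the highest product wins.

def best_antecedent(candidates, p_distance, p_agreement, p_context, p_mention):
    """candidates: list of (np, hobbs_rank, mention_count) tuples."""
    def score(cand):
        np, rank, mentions = cand
        return (p_distance(rank) * p_agreement(np)
                * p_context(np) * p_mention(mentions))
    return max(candidates, key=score)[0]
```

Because the factors multiply, a hard agreement violation (probability zero) eliminates a candidate outright, while soft preferences like distance and mention count trade off against each other.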
Jackson & Moulinier (2002)
• They never mention the head feature
• They overstate the whole "syntactic analysis" thing—it's not a feature per se, except to the extent that it's included in the Hobbs distance and Hobbs distance is actually helpful
Mitkov (2002)
Ge, Hale and Charniak (1998) propose a statistical framework for resolution of third person anaphoric pronouns. They combine various anaphora resolution factors into a single probability which is used to track down the antecedent. The program does not rely on hand-crafted rules but instead uses the Penn Wall Street Journal Treebank to train the probabilistic model. The first factor the authors make use of is the distance between the pronoun and the candidate for an antecedent. The greater this distance, the lower the probability for a candidate NP to be the antecedent. The so-called 'Hobbs's distance' measure is used in the following way. Hobbs's algorithm is run for each pronoun until it has proposed N (in this case N = 15) candidates. The Kth candidate is regarded as occurring at 'Hobbs's distance' = K. Ge and co-workers rely on features such as gender, number and animacy of the proposed antecedent. Given the words contained in an NP, they compute the probability that this NP is the antecedent of the pronoun under consideration based on probabilities computed over the training data, which are marked with coreferential links. The authors also make use of co-occurrence patterns by computing the probability that a specific candidate occurs in the same syntactic function (e.g. object) as the anaphor. The last factor employed is the mention count of the candidate. Noun phrases that are mentioned more frequently have a higher probability of being the antecedent; the training corpus is marked with the number of times an NP is mentioned up to each specific point. The four probabilities discussed above are multiplied together for each candidate NP. The procedure is repeated for each NP and the one with the highest probability is selected as the antecedent. For more on the probabilistic model and the formulae used, see Ge et al. (1998).
The authors investigated the relative importance of each of the above four probabilities (factors employed) in pronoun resolution. To this end, they ran the program 'incrementally,' each time incorporating one more probability. Using only Hobbs's distance yielded an accuracy of 65.3%, whereas the lexical information about gender and animacy brought the accuracy up to 75.7%, highlighting the latter factor as quite significant. The reason the accuracy using Hobbs's algorithm was lower than expected was that the Penn Treebank did not feature perfect representations of Hobbs's trees. Contrary to initial expectations, knowledge about the governing constituent (co-occurrence patterns) did not make a significant contribution, only raising the accuracy to 77.9%. One possible explanation could be that selectional restrictions are not clear-cut in many cases; in addition, some of the verbs in the corpus such as is and has were not 'selective' enough. Finally, counting each candidate proved to be very helpful, increasing the accuracy to 82.9%.
Two or three broad categories of solutions
• Search
• Heuristic
– symbolic
– numerical
• Machine learning
Rule-based
Explicit discourse model
CogNIAC
• 328:12230304|APE/REF1 is increased & competent in the brain & spinal cord of individuals with amyotrophic lateral sclerosis. It is upregulated in spinal cord astrocytes & white matter pathways in familial ALS.
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• “Unique in discourse?”
• No
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• “If pronoun is reflexive (itself), pick most recent…”
• Vacuous
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• “Unique in current and prior sentences?”
• No
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• Possessive?
• No
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• Unique in current sentence?
• No
• APE/REF1; it
• the brain; it
• spinal cord; it
• individuals; they…
• amyotrophic lateral sclerosis; it
• Is there a possible match in subject position of preceding sentence, and is pronoun subject of its sentence?
• Yes!
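The walkthrough above is a rule cascade: high-precision rules fire in order, and the first one that applies decides (or the pronoun is left unresolved). A simplified sketch of that structure; the rule bodies are stand-ins for the real CogNIAC conditions, and candidate filtering is assumed done upstream:

```python
def cogniac(pronoun, compatible, cur_and_prev, cur_sentence,
            prev_subject, pronoun_is_subject):
    """compatible / cur_and_prev / cur_sentence: agreement-compatible
    candidates in the whole discourse, the current + prior sentences,
    and the current sentence, respectively."""
    # 1. Unique in discourse?
    if len(compatible) == 1:
        return compatible[0]
    # 2. Reflexive: pick the most recent compatible candidate.
    if pronoun.endswith("self") or pronoun.endswith("selves"):
        return compatible[-1]
    # 3. Unique in current and prior sentences?
    if len(cur_and_prev) == 1:
        return cur_and_prev[0]
    # (possessive rule omitted in this sketch)
    # 5. Unique in current sentence?
    if len(cur_sentence) == 1:
        return cur_sentence[0]
    # 6. Match in subject position of the preceding sentence, with the
    #    pronoun in subject position of its own sentence?
    if prev_subject in compatible and pronoun_is_subject:
        return prev_subject
    return None  # no high-precision rule fired: leave unresolved
```

On the APE/REF1 example, rules 1-5 all fail (no uniqueness, no reflexive), and rule 6 resolves the subject "it" to the prior subject APE/REF1, as in the slides.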
Two or three broad categories of solutions
• Search
• Heuristic
– symbolic
– numerical
• Machine learning
Rule-based
Explicit discourse model
A numerical/heuristic solution
• Castaño, Zhang, and Pustejovsky: Anaphora resolution in biomedical literature
• (Pre)processing and resources:
– POS tagging
– shallow parsing
– acronym definition
– semantic information from UMLS
– pleonastic detection with regex
• Find the set of possible antecedents
• If only one possible, then pick it
• If more than one, then assign each a score
– Start with zero
– Negative numbers are penalties
– Highest number wins
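The scoring scheme described above might look like this in miniature; the particular weights and feature names are illustrative, not Castaño et al.'s actual values:

```python
# Each surviving candidate starts at zero; penalties and bonuses are
# summed, and the highest-scoring candidate wins.

def score_candidates(candidates, pronoun_semtype):
    """candidates: list of dicts with 'np', 'semtype', 'distance',
    and 'is_subject' fields."""
    scored = []
    for c in candidates:
        score = 0
        if c["semtype"] != pronoun_semtype:
            score -= 10          # penalty: UMLS semantic type mismatch
        if c["is_subject"]:
            score += 2           # preference for subjects
        score -= c["distance"]   # penalty grows with distance
        scored.append((score, c["np"]))
    return max(scored)[1]
```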
Andy Kehler’s big question about anaphora
If anaphora is hard to resolve, why do we use it so much??
References
• Mitkov, Ruslan (2002) Anaphora resolution pp. 117-118 Pearson Education, Ltd.
• Kehler, Andrew (2002) Coherence, reference, and the theory of grammar
• Kehler, Andrew (2000) Discourse. In Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition (Jurafsky and Martin 2000)
• Jackson, Peter; and Isabelle Moulinier (2002) Natural language processing for online applications: text retrieval, extraction and categorization pp. 189-190
• Manning, Christopher; and Heinrich Schuetze (1999) Foundations of statistical natural language processing
Oddities of GRIFs with respect to coreference resolution
• Impose a very small scope on the search space
• Exophoric 3rd-person pronouns
• Weird notion of “sentence” in written language
• All NP’s are pretty recent
• 255 chars don’t let you repeat much