Finding and Using Rhetorical-Semantic Relations in Text Sasha Blair-Goldensohn 28 April 2005


Page 1:

Finding and Using Rhetorical-Semantic Relations in Text

Sasha Blair-Goldensohn

28 April 2005

Page 2:

Outline

• Background

• Relations and Definitional QA

• Exploring Statistical Techniques for Relation Finding

• Using Mined Relations For Fun and Profit

Page 3:

Situating This Talk

• Various levels of textual relations (a.k.a. predicates)
  – Word-level, e.g. hypernym-hyponym
    • WordNet catalogs many of these
  – Syntactic, e.g. verb-argument
  – Propositional, e.g. agent-patient
    • A wide array of work on parsers for syntactic and propositional structure can derive relations at the sentence level
  – Rhetorical, e.g. cause-effect, contrast
    • Work in this domain is more theoretical; no “general use” parser

• This talk
  – How rhetorical-type relations can be useful for a particular task
    • Interaction between rhetorical and word-level relations
  – Experiments in detecting and using these relations

Page 4:

Motivation

• Definitional Questions: “What/Who is X?”
  • Concepts / Things / Processes: Muzak, thin layer chromatography, Hogwarts, Aum Shinrikyo, etc.
  • People: Sonia Gandhi, Neil Diamond

• Exploratory manual analysis of definitions
  – Some properties consistently “good” across topics
    • e.g., Superordinate, Cause-Effect, Contrast
  – Other “good” properties harder to generalize
    • Different for a chemical procedure (applications, process components) vs. a cult (founder, beliefs, membership)
  – Templates could be useful here for certain broad categories (people, organizations, etc.)
  – … but our focus is on a system to define any term

Page 5:

DefScriber: A Hybrid System

• Knowledge-driven: three predicates (a.k.a. relations):
  • Genus: category information (“Shiraz is a grape.”)
  • Species: differentiating the subject from other category members (“Shiraz is used to make a popular style of red wine…”)
    – Sentences containing both Genus and Species identified by pattern
  • Non-specific Definitional (NSD): relevant information that may be impractical to classify generally (“Reds are now in favor in Australia, but in the 1970s white wine was more popular.”)
    – NSD sentences identified (mainly) as a function of term concentration

• Data-driven: statistical summarization-style techniques to organize NSD information
  • Separate core concepts from more marginal ones
  • Cluster key subtopics
  • Order sentences using importance and cohesion
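A minimal sketch of the word-overlap similarity underlying these data-driven steps, using raw token counts only (the actual system also applies IDF weighting, which is omitted here):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity over raw token counts (no IDF weighting)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Sentences about the same term score higher than unrelated ones, which is what the clustering and ordering steps exploit.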

Page 6:

Example: Sentence To Pattern

[Figure: Pattern-Based Relation Identification (G-S). The original Genus-Species sentence, “The Hajj, or Pilgrimage to Makkah (Mecca), is the central duty of Islam.”, yields an extracted partial syntax-tree pattern: S → NP [DT? TERM] VP [FormativeVb NP (Genus) PP (PREP NP (Species))]. This pattern in turn matches the input sentence “The Hindu Kush represents the boundary between two major plates: the Indian and Eurasian.”]
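The figure's idea can be caricatured with a flat regular expression; the real patterns match over syntax trees, and the pattern and helper below are invented, drastically simplified stand-ins:

```python
import re

# Hypothetical flat stand-in for a Genus-Species syntax-tree pattern:
# TERM (, apposition ,)? FormativeVb GENUS PREP SPECIES
GS_PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w ]+?)(?:, [^,]+,)? "
    r"(?:is|are|represents) "
    r"(?P<genus>(?:a|an|the) [\w ]+?) "
    r"(?P<prep>of|between) (?P<species>.+)$"
)

def match_genus_species(sentence):
    """Return (term, genus, species) if the sentence fits the pattern, else None."""
    m = GS_PATTERN.match(sentence.rstrip("."))
    return (m.group("term"), m.group("genus"), m.group("species")) if m else None
```

Both example sentences from the figure fit this toy pattern, while ordinary non-definitional sentences do not.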

Page 7:

Who is Sonia Gandhi?

Congress President Sonia Gandhi, who married into what was once India’s most powerful political family, is the first non-Indian since independence 50 years ago to lead the Congress. After Prime Minister Rajiv Gandhi was assassinated in 1991, Gandhi was persuaded by the Congress to succeed her husband to continue leading the party as the chief, but she refused. The BJP had shrugged off the influence of the 51-year-old Sonia Gandhi when she stepped into politics early this year, dismissing her as a “foreigner.” Sonia Gandhi is now an Indian citizen. Gandhi, who is 51, met her husband when she was an 18-year old student at Cambridge in London, the first time she was away from her native Italy.

Example Output (From DUC 2004)

• Starting with Genus and Species information gives answer context
• Word-based chaining of concepts for cohesion
• Use of pronoun rewriting (Nenkova, 2003) to clarify initial references and make later ones more fluid
• Contrast reads well – but we were just lucky!
• Statistical analysis (data-driven techniques) creates a definition that proceeds from more to less central topics
  – Five sentences extracted from four different documents

Page 8:

Some Formal Evaluations

• Survey-based evaluation (2003)
  – Users rated five qualitative aspects of definitions
  – Showed significant improvement over query-focused multi-document summarization

• Automatic and manual evals in DUC 2004 “Who is X?” task
  – Best results among 22 teams in automated (ROUGE) evaluation (significantly better than 20)
  – Less distinguished in manual evaluation of coverage, responsiveness, and quality
    • Little significant difference: on average, 1.1 systems better, 2 worse
    • Because the task is extractive?

Page 9:

Informal Observations

• DefScriber Pros
  – Robust: data-driven approaches will provide an answer for any topic, dynamically
    • Stock answer for “Why not use Google definitions?”
  – Nice answers when we find a G-S sentence and we have some coherent threads

• Cons
  – Predicate coverage for G-S only
  – Data-driven techniques are limited
    • Similarity-based (word-overlap)
    • Use data from retrieved documents only (mod IDF)

Page 10:

Adding Predicates

• We want to add predicates that are consistently useful, e.g. Cause-Effect, Contrast
  – The approach of syntax-tree patterns gives high precision (~96%) but uneven recall, and requires significant manual effort
  – An initial markup study indicates these predicates are stated in highly varied ways, and not always explicitly
    • E.g., “Diabetes is a disease of the endocrine system. Symptoms can include tiredness, thirst and the need to urinate frequently.”

• Idea: a technique to determine a relation using word pairs, even when the relation is not explicitly stated

Page 11:

Strengthening Data-driven Techniques

• We want to strengthen our techniques, because word-based similarity can limit us in some cases, e.g.:
  • We would like to follow:
    – Tachyons are a class of particles which are able to travel faster than the speed of light.
  • With:
    – By extension of this terminology, particles that travel slower than light are called tardyons, and particles, such as photons, that travel exactly at the speed of light are called luxons.
  • But the felicitousness of this combination, due to Contrast, is missed by a similarity-based metric

• Idea: a technique to add relations, in addition to similarity/identity, to a cohesion metric

Page 12:

Choosing an Approach

• Learning relationship content, e.g. that disease causes symptoms, or that faster contrasts with slower
  – Echihabi and Marcu (2002) use cue phrases to mine large corpora to construct a word-pair-based classifier for four relations, including Cause and Contrast, and detect these relations across clauses or sentences
  – Lapata and Lascarides (2004) use a similar approach for sentence-internal temporal relations (Before, After, During, etc.) using word pairs and other features like verb tense

• As opposed to learning patterns
  – Snow, Jurafsky et al. (2005) use a supervised approach to learn dependency-tree patterns for the hypernymy relation
    • e.g., “X is a Y”, “X, Y and other Z”, etc.
  – Some issues, including usefulness for non-explicit relations and the cohesion application (more later)

Page 13:

The Approach

• Begin by following Echihabi and Marcu:
  – Compile a small set of cue phrases for each relation, e.g.
    • Cause: [Because X, Y], [X. As a consequence, Y], etc.
    • Contrast: [X. However, Y], [X even though Y], etc.
    • Baseline: choose random non-contiguous sentences from a document
  – Mine a large amount of (noisy) data:
    • If we find a sentence “Because [x1 x2 … xn], [y1 y2 … ym].”, note down that the pairs (x1, y1) … (xn, ym) were observed in a causal setting
    • So if we find “Because [of poaching, smuggling and related treacheries], [tigers, rhinos and civets are endangered species].”
    • … our belief that the pair (poaching, endangered) indicates a causal relationship is increased
  – Construct a naïve Bayes classifier such that for two text spans W1 and W2, the probability of relation rk is estimated as:

    P(rk | W1, W2) ∝ P(rk) · ∏ over (wi, wj) ∈ W1 × W2 of P((wi, wj) | rk)
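A toy version of this mining-and-classification loop; the class name, counts, and pair-space size B are invented for illustration, and a uniform prior over relations is assumed (the real system mines hundreds of thousands of examples):

```python
import math
from collections import Counter
from itertools import product

class RelationModel:
    """Word-pair naive Bayes in the style described above (toy sketch)."""

    def __init__(self, lam=0.01, pair_space=1_000_000):
        self.counts = {}      # relation -> Counter over (left_word, right_word)
        self.totals = {}      # relation -> total observed pair count
        self.lam = lam        # Laplace parameter
        self.B = pair_space   # assumed number of possible pairs, for smoothing

    def observe(self, relation, left_words, right_words):
        """Record every cross-product word pair from one mined example."""
        c = self.counts.setdefault(relation, Counter())
        for pair in product(left_words, right_words):
            c[pair] += 1
        self.totals[relation] = (self.totals.get(relation, 0)
                                 + len(left_words) * len(right_words))

    def log_prob(self, relation, left_words, right_words):
        """Laplace-smoothed log P(word pairs | relation)."""
        c, n = self.counts[relation], self.totals[relation]
        return sum(math.log((c[p] + self.lam) / (n + self.lam * self.B))
                   for p in product(left_words, right_words))

    def classify(self, left_words, right_words):
        """Pick the relation maximizing the pair likelihood (uniform prior)."""
        return max(self.counts,
                   key=lambda r: self.log_prob(r, left_words, right_words))
```

After observing the poaching example under Cause, the pair (poaching, endangered) pulls new span pairs toward the Cause label.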

Page 14:

Goals

• Attain “good” accuracy
  – Not essential to exceed previous numbers, since we are concerned with application

• Apply model to address DefScriber “cons”
  – Make a system that can be used in an online setting

• Consider alternative uses for model

Page 15:

System Design

• Corpus: Aquaint collection (LDC) of approximately 20M sentences of newswire text from 1996-2000

• Mined examples of Cause and Contrast
  – Approx. 407k Cause
  – Approx. 943k Contrast
  – Trained system on approx. 400k each, and added 400k “no relation” as a baseline
    • “No relation” is taken as sentence pairs from the same document which are at least 3 sentences apart

• 64M word pairs with counts in a MySQL database
  – Efficiency concerns

Page 16:

Classification Task

• Given two text spans, predict the relation between them when cue patterns are removed

• Used 10k held-out test examples for each relation type
  – Baseline for binary classifier = 50%

Page 17:

Smoothing

• Our data is very sparse given the possible number of word pairs (99% of possible pairs unseen in 400k norel sentence pairs)

• Using Laplace smoothing, we estimate the probability of a given word pair as:

    P_Lap(x, y) = (C(x, y) + λ) / (N + λB)

  where C(x, y) is the pair’s observed count, N the total number of observed pairs, and B the number of unseen events

• But with λ = 1, 94% of the probability space goes to unseen events

• We can experiment with smaller λ
  – Or estimate values empirically
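The effect of λ on the unseen-event mass is easy to sanity-check numerically; the counts below are illustrative stand-ins, not the talk's actual figures:

```python
def unseen_mass(n_observed, n_unseen, lam):
    """Total Laplace probability assigned to the n_unseen unseen events."""
    return (lam * n_unseen) / (n_observed + lam * n_unseen)

# With lambda = 1 and unseen events heavily outnumbering observations,
# most of the probability mass lands on pairs we never saw;
# shrinking lambda pulls the mass back toward observed pairs.
big = unseen_mass(400_000, 6_300_000, 1.0)     # roughly 0.94
small = unseen_mass(400_000, 6_300_000, 0.01)  # far less
```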

Page 18:

Effect of λ Parameter

[Figure: binary classification accuracy at 100k training examples as the Laplace parameter λ varies over 1.0, 0.1, 0.01, 0.001, 0.0001; one curve for cause vs. norel and one for contrast vs. norel, with accuracy plotted on a 0.40-0.75 scale.]

Page 19:

Good-Turing Smoothing

• Smoothes all counts based on the ratio of frequencies of frequencies
  – Gives N1/N = .08 probability to unseen events

• Depends on choice of smoothing function for higher frequencies, where we have few examples

• In limited experiments, performed moderately worse than Laplace (within .05)
  – May improve with more data (and effort!)
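The N1/N estimate is simple to compute; the sample data below is invented to illustrate the ratio:

```python
from collections import Counter

def good_turing_unseen_mass(observations):
    """Good-Turing mass for unseen events: N1 / N, where N1 is the number
    of event types seen exactly once and N the total observation count."""
    type_counts = Counter(observations)
    n1 = sum(1 for c in type_counts.values() if c == 1)
    return n1 / len(observations)

# 5 singleton types (b, c, e, f, g) out of 10 observations -> 0.5
sample = ["a", "a", "b", "c", "d", "d", "d", "e", "f", "g"]
```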

Page 20:

Stemming

• Experimented with the Porter stemmer to address sparsity
  – Improves classification accuracy marginally (< 1 percent)

• However, somewhat coarse-grained for other tasks
  – Currently using unstemmed models; lemmatization might be better

Page 21:

Classification Results

[Figure: binary classification accuracy (unstemmed, Laplace λ = 0.01) as training examples grow over 100k, 200k, 400k, 890k, 3882k (x-axis not to scale); curves for cause vs. norel and contrast vs. norel, each with a reference curve; plotted accuracies fall between 0.64 and 0.85.]

Page 22:

Another Task: Term Suggestion

• We can also use these models to look for pairs of words which are most strongly linked for a given relation, e.g. Contrast

• Using a log-likelihood measure à la Dunning
  – Null hypothesis is that for two terms w and t, the pair (w, t) is equally likely under the Contrast model or not:
    H0: P(w, t | ContrastModel) = P(w, t | ~ContrastModel) = P(w | t)
  – So given a word w, we wish to suggest the term(s) t for which H0 is most unlikely

• Issues: evaluation and sparsity
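Dunning's test reduces to a log-likelihood ratio over a 2x2 contingency table; a compact sketch, where the cell labels follow one common convention rather than the talk's notation:

```python
import math

def dunning_llr(k11, k12, k21, k22):
    """Dunning log-likelihood ratio for a 2x2 contingency table, e.g.
    k11 = count of pair (w, t) in Contrast-mined data,
    k12 = count of (w, t) elsewhere,
    k21 = other pairs containing w in Contrast-mined data,
    k22 = other pairs containing w elsewhere."""
    def h(*ks):
        # unnormalized entropy term: sum of k * log(k / total)
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))
```

Higher scores mean the pair co-occurs with Contrast contexts far more often than independence predicts, so candidate terms t can be ranked by this score.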

Page 23:

Term Suggestion: an Example

• Recall our example:
  • Tachyons are a class of particles which are able to travel faster than the speed of light.
  • By extension of this terminology, particles that travel slower than light are called tardyons, and particles, such as photons, that travel exactly at the speed of light are called luxons.

• Contrast terms above the log-likelihood threshold:
  • Speed: not, still, only, speed, average, exactly, football, slower, dial, race, faster, isn’t, efficient, strength, toughness
  • Faster: buyer, perhaps, #unk#, speed
  • Class: not, restroom, island, mostly, individual, down, lost, subject, guys, only, schools
  – Non-content terms: may indicate contrast language
  – Noise / context-specific suggestions
  – Useful terms: some antonyms, but also pseudo-coordinates, and often the term itself – we are interested in rhetorical relevance more than strict relation

• Seems promising, but only anecdotal evidence here

Page 24:

Applying to Definitional Answers

• Several potential directions for algorithm input from relation models
  – As additional weight when selecting the “next” sentence, by measuring the cause/contrast-ness of the pairing
    • Idea: encourage causal / contrast “chains” in the definition
    • Could be done as classification or with term suggestions
  – Use term suggestions to boost the “importance” measure at the word level
    • Idea: even if a sentence doesn’t seem ideal from a cohesion perspective, it may be important enough to insert anyway if it has strong relation links with the cluster as a whole
  – “Needle in Haystack” issue
    • Which terms to use as seeds for suggestion?

Page 25:

Contrast Chain Weighting

Idea: Use suggested terms rather than the span classifier, since the textual regularities of adjacent sentences may be missing

Algorithm:
1. Extract keywords K from current sentence
2. For each k in K:
   1. Get terms T with LogLike(Contrast(t, k)) > threshold
   2. For each potential next sentence S, ContrastScore(S) = WeightedOverlap(T, S)
3. Choose best next S as a function of ContrastScore(S) and other weights
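The steps above can be sketched as follows; the suggestion table, candidate sentences, and unit weights are invented placeholders for the log-likelihood-thresholded terms and system weights:

```python
def contrast_score(keywords, candidate, contrast_terms):
    """WeightedOverlap(T, S) with unit weights: count suggested contrast
    terms (for the current sentence's keywords) appearing in candidate S."""
    suggested = set()
    for k in keywords:
        suggested.update(contrast_terms.get(k, ()))
    return len(suggested & set(candidate.lower().split()))

def pick_next(keywords, candidates, contrast_terms, other_weight):
    """Choose the next sentence by ContrastScore plus any other weights."""
    return max(candidates,
               key=lambda s: contrast_score(keywords, s, contrast_terms)
                             + other_weight(s))

# Invented suggestion table standing in for thresholded Contrast terms
contrast_terms = {"faster": ["slower"], "light": ["slower", "tardyons"]}
candidates = ["particles that travel slower than light are called tardyons",
              "the speed of light was measured in 1676"]
```

With keywords from the tachyon sentence, the tardyon sentence wins the contrast chain over an unrelated candidate.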

Page 26:

Applying To Definitions: “What is bankruptcy?”

Old Answer:

There are two types of bankruptcy - Chapter 7 bankruptcy and Chapter 13 bankruptcy. People with insufficient assets or income could still file a Chapter 7 bankruptcy, which if approved by a judge erases debts entirely after certain assets are forfeited. File bankruptcy petition with the clerk of the bankruptcy courts. Bankruptcy spawns new restaurant Jan 25, 2005 Lansdale Reporter, According to United States Bankruptcy Court documents Memphis Magic filed for Chapter 11 bankruptcy on Oct. 29 which had voluntarily ... Some people file bankruptcy because of the automatic stay provision, the part of the bankruptcy code that offers legal protection against bill collectors.

New Answer:

There are two types of bankruptcy Chapter 7 bankruptcy and Chapter 13 bankruptcy. When a co-signer is involved in consumer debt situations, a Chapter 13 proceeding could protect the co-signer who has not also filed for bankruptcy protection. People with insufficient assets or income could still file a Chapter 7 bankruptcy, which if approved by a judge erases debts entirely after certain assets are forfeited. Just filing the bankruptcy does not breach the mortgage; filing to make payments according to the loan agreement is a breach. Personal debt pushes more into bankruptcy Jan 26, 2005 Manawatu Standard, The rules that apply to personal bankruptcy are similar to those that govern company bankruptcy: the slate is wiped clean after three years.

Page 27:

Further Uses for Model

• For coherence/cohesion in general-purpose summarization

• For answering causal or comparative questions
  – “Why did Dow-Corning go bankrupt?”
    • Filter by terms that have a causal relationship with bankruptcy
  – “How fast is a lion?”
    • Filter by terms that are contrasted with fast

• As added weight on bootstrapped data for, e.g., opinions
  – If we believe term X has strong positive orientation, and we believe X causes/contrasts reliably with Y, we can increase/decrease our belief about the positive orientation of Y

• As a general tool for applications that can accept weaker inferences in exchange for broad coverage

Page 28:

Alternatives

• “Couldn’t you just use WordNet?”
  – Certainly complementary
  – WN has issues of coverage
    • Number of terms and number of relations both limited
    • Much more precise, but doesn’t clearly contain things like the “contrast” between speed and strength
  – Probabilities over relations

• “What about patterns?”
  – Again complementary
  – Issues with explicit statement of relations
  – For methods like Snow et al., need training data

Page 29:

Issues

• Sparsity
  – More effort into smoothing (class-based methods, principled estimation for parameter-based techniques)
  – Additional data, features

• Pattern inaccuracy
  – Estimated at up to 15% by Echihabi -- address with syntax-aware patterns
  – e.g., “I think the bond is going to pass as it is because it’s an excellent proposal,” [she said].
  – Pattern-learning can discover and rank patterns, but most methods need training data

• Evaluation
  – DUC, TREC, and others!

Page 30:

Wrap Up

• Building a model of certain rhetorical-semantic relations seems feasible

• Validated previous work on classification

• Exploring new avenues for applying these models to QA, summarization, and beyond

Page 31:

Example Run: “What is the Hajj?”

Pipeline: Document Retrieval → Predicate Identification → Data-Driven Analysis → Definition Creation

• Input: 11 Web documents, 1127 total sentences
• Predicate Identification finds 9 Genus-Species sentences and 383 Non-specific Definitional sentences
• Data-Driven Analysis produces clusters and ordering information

Goal-Driven: Use definitional predicates such as Genus and Species to search for sentences conveying typical definitional information. Implementation combines feature-based classification and pattern recognition over syntax trees.

Data-Driven: Adapt techniques from summarization to maximize content importance, cohesion and coverage. Implementation uses lexical distance for centroid-based clustering and cohesion metrics.

Sample Genus-Species sentences:
1. The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam.
2. The Hajj is a milestone event in a Muslim's life.
3. The hajj is one of five pillars that make up the foundation of Islam.
4. The hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. …

Resulting definition:

The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina. The hajj is one of five pillars that make up the foundation of Islam.