Predicting sentence specificity, with applications to news summarization
Ani Nenkova, joint work with Annie Louis
University of Pennsylvania


Page 1:

Predicting sentence specificity, with applications to news summarization

Ani Nenkova, joint work with Annie Louis

University of Pennsylvania

Page 2:

Motivation
A well-written text is a mix of general statements and sentences providing details.
In information retrieval: find relevant and well-written documents.
Writing support: visualize general and specific areas.

Page 3:

Supervised sentence-level classifier for general/specific
Training data: used existing annotations for discourse relations from the PDTB
Features: lexical, language model, syntax, etc.
Testing data: annotators judged more sentences
Applications to analysis of summarization output: automatic summaries are too specific, and are worse for it

Page 4:

Training data

Penn Discourse Treebank

Page 5:

Penn Discourse Treebank (PDTB)

Largest annotated corpus of explicit and implicit discourse relations

1 million words of Wall Street Journal

Arguments – spans linked by a relation (Arg1, Arg2)

Sense – semantics of the relation (3-level hierarchy)

I love ice-cream but I hate chocolates. (discourse connective)

I came late. I missed the train. (adjacent sentences in the same paragraph)

Page 6:

Distribution of relations between adjacent sentences

(EntRel: adjacent sentences linked only by an entity; not considered a true discourse relation.)

Page 7:

Training data from PDTB Expansions

Expansion
  Conjunction [also, further]
  Instantiation [for example]
  Restatement [specifically, overall]
    Specification
    Equivalence
    Generalization
  Alternative [or, instead]
    Conjunctive
    Disjunctive
    Chosen alternative
  Exception [except]
  List [and]

Page 8:

Instantiation example

The 40-year-old Mr. Murakami is a publishing sensation in Japan.

A more recent novel, "Norwegian Wood", has sold more than forty million copies since Kodansha published it in 1987.

Page 9:

Examples of general/specific sentences

Despite recent declines in yields, investors continue to pour cash into money funds.

Assets of the 400 taxable funds grew by $1.5 billion during the latest week, to $352 billion. [Instantiation]

By most measures, the nation’s industrial sector is now growing very slowly—if at all.

Factory payrolls fell in September. [Specification]

Page 10:

Experimental setup: two classifiers
Instantiation-based: Arg1 general, Arg2 specific; 1403 examples
Restatement#Specification-based: Arg1 general, Arg2 specific; 2370 examples
Implicit relations only; 50% baseline accuracy; 10-fold cross-validation; logistic regression
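This setup could be sketched roughly as below. It is a minimal bag-of-words illustration, not the authors' implementation; `sentences` and `labels` are placeholders for the PDTB-derived Arg1/Arg2 spans, and the real feature set is the one described on the next slides.

# Minimal sketch of the training setup on this slide (not the authors' code):
# Arg1 spans of implicit Instantiation relations are labeled general (0),
# Arg2 spans specific (1); a logistic-regression classifier is evaluated
# with 10-fold cross-validation against the ~50% majority baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def evaluate_general_specific(sentences, labels):
    """sentences: list of Arg1/Arg2 strings; labels: 0 = general, 1 = specific."""
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, sentences, labels, cv=10).mean()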

Page 11:

Features

Developed from a small development set: 10 pairs of Specification, 10 pairs of Instantiation

Page 12:

Features for general vs. specific
Sentence length: no. of tokens, no. of nouns (we expected general sentences to be shorter)
Polarity: no. of positive/negative/polarity words, also normalized by length (General Inquirer, MPQA subjectivity lexicon); in the dev set, sentences with strong opinions are general
Language models: unigram/bigram/trigram probability and perplexity, trained on one year of New York Times news; in the dev set, general sentences contained unexpected, catchy phrases
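As one illustration, a unigram version of the language-model features might look like the sketch below; it assumes a plain add-one-smoothed unigram model, whereas the actual features use unigram, bigram, and trigram models trained on a year of New York Times text.

import math
from collections import Counter

def unigram_lm_features(sentence_tokens, background_tokens):
    """Sketch of one LM feature pair: log-probability and perplexity of a
    sentence under an add-one-smoothed unigram model estimated from a
    background corpus (a single token list stands in for the NYT data)."""
    counts = Counter(background_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1                      # +1 reserves mass for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab))
                   for w in sentence_tokens)
    perplexity = math.exp(-log_prob / max(len(sentence_tokens), 1))
    return log_prob, perplexity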

Page 13:

Features for general vs. specific (continued)
Specificity: min/max/avg IDF; WordNet hypernym distance to the root for nouns and verbs (min/max/avg)
Syntax: no. of adjectives, adverbs, ADJPs, ADVPs, verb phrases; avg. VP length
Entities: numbers, proper names, $ sign, plural nouns
Words: count of each word in the sentence
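A rough sketch of two of these specificity features is shown below, assuming NLTK's WordNet interface and taking the first synset per word; the exact lookup and normalization used in the original work are not specified on the slides.

import math
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def hypernym_depths(tagged_words):
    """Hypernym distance to the WordNet root for nouns and verbs
    (min/max/avg of these values are used as features).
    tagged_words: list of (word, pos) with pos in {wn.NOUN, wn.VERB}."""
    depths = []
    for word, pos in tagged_words:
        synsets = wn.synsets(word, pos=pos)
        if synsets:                       # skip words WordNet does not know
            depths.append(synsets[0].min_depth())
    return depths

def idf_values(sentence_tokens, documents):
    """IDF of each sentence token against a background document collection
    (documents: list of token sets); min/max/avg IDF are used as features."""
    n_docs = len(documents)
    return [math.log(n_docs / (1 + sum(1 for doc in documents if w in doc)))
            for w in sentence_tokens]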

Page 14:

Accuracy of general/specific classifier using Instantiations

[Bar chart: accuracy (50-80%) by feature set: verbs, sent. len., polarity, syntax, specificity, lang. model, entities, words, all, all-words]

Best: 76% accuracy

Page 15:

Accuracy of general/specific classifier using Specifications

[Bar chart: accuracy (50-65%) by feature set: polarity, verbs, lang. model, entities, sent. len., specificity, syntax, words, all, all-words]

Best: 60% accuracy

Page 16:

The Instantiation-based classifier gave better performance.
Best individual feature set: words (74.8%). Non-lexical features are equally good: 74.1%.
No improvement from combining: 75.8%.

Page 17:

Feature analysis: words with the highest weight [Instantiation-based classifier]
General: number, but, also, however, officials, some, what, lot, prices, business, were…
Specific: one, a, to, co, I, called, we, could, get…
General sentences are characterized by: plural nouns, dollar signs, lower probability, more polarity words, and more adjectives and adverbs.
Specific sentences are characterized by: numbers and names.
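The word-weight analysis above can be reproduced from a fitted bag-of-words model along these lines; this is a generic sketch (the vectorizer and logistic-regression objects would come from a pipeline like the earlier one), not the authors' analysis code, and it assumes class 1 = specific.

import numpy as np

def top_weighted_words(vectorizer, logreg, k=10):
    """Words with the most negative (general) and most positive (specific)
    coefficients in a fitted CountVectorizer + LogisticRegression model."""
    vocab = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(logreg.coef_[0])
    return {"general": vocab[order[:k]].tolist(),
            "specific": vocab[order[-k:]].tolist()}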

Page 18:

More testing data

Direct judgments of WSJ and AP sentences on Amazon Mechanical Turk

~600 sentences, 5 judgments per sentence

Page 19:

Agree | Total (WSJ) | General (WSJ) | Specific (WSJ) | Total (AP) | General (AP) | Specific (AP)
5     | 96          | 51            | 45             | 108        | 33           | 75
4     | 102         | 57            | 45             | 91         | 35           | 56
3     | 95          | 52            | 43             | 88         | 49           | 39
Total | 294         | 160           | 133            | 292        | 117          | 170

In WSJ, more sentences are general (55%). In AP, more sentences are specific (60%).

Page 20:

Why the difference between Instantiation and Specification? Some of the newly annotated sentences were part of our initial training data:

Instantiation (32) | General | Specific
Arg1               | 29      | 3
Arg2               | 6       | 26

Specification (16) | General | Specific
Arg1               | 10      | 6
Arg2               | 8       | 8

Instantiation has more detectable properties associated with Arg1 and Arg2.

Page 21:

Accuracy of classifier on new data

Examples  | All features | Non-lexical | Words | All features | Non-lexical | Words
5 Agree   | 90.6         | 96.8        | 84.3  | 69.4         | 94.4        | 78.7
4+5 Agree | 80.8         | 88.8        | 77.7  | 65.8         | 89.9        | 74.8
All       | 73.7         | 76.7        | 71.6  | 59.2         | 81.1        | 67.5

Non-lexical features work better on this data. Performance is almost the same as in cross-validation.
The classifier is more accurate on examples where people agree; classifier confidence correlates with annotator agreement.

Page 22:

Application of our classifier to full articles
Distribution of general/specific sentences in news documents
Can the classifier detect differences in general/specific summaries written by people?
Do summaries have more general/specific content compared to the input? How does it impact summary quality?
Compare different types of summaries:
Human abstracts: written from scratch
Human extracts: sentences selected whole from the inputs
System summaries: all extracts

Page 23:

Example general and specific predictions

Seismologists said the volcano had plenty of built-up magma and even more severe eruptions could come later. [general]

The volcano's activity -- measured by seismometers detecting slight earthquakes in its molten rock plumbing system -- is increasing in a way that suggests a large eruption is imminent, Lipman said. [specific]

Page 24:

Example predictions

The novel, a story of a Scottish low-life narrated largely in Glaswegian dialect, is unlikely to prove a popular choice with booksellers who have damned all six books shortlisted for the prize as boring, elitist and – worst of all – unsaleable. [Specific]

…The Booker prize has, in its 26-year history, always provoked controversy. [General]

Page 25:

Computing specificity for a text

Sentences in a summary are of varying length, so we compute a score at the word level: "average specificity of words in the text".

[Diagram: each sentence (S1, S2, S3) receives the classifier's confidence of being in the specific class (e.g., 0.23, 0.81, 0.68); that confidence is assigned to every token in the sentence, and the specificity score of the text is the average over all tokens.]
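A minimal sketch of this computation is below; `p_specific` stands for the trained classifier's confidence that a sentence is specific (a hypothetical callable, since the slides do not show code).

def text_specificity(sentences, p_specific):
    """Word-level specificity of a text: every token inherits its sentence's
    classifier confidence of being specific, and the text score is the mean
    over all tokens, so longer sentences contribute proportionally more."""
    token_scores = []
    for sentence in sentences:
        confidence = p_specific(sentence)          # e.g. 0.23, 0.81, 0.68
        token_scores.extend([confidence] * len(sentence.split()))
    return sum(token_scores) / max(len(token_scores), 1)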

Page 26:

50 specific and general human summaries

Text      | General category | Specific category
Summaries | 0.55             | 0.63
Inputs    | 0.63             | 0.65

No significant differences in specificity of the inputs.
Significant differences in specificity of the summaries in the two categories.
Our classifier is able to detect the differences.

Page 27:

Data: DUC 2002
Generic multi-document summarization task
59 input sets, each with 5 to 15 news documents
3 types of 200-word summaries, with manually assigned content and linguistic quality scores:
1. Human abstracts (2 assessors × 59)
2. Human extracts (2 assessors × 59)
3. System extracts (9 systems × 59)

Page 28:

Specificity analysis of summaries

1. More general content is preferred in abstracts

2. Simply the process of extraction makes summaries more specific

3. System summaries are overly specific

[Average specificity: Inputs 0.65, Human abstracts 0.62, Human extracts 0.72, System extracts 0.74]

Page 29:

Histogram of specificity scores

Human summaries are more general

Is this aspect related to summary quality?

Page 30:

Analysis of system summaries: specificity and quality
1. Content quality: importance of the content included in the summary
2. Linguistic quality: how well-written the summary is perceived to be
3. Quality of general/specific summaries: when a summary is intended to be general or specific

Page 31:

Relationship to content selection scores
Coverage score: closeness to a human summary (clause-level comparison)
For system summaries: correlation between coverage score and average specificity is -0.16*, p-value = 0.0006
Less specific ~ better content
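Computed per system summary, the reported correlation could be obtained along these lines; this sketch assumes a Pearson correlation (the slides do not name the coefficient) and hypothetical arrays of per-summary scores.

from scipy.stats import pearsonr

def coverage_specificity_correlation(coverage_scores, avg_specificities):
    """Correlation between coverage scores and average specificity of system
    summaries; the slide reports r = -0.16 with p = 0.0006."""
    r, p = pearsonr(coverage_scores, avg_specificities)
    return r, p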

Page 32:

But the correlation is not very high
Specificity is related to the realization of content, which is different from the importance of the content
Content quality = content importance + appropriate specificity level
Content importance: ROUGE scores, i.e. n-gram overlap of system summary and human summary; the standard evaluation of automatic summaries

Page 33:

Specificity as one of the predictors
Linear regression: Coverage score ~ ROUGE-2 (bigrams) + specificity
Weights for predictors in the regression model:

Predictor   | Mean β | Significance (hypothesis β = 0)
(Intercept) | 0.212  | 2.3e-11
ROUGE-2     | 1.299  | < 2.0e-16
Specificity | -0.166 | 3.1e-05

Is the combination a better predictor than ROUGE alone?
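The regression itself could be fit as in the sketch below (ordinary least squares on two predictors); the variable names are placeholders for the per-summary ROUGE-2, specificity, and coverage values.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_coverage_model(rouge2, specificity, coverage):
    """Fit Coverage ~ ROUGE-2 + specificity by ordinary least squares and
    return the intercept and the two beta weights, as in the table above."""
    X = np.column_stack([rouge2, specificity])
    model = LinearRegression().fit(X, coverage)
    return model.intercept_, model.coef_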

Page 34:

2. Specificity and linguistic quality
Used different data: TAC 2009. (DUC 2002 only reported the number of errors, and only as a range of 1-5 errors.)
TAC 2009 linguistic quality score: manually judged on a 1-10 scale; combines different aspects: coherence, referential clarity, grammaticality, redundancy

Page 35:

What is the average specificity in the different score categories?

Ling. score  | No. summaries | Average specificity
Poor (1, 2)  | 202           | 0.71
Mediocre (5) | 400           | 0.72
Best (9, 10) | 79            | 0.77

More general ~ lower score! General content is useful, but it needs the proper context.
Example: if a summary starts as follows, "We are quite a ways from that, actually. As ice and snow at the poles melt, …", its specificity is low and its linguistic quality score is 1.

Page 36:

Data for analyzing the generalization operation
Aligned pairs of abstract and source sentences conveying the same content; the traditional data used for compression experiments
Ziff-Davis tree alignment corpus [Galley & McKeown (2007)]: 15,964 sentence pairs; any number of deletions, up to 7 substitutions
Only 25% of abstract sentences are mapped, but still useful for observing the trends

Page 37:

Generalization operation in human abstracts

Transition | No. pairs | % pairs
S→S        | 6371      | 39.9
S→G        | 5679      | 35.6
G→G        | 3562      | 22.3
G→S        | 352       | 2.2

One-third of all transformations are specific-to-general.
Human abstracts involve a lot of generalization.
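Given the aligned abstract/source pairs and a general/specific labeler, the transition table above can be tallied as in this sketch; `is_specific` is a hypothetical stand-in for the classifier's binary decision.

from collections import Counter

def transition_distribution(aligned_pairs, is_specific):
    """Count specificity transitions over aligned (source, abstract) sentence
    pairs; e.g. 'S→G' means a specific source sentence became a general
    abstract sentence. Returns counts and percentages per transition."""
    counts = Counter()
    for source_sent, abstract_sent in aligned_pairs:
        src = "S" if is_specific(source_sent) else "G"
        abs_label = "S" if is_specific(abstract_sent) else "G"
        counts[src + "→" + abs_label] += 1
    total = sum(counts.values())
    return {t: (n, 100.0 * n / total) for t, n in counts.items()}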

Page 38:

How do specific sentences get converted to general ones?

Transition | Orig. length (words) | New/orig. length (%) | Avg. deletions (words)
S→G        | 33.5                 | 40.8                 | 21.4
S→S        | 33.4                 | 56.6                 | 16.3
G→G        | 21.5                 | 60.8                 | 9.3
G→S        | 22.7                 | 66.0                 | 8.4

Choose long sentences and compress heavily!
A measure of generality would be useful to guide compression; currently only importance and grammaticality are used.

Page 39:

Use of general sentences in human extracts

Details of Maxwell's death were sketchy. Folksy was an understatement. "Long live democracy!" Instead it sank like the Bismarck.

Example use of a general sentence in a summary:
…With Tower's qualifications for the job, the nominations should have sailed through with flying colors. [Specific]
Instead it sank like the Bismarck. [General]

Future: can we learn to generate and select general sentences to include in automatic summaries?

Page 40:

Conclusions
Built a classifier for general and specific sentences: used existing annotations for training, but tested on new data and with task-based evaluation
The confidence of the classifier is highly correlated with human agreement
Analyzed human and machine summaries: machine summaries are too specific, but adding general sentences is difficult because the context has to be right

Page 41:

Further details in:

Annie Louis and Ani Nenkova, Automatic identification of general and specific sentences by leveraging discourse annotations, Proceedings of IJCNLP, 2011 (to appear).

Annie Louis and Ani Nenkova, Text specificity and impact on quality of news summaries, Proceedings of the ACL-HLT Workshop on Monolingual Text-To-Text Generation, 2011.

Annie Louis and Ani Nenkova, Creating Local Coherence: An Empirical Assessment, Proceedings of NAACL-HLT, 2010.

Page 42:

Two types of local coherence: entity & rhetorical
Local coherence: adjacent sentences in a text flow from one to another
Entity coherence (same topic): "John was hungry. He went to a restaurant."
But only 42% of sentence pairs are entity-linked [previous corpus studies]
Will core discourse relations connect the non-entity-sharing sentence pairs? A popular hypothesis in prior work

Page 43:

Investigations into text quality
The mix of discourse relations in a text is highly predictive of the perceived quality of the text
Both implicit and explicit relations are needed to predict text quality
Predicting the sense of implicit discourse relations is a very difficult task; most are predicted to be "Expansion"
How is local coherence created?

Page 44:

Joint analysis combining PDTB and OntoNotes annotations: 590 articles, with noun phrase coreference from OntoNotes
40 to 50% of sentence pairs do not share entities, across articles of different lengths

Page 45:

Expansions cover most of the non-entity-sharing instances

Page 46:

Expansions have the lowest rate of coreference

Page 47:

Rate of coreference in 2nd level elaboration relations


Page 48:

Example Instantiation and List relations

Instantiation:
The economy is showing signs of weakness, particularly among manufacturers.
Exports, which played a key role in fueling growth over the last two years, seem to have stalled.

List:
Many of Nasdaq's biggest technology stocks were in the forefront of the rally.
- Microsoft added 2 1/8 to 81 3/4 and Oracle Systems rose 1 1/2 to 23 1/4.
- Intel was up 1 3/8 to 33 3/4.

Page 49:

Overall distribution of sentence pairs among the two coherence devices
30% of sentence pairs have no coreference and are in a weak discourse relation (Expansion/EntRel)
We must explore elaborations more closely to identify how they create coherence