
Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization

Kathleen McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, Julia Hirschberg
Department of Computer Science, Columbia University


Status of Multi-Document Summarization

- Robust: many existing systems (e.g., DUC 2004)
  - http://newsblaster.cs.columbia.edu
  - http://www.newsinessence.com
- Extensive quantitative evaluation (intrinsic): DUC 2001 – DUC 2005, comparing system summary content against human models
- Open question: do system-generated summaries help end-users to make better use of the news?

Extrinsic Evaluation

- Task-based evaluations of single-document summarization using IR
  - TIPSTER-II; Brandow et al.; Mani et al.; Mochizuki & Okumura
  - Other factors can determine the result (Jing et al.)
- Evaluation of evaluation metrics using a task similar to ours
  - Amigo et al.

Task Evaluation

- Hypothesis: multi-document summaries enable users to find information efficiently
- Task: fact gathering given a topic and questions; resembles an intelligence analyst's task
- Compared 4 parallel news-browsing systems:
  - Level 1: source documents only
  - Level 2: one-sentence multi-document summaries (e.g., Google News) linked to documents
  - Level 3: Newsblaster multi-document summaries linked to documents
  - Level 4: human-written multi-document summaries linked to documents

Results Preview

- Quality of facts gathered significantly better: Newsblaster vs. documents alone
- User satisfaction higher: Newsblaster and human summaries vs. documents alone and 1-sentence summaries
- Summaries contributed important facts: Newsblaster and human summaries vs. 1-sentence summaries
- Full multi-document summarization is more powerful than no summaries or single-sentence summaries

Outline

- Study design and execution
- Scoring
- Results

Evaluation Goals

- Do summaries help users find information needed to perform a fact-gathering task?
- Do users use information from the summary in gathering their facts?
- Do summaries increase user satisfaction with the online news system?
- Do users create better fact sets with an online news system that includes summaries than with one that does not?
- How does the type of summary (i.e., 1-sentence, system-generated, human-generated) affect quality of task output and user satisfaction?

Experimental Design

- Subjects performed four 30-minute fact-gathering scenarios
- Prompt: topic description plus three questions
- Given a web page as the sole resource
  - Space in which to compose the response
  - Instructed to cut and paste from summary or article
- Four event clusters per page
  - Two centrally relevant, two less relevant
  - 10 documents per cluster on average
- Complete survey after each scenario

Prompt

The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the "road map for peace," a diplomatic effort sponsored by the United States, Russia, the E.U. and the U.N., has suffered setbacks. However, unofficial negotiators have developed a plan known as the Geneva Accord for finding a permanent solution to the conflict.

- Who participated in the negotiations that produced the Geneva Accord?
- Apart from direct participants, who supported the Geneva Accord preparations and how?
- What has the response been to the Geneva Accord by the Palestinians and Israelis?

Experimental Design

- Subjects performed four 30-minute fact-gathering scenarios
- Prompt: topic description plus three questions
  - Produced a report containing a list of facts
- Given a web page as the sole resource
  - Space in which to compose the response
  - Instructed to cut and paste from summary or article and make a citation
- Four event clusters per page
  - Two centrally relevant, two less relevant
  - 10 documents per cluster on average
- Complete survey after each scenario

Level 1: Documents only, no summary

Level 2: 1-sentence summary for each event cluster, 1-sentence summary for each article

Full multi-document summaries

- Neither humans nor systems had access to the prompt
- Level 3: generated by Newsblaster for each event cluster
- Level 4: human-written summary for each event cluster
  - Summary writers hired to write summaries: English or Journalism students with high verbal SAT scores

Levels 3 and 4: full summary for each event cluster


Study Execution

- 45 subjects with varied backgrounds
  - 73% students (BS, BA, journalism, law)
  - Native speakers of English
  - Paid, with promise of a monetary prize for the best report
- 3 studies, controlling for scenario and level order; ~11 subjects per scenario per level

Results – What was Measured

- Report content across summary conditions (levels 1-4)
- User satisfaction per summary condition, based on user surveys
- Source of report content (summary or article), by counting fact citations

Scoring Report Content

- Compare subject reports against a gold standard
- Used the Pyramid method [HLT 2004]
  - Avoids postulating an ideal exhaustive report
  - Predicts multiple equally good reports
  - Provides a metric for comparison
- Gold standard for report x = pyramid of facts constructed from all reports except x (see the sketch below)
- Relative importance of facts determined by report writers; 34 reports per pyramid on average -> very stable
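A minimal sketch of the leave-one-out pyramid construction described above. It assumes facts have already been matched across reports and reduced to shared identifiers (a manual annotation step in the actual study); the function name is illustrative, not from the paper:

```python
from collections import Counter

def build_pyramid(reports, exclude_index):
    """Build a pyramid from all reports except the one being scored.

    reports: list of fact-id collections, one per subject report.
    A fact's weight (its tier) is the number of reports that include it.
    """
    weights = Counter()
    for i, facts in enumerate(reports):
        if i == exclude_index:
            continue                    # leave out report x
        weights.update(set(facts))      # each report counts a fact at most once
    return weights                      # fact id -> weight
```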

Pyramid representation

- Tiers of differentially weighted facts
  - Top: few facts, high weight
  - Bottom: many facts, low weight
- Report facts that don't appear in the pyramid have weight 0
- Duplicate report facts get weight 0

(Diagram: pyramid with tiers labeled from W=34 at the top down to W=33, ..., W=1 at the bottom)

Ideally informative report

- Does not include a fact from a lower tier unless all facts from higher tiers are included as well (see the scoring sketch below)
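Combining the pyramid weights with this definition gives a report's score: the summed weight of its distinct in-pyramid facts, normalized by the score of an ideally informative report of the same size. A minimal sketch under those assumptions, with illustrative names, using a pyramid as built by build_pyramid above:

```python
def pyramid_score(report_facts, pyramid):
    """Score one report against a pyramid (fact id -> weight)."""
    distinct = set(report_facts) & set(pyramid)  # duplicates and off-pyramid facts get weight 0
    raw = sum(pyramid[f] for f in distinct)
    n = len(set(report_facts))                   # report size, in distinct facts
    # An ideally informative report of n facts fills the highest tiers first.
    ideal = sum(sorted(pyramid.values(), reverse=True)[:n])
    return raw / ideal if ideal else 0.0
```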

Report Length

- Wide variation in length impacts scores
- We restricted report length to less than 1 standard deviation above the mean by truncating question answers (a sketch follows)
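A minimal sketch of that cutoff, simplified to truncate whole reports by word count rather than individual question answers; names are illustrative:

```python
import statistics

def truncate_reports(reports):
    """Cap each report at mean length + 1 standard deviation (in words)."""
    lengths = [len(words) for words in reports]
    cutoff = int(statistics.mean(lengths) + statistics.stdev(lengths))
    return [words[:cutoff] for words in reports]
```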

Results - Content

  Summary Level           Pyramid Score
  Level 1 (docs only)     .3354
  Level 2 (1-sentence)    .3757
  Level 3 (Newsblaster)   .4269
  Level 4 (Human)         .4027

Report quality improves from level 1 to level 3. (One scenario was dropped from the results as it was problematic for subjects.)

Statistical Analysis

- ANOVA shows summary level is a marginally significant factor
- Bonferroni method applied to determine differences between summary levels
  - Difference between Newsblaster and documents-only is significant (p = .05)
  - Differences between Newsblaster and 1-sentence or human summaries are not significant
- ANOVA shows that scenario, question, and subject are also significant factors (a sketch of this style of analysis follows)
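A minimal sketch of this style of analysis, assuming per-report pyramid scores grouped by summary level. It simplifies to a one-way ANOVA with Bonferroni-corrected pairwise t-tests, whereas the study's actual ANOVA also modeled scenario, question, and subject as factors; names are illustrative:

```python
from itertools import combinations
from scipy import stats

def compare_summary_levels(scores_by_level, alpha=0.05):
    """scores_by_level: dict mapping level name -> list of pyramid scores."""
    f_stat, p = stats.f_oneway(*scores_by_level.values())   # one-way ANOVA
    print(f"ANOVA: F = {f_stat:.2f}, p = {p:.3f}")
    pairs = list(combinations(scores_by_level, 2))
    corrected = alpha / len(pairs)                           # Bonferroni correction
    for a, b in pairs:
        _, p_pair = stats.ttest_ind(scores_by_level[a], scores_by_level[b])
        verdict = "significant" if p_pair < corrected else "not significant"
        print(f"{a} vs. {b}: p = {p_pair:.3f} ({verdict})")
```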

Results - User Satisfaction

- 6 questions in the exit survey required a response on a 1-5 scale
- Average rating increases with summary level:

            Level 1  Level 2  Level 3  Level 4
  Average   2.75     3.39     3.47     3.56

With full summaries, users read less

  Question                                          Level 1  Level 2  Level 3  Level 4
  A. What best describes your experience reading
     source articles?                               2.83     2.70     3.10     3.10

  (1 = I read a LOT more than I needed to; 5 = I only read those articles I needed to read)

With summaries, easier to write the report and tended to have more time

  Question                                          Level 1  Level 2  Level 3  Level 4
  B. How difficult do you think it was to write
     the report?                                    2.27     3.07     2.95     3.0
     (1 = Very difficult; 5 = Very easy)
  C. Do you feel you had enough time to write
     the report?                                    2.43     3.91     3.38     3.57
     (1 = I needed more time; 5 = I had more than enough time)

Usefulness improves with summary quality; human summaries help most with time

  Question                                          Level 1  Level 2  Level 3  Level 4
  D. What best describes your experience using
     article summaries?                             n/a      3.16     3.29     4.14
     (1 = They had nothing useful to say; 5 = Everything I needed to know)
  E. Did you feel that the automatic summaries
     saved you time, wasted time, or had no
     impact on your time budget?                    n/a      4.09     3.95     4.14
     (1 = Summaries wasted time; 5 = Summaries saved me time)

Multiple Choice Survey Questions

                                     Level 2  Level 3  Level 4
  1. Which was most helpful?
     Source articles helped most     64%      48%      29%
     Equally helpful                 32%      29%      29%
     Summaries helped most           5%       24%      43%
  2. How did you budget your time?
     Most searching, some writing    55%      48%      67%
     Half searching, half writing    39%      29%      19%
     Most writing, some searching    7%       24%      14%

Citation Patterns

- Report writers were significantly more likely to extract facts from the summaries when given Newsblaster or human summaries

                             Level 2  Level 3  Level 4
  Citations from summaries   8%       17%      27%

What we Learned

- With summaries, a significant increase in report quality
- We hypothesized summaries would reduce reading time
  - As summary quality increases, users draw facts from the summary significantly more often, with no decrease in report quality
  - Users claim they read fewer full documents with level 3 and 4 summaries
- Full multi-document summarization better than 1-sentence summaries
  - Almost 5 times the proportion of subjects using Newsblaster summaries say summaries are helpful, compared with subjects using 1-sentence summaries

Need for Follow-on Studies

- Why no significant increase in report quality from level 2 to level 3?
  - Interface differences: level 2 had a summary for each article, level 3 did not
  - Level 3 required extra clicks to see the list of articles
- Studies to investigate controlling report length
- Studies to investigate the impact of scenario and question


Conclusions

- Do summaries help? Yes
- Our task-based, extrinsic evaluation yielded significant conclusions:
  - Full multi-document summarization (Newsblaster, human summaries) helps users perform better at fact gathering than documents only
  - Users are more satisfied with full multi-document summarization than with Google News style 1-sentence summaries