Impact of Citing Papers for Summarisation of Clinical Documents

Impact of Citing Papers for (Extractive) Summarisation of Clinical Documents. Diego Mollá (1), Christopher Jones (1), Abeed Sarker (2). (1) Macquarie University, Sydney, Australia; (2) Arizona State University, Tempe, AZ, USA. ALTA 2014, Melbourne, Australia.


Page 1: Impact of Citing Papers for Summarisation of Clinical Documents

Impact of Citing Papers for (Extractive) Summarisation of Clinical Documents

Diego Mollá (1), Christopher Jones (1), Abeed Sarker (2)

(1) Macquarie University, Sydney, Australia
(2) Arizona State University, Tempe, AZ, USA

ALTA 2014, Melbourne, Australia


Background and Related Work Finding the Best Fit to a Citance Extracting the Summary

Citation-based summarisation

Extractive Summarisation

Build a summary by extracting text from the original document.

Citation-based summarisation

Use information from citing texts to build a summary.

- Citance: Text surrounding a reference in a citing paper.

Citing Papers for Summarisation D Molla, C Jones, A Sarker 2/21


Contents

Background and Related Work

Finding the Best Fit to a Citance

Extracting the Summary


Related Work

Citation-based summarisation

- Garfield et al. (1964) performed manual citation analysis to characterise a publication.

- Nakov et al. (2004) used the text from citances to build a summary.

- Qazvinian & Radev (2010) automatised the extraction of citances.

- Further work on using citances as a surrogate of, or as an addition to, the document summary (Mohammad et al. 2009; Abu-Jbara & Radev 2011).


TAC2014 BiomedSumm Track

Data set

- Training and development: 20 documents.
  - 4 annotators per document.

- Test set: 30 documents.

- Each document has 10 referring papers.

Tasks

Task 1a: Identify the text spans from the reference paper that most accurately reflect the text from the citance.

Task 1b: Classify what facet of the paper a text span belongs to.

Task 2: Generate a structured summary of the reference paper and all of the community discussion of the paper represented in the citances.


Citances for Extractive Summarisation

Research Question

Can information from citances help to extract sentences from the reference paper?

Method

1. Rank sentences from reference paper according to their fit to citances (BiomedSumm task 1a).

2. Integrate the sentence rankings into extractive summarisation.

3. Compare against versions that do not incorporate the sentence rankings.


General Approach

Approach

1. Build vector models of each sentence (tf.idf or SVD).

2. Compute cosine similarity between sentence and citance.

3. Extract top n = 5 sentences (straight or after MMR).
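The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the tokeniser, the smoothed idf formula, the function names (`tfidf_vectors`, `cosine`, `best_fit`) and the greedy form of MMR are all choices made here, and the model is built from the reference sentences plus a single citance rather than from all citances. The SVD variant is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens with edge punctuation stripped (a simplification)."""
    tokens = (tok.strip(".,;:()[]\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def tfidf_vectors(texts):
    """Return one sparse tf.idf vector (dict: word -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    df = Counter(word for doc in docs for word in doc)
    n_docs = len(docs)
    # Assumed idf: smoothed, so words found in every text keep a non-zero weight.
    return [
        {w: tf * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_fit(sentences, citance, n=5, mmr_lambda=None):
    """Return the indices of the n sentences that best fit the citance.

    With mmr_lambda=None the sentences are taken straight by cosine score;
    otherwise they are picked greedily by Maximal Marginal Relevance.
    """
    vectors = tfidf_vectors(sentences + [citance])
    sent_vecs, cit_vec = vectors[:-1], vectors[-1]
    relevance = [cosine(v, cit_vec) for v in sent_vecs]
    if mmr_lambda is None:
        return sorted(range(len(sentences)), key=lambda i: -relevance[i])[:n]

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < n:
        def mmr(i):
            # Penalise similarity to sentences that were already selected.
            redundancy = max(
                (cosine(sent_vecs[i], sent_vecs[j]) for j in selected), default=0.0
            )
            return mmr_lambda * relevance[i] - (1 - mmr_lambda) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Calling `best_fit(sentences, citance)` gives the straight top 5, and `best_fit(sentences, citance, mmr_lambda=0.97)` gives the MMR variant with the λ value reported in the results.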


tf.idf and SVD

Finding best match to citance 1 in reference paper 1

[Diagram: the sentences of reference paper rp1 (rp sentence 1, rp sentence 2, ...) and the citances to rp1 (citance 1, citance 2, ...) form the rows rp 1, rp 2, ..., c 1, c 2, ... of a tf.idf (+SVD) matrix over the words w1, w2, ...; the cosine between the rp rows and the citance row gives score rp 1, score rp 2, ...]


Adding Data to the tf.idf (SVD) Model

Topics: All sentences from citing papers of reference paper 1.

Documents: All sentences from all BiomedSumm documents.

Abstracts: All sentences from a separate collection of 2,657 PubMed abstracts.

Finding best match to citance 1 in reference paper 1

[Diagram: the rows of the tf.idf (+SVD) matrix over the words w1, w2, ... are the sentences of reference paper rp1 (rp 1, rp 2, ...), the citances to rp1 (c 1, c 2, ...) and, as added data, the Topics sentences (t 1, t 2, ...); the cosine still gives score rp 1, score rp 2, ... for the rp sentences.]

[Diagram repeated for the other two data sources: the Documents sentences (doc sentence 1, doc sentence 2, ...; rows d 1, d 2, ...) or the Abstracts sentences (abs sentence 1, abs sentence 2, ...; rows a 1, a 2, ...) take the place of the Topics sentences.]

Adding Context to the tf.idf (SVD) Model

Extend sentences with a context window of n sentences.

Finding best match to citance 1 in reference paper 1, context window n = 20

[Diagram: the rows of the tf.idf (+SVD) matrix are now windows of text: rp sentences 1..10, rp sentences 2..11, ...; citances 1..10, citances 2..11, ...; top sentences 1..10, top sentences 2..11, ... The cosine still gives score rp 1, score rp 2, ...]
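One possible reading of the context window is a sliding window that starts at each sentence, as the spans 1..10, 2..11 in the diagram suggest. The sketch below follows that reading only; the function name `sliding_windows` and its `size` parameter are hypothetical, and the slide does not state how the window size relates to n = 20 or how the edges of the document are handled.

```python
def sliding_windows(sentences, size):
    """Return one text per start position: sentences[i:i + size] joined by spaces.

    Windows near the end of the document are shorter (an assumption;
    the slides do not say how the boundary is treated).
    """
    return [" ".join(sentences[i:i + size]) for i in range(len(sentences))]
```

The resulting window texts would then replace the single sentences as rows of the tf.idf (+SVD) model.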


Adding Information from WordNet and UMLS

- Replace word with WordNet synsets / UMLS ID / UMLS semantic types.

- Use word and WordNet synsets / UMLS ID / UMLS semantic types.

- Linear combination of scores: 0.5 × w + 0.2 × c + 0.3 × s.

Finding best match to citance 1 in reference paper 1, context window n = 20

[Diagram: the reference paper rows are built from Synsets/UMLS 1..10, Synsets/UMLS 2..11, ...; together with the rows for citance 1..10, citance 2..11, ... and top sentence 1..10, top sentence 2..11, ... they form the tf.idf (+SVD) matrix over the words w1, w2, ..., and the cosine gives score 1, score 2, ... Next to these are the score columns c 1, c 2, ... and s 1, s 2, ..., joined by a + sign.]

[Diagram repeated twice more: with reference paper rows built from Synsets/UMLS+words 1..10, Synsets/UMLS+words 2..11, ..., and with rows built from Sentence 1..10, Sentence 2..11, ..., where the cosine output is labelled w 1, w 2, ... and is joined with c 1, c 2, ... and s 1, s 2, ... by the + sign.]
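The linear combination 0.5 × w + 0.2 × c + 0.3 × s can be written as a small helper. The slides do not spell out what w, c and s stand for; the sketch assumes they are three lists of per-sentence cosine scores from three separate models (presumably word-based, UMLS-ID-based and semantic-type-based). The function name `combine_scores` is hypothetical.

```python
def combine_scores(w, c, s, weights=(0.5, 0.2, 0.3)):
    """Combine three per-sentence score lists into one weighted score list.

    The default weights are the ones given on the slide; the meaning of
    w, c and s is an assumption (see the text above).
    """
    if not (len(w) == len(c) == len(s)):
        raise ValueError("score lists must have the same length")
    weight_w, weight_c, weight_s = weights
    return [weight_w * wi + weight_c * ci + weight_s * si for wi, ci, si in zip(w, c, s)]
```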


Oracle

Oracle

- Use one annotator’s output as the system output.

- Compare against all other annotators.

- Average results.
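The oracle procedure can be sketched as a leave-one-out loop over annotators. This is a minimal sketch, assuming the average is taken over every ordered pair of distinct annotators; the names `oracle_score` and `jaccard` are hypothetical, and the Jaccard overlap is only a stand-in for the ROUGE-L metric used in the evaluation.

```python
from statistics import mean


def oracle_score(annotations, metric):
    """Treat each annotator as the system, score it against every other
    annotator with metric(system, reference), and average all the scores."""
    scores = [
        metric(annotations[i], annotations[j])
        for i in range(len(annotations))
        for j in range(len(annotations))
        if i != j
    ]
    return mean(scores)


def jaccard(a, b):
    """Stand-in metric: overlap between two sets of selected sentence ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

For example, `oracle_score([[1, 2], [2, 3], [1, 2]], jaccard)` averages the six pairwise overlaps of three annotators.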


Results

System                                     R      P      F1     F1 95% CI

Abstracts                                  0.190  0.230  0.193  0.179 - 0.208
tf.idf                                     0.331  0.290  0.290  0.276 - 0.303
MMR λ = 0.97                               0.334  0.293  0.293  0.279 - 0.307
SVD with 500 components                    0.334  0.295  0.295  0.281 - 0.308

Topics                                     0.344  0.311  0.307  0.293 - 0.321
0.2c + 0.3s + 0.5w                         0.364  0.294  0.309  0.297 - 0.320
MMR λ = 0.97 on topics                     0.345  0.314  0.311  0.296 - 0.325
Topics + context 20                        0.333  0.334  0.312  0.297 - 0.326
0.2c + 0.3s + 0.5w on topics + context 20  0.356  0.307  0.312  0.299 - 0.325
Documents + context 20                     0.334  0.336  0.314  0.299 - 0.327
Documents                                  0.347  0.325  0.316  0.303 - 0.330
Documents + abstracts                      0.347  0.327  0.317  0.302 - 0.332
MMR λ = 0.97 on topics + context 20        0.336  0.340  0.317  0.303 - 0.331

Topics + context 50                        0.341  0.336  0.318  0.302 - 0.332

Oracle                                     0.442  0.484  0.413  0.404 - 0.421

Table: ROUGE-L results of TAC task 1a, sorted by F1. The best result is in boldface, and all results within the 95% confidence interval range of the best result are in italics.
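For readers unfamiliar with the metric, ROUGE-L is based on the longest common subsequence (LCS) between a candidate text and a reference text. The sketch below is a simplified single-reference version with whitespace tokenisation and a balanced F1; it is not the official ROUGE toolkit and will not reproduce the figures in the table. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (recall, precision, F1) based on the LCS of the two texts."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```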


Oracle, Baselines

Oracle

- Use the data of one annotator.

- Compare against other annotators.

Baselines

- Score sentence i with the sum of tf.idf/SVD of its vector.

- Use same variants as in task 1a (add texts, context, WordNet, UMLS).
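The tf.idf baseline can be sketched as follows: each sentence is scored by the sum of the components of its tf.idf vector, with no use of citances. The tokenisation, the smoothed idf formula and the names `baseline_scores` and `top_sentences` are assumptions made for this illustration; the SVD variant is not shown.

```python
import math
from collections import Counter


def baseline_scores(sentences):
    """Score each sentence with the sum of its tf.idf weights."""
    docs = [Counter(s.lower().split()) for s in sentences]
    df = Counter(word for doc in docs for word in doc)
    n_docs = len(docs)
    # Assumed idf: smoothed log((1 + N) / (1 + df)) + 1.
    return [
        sum(tf * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, tf in doc.items())
        for doc in docs
    ]


def top_sentences(sentences, k):
    """Return the k highest-scoring sentences in their original order."""
    scores = baseline_scores(sentences)
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(order[:k])]
```

Note that a plain sum favours long sentences, since every extra word adds weight.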


Using Citing Text

Approach

1. Rank sentences as in task 1a.

2. Combine ranks to produce a final score.

$$\mathrm{score}(i) = \sum_{c \in \mathrm{citances}} \left( 1 - \frac{\mathrm{rank}(i, c)}{n} \right)$$
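The rank combination can be sketched as below. The slide does not define rank(i, c) or n precisely, so this sketch assumes that rank(i, c) is the 0-based position of sentence i when sentences are sorted by their fit to citance c, and that n is the number of sentences in the reference paper. The 250-word budget in `build_summary` follows the limit mentioned for the task 2 results, but filling it with whole sentences is also an assumption. All function names are hypothetical.

```python
def rank_positions(scores):
    """Map sentence index -> 0-based rank (0 = best fit) for one citance."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {idx: rank for rank, idx in enumerate(order)}


def combined_scores(score_matrix):
    """score_matrix[c][i] is the fit of sentence i to citance c.

    Returns, for each sentence i, the sum over citances of 1 - rank(i, c) / n.
    """
    n = len(score_matrix[0])
    totals = [0.0] * n
    for scores in score_matrix:
        ranks = rank_positions(scores)
        for i in range(n):
            totals[i] += 1 - ranks[i] / n
    return totals


def build_summary(sentences, totals, max_words=250):
    """Add whole sentences by decreasing score while the word budget allows,
    then return them in document order."""
    chosen, used = [], 0
    for i in sorted(range(len(sentences)), key=lambda i: -totals[i]):
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return [sentences[i] for i in sorted(chosen)]
```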


Results: BiomedSumm Summaries

System                                              R      P      F1     F1 95% CI

Oracle                                              0.459  0.461  0.458  0.446 - 0.470

tf.idf                                              0.260  0.264  0.260  0.226 - 0.290
SVD with 500 components                             0.264  0.247  0.254  0.236 - 0.272
Topics                                              0.260  0.265  0.261  0.226 - 0.292
Documents                                           0.259  0.265  0.260  0.224 - 0.290
Topics + context 5                                  0.259  0.265  0.261  0.226 - 0.291
Topics + context 20                                 0.252  0.261  0.255  0.220 - 0.285

task1a (tf.idf)                                     0.384  0.375  0.378  0.350 - 0.408
task1a (MMR λ = 0.97 on topics)                     0.398  0.396  0.396  0.372 - 0.421
task1a (MMR λ = 0.97 on topics + context 20)        0.420  0.407  0.412  0.385 - 0.438
task1a (0.2c + 0.3s + 0.5w)                         0.398  0.392  0.394  0.369 - 0.419
task1a (0.2c + 0.3s + 0.5w on topics)               0.405  0.399  0.401  0.378 - 0.423
task1a (0.2c + 0.3s + 0.5w on topics + context 20)  0.417  0.404  0.409  0.387 - 0.431

Table: ROUGE-L results of task 2 using the TAC 2014 data. The summary size was constrained to 250 words. In boldface is the best result. In italics are the results within the 95% confidence intervals of the best result.


Results: Targeting Paper Abstracts

System                                              R      P      F1     F1 95% CI

tf.idf                                              0.293  0.192  0.227  0.190 - 0.261
SVD with 500 components                             0.291  0.181  0.218  0.197 - 0.239
Documents                                           0.289  0.192  0.226  0.188 - 0.260
0.2c + 0.3s + 0.5w                                  0.314  0.210  0.247  0.207 - 0.284

task1a (tf.idf)                                     0.425  0.264  0.320  0.293 - 0.353
task1a (MMR λ = 0.97)                               0.418  0.275  0.324  0.299 - 0.351
task1a (MMR λ = 0.97 on topics)                     0.436  0.272  0.330  0.300 - 0.363
task1a (0.2c + 0.3s + 0.5w)                         0.439  0.276  0.333  0.308 - 0.358
task1a (0.2c + 0.3s + 0.5w on topics)               0.428  0.276  0.330  0.304 - 0.357
task1a (0.2c + 0.3s + 0.5w on topics + context 20)  0.451  0.279  0.338  0.312 - 0.366

Table: ROUGE-L results of task 2 using the document abstracts as the target summaries. The summary size was constrained to 250 words. In boldface is the best result. In italics are the results within the 95% confidence intervals of the best result.


Conclusions

Conclusions

1. Used unsupervised methods to find the best fit to a citance.

2. Results of best fit improved as we added data to the tf.idf/SVD models and as we added context.

3. Information from the citances can help in extractive summarisation.

Further Work

- If there are more training data, try supervised methods.

- Use annotators to produce target summaries.

- Explore new methods to add further data and context.

Questions?
