


Get To The Point: Summarization with Pointer-Generator Networks

Abigail See*, Peter J. Liu, Christopher D. Manning
* Work done partially during an internship at Google Brain

The Need for Abstractive Summarization
Automatic text summarization is increasingly vital in the digital age. Two approaches:

Extractive summarization: select and rearrange passages from the original text
• More restrictive
• Most past work has been extractive
• Easier to get reasonable performance

Abstractive summarization: generate novel sentences
• More expressive
• Humans are abstractive summarizers
• More difficult and unpredictable

Abstractive summarization is essential for high-quality output.

Abstractive Summarization with RNNs
• Recurrent Neural Networks (RNNs) provide a potentially powerful solution for abstractive summarization.
• Using the attention mechanism (see below), they can generate new words (Germany beat Argentina 2-0) by attending to relevant words (victorious, win) in the source text.

[Figure: Baseline sequence-to-sequence model with attention. Encoder hidden states are computed over the source text "Germany emerge victorious in 2-0 win against Argentina on Saturday ...". Given the partial summary "<START> Germany", the decoder hidden state produces an attention distribution over the source, a context vector, and a vocabulary distribution from which the next word ("beat") is generated.]
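To give a rough sense of the attention step in the figure above, here is a minimal NumPy sketch. It uses simple dot-product scoring for brevity (the poster's model uses the additive attention of Bahdanau et al.), and all variable names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Compute an attention distribution over source positions and the resulting context vector."""
    scores = encoder_states @ decoder_state       # one relevance score per source token
    weights = np.exp(scores - scores.max())       # softmax (shifted for numerical stability)
    attn = weights / weights.sum()                # attention distribution a (sums to 1)
    context = attn @ encoder_states               # context vector: attention-weighted encoder states
    return attn, context

# toy example: 10 source tokens, hidden size 4
rng = np.random.default_rng(0)
enc_states = rng.standard_normal((10, 4))
dec_state = rng.standard_normal(4)
attn, context = attend(dec_state, enc_states)
print(attn.round(3), context.round(3))
```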

Two common problems:
1. Summaries repeat themselves (e.g. Germany beat Germany beat Germany beat…)
2. Summaries reproduce factual details inaccurately (e.g. Germany beat Argentina 3-2)

Easier Copying with Pointer-Generator Network
Problem: Factual details (especially rare and OOV words) are copied inaccurately.

Solution: A hybrid network that can copy via pointing, or generate from a fixed vocabulary

[Figure: Pointer-generator network. As before, encoder hidden states over the source text "Germany emerge victorious in 2-0 win against Argentina on Saturday ..." and decoder hidden states over the partial summary "<START> Germany beat" yield an attention distribution, a context vector, and a vocabulary distribution. These are combined into a final distribution, from which the next words ("Argentina", then "2-0") can be copied from the source or generated from the vocabulary.]

• For each summary word, the network first calculates the generation probability p_gen.
• p_gen interpolates between copying from the attention distribution a and generating from the vocabulary distribution P_vocab:

$P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a_i$

Here P(w) is the probability that the next word is w, P_vocab(w) is the probability of generating w from the fixed vocabulary, and the sum collects the attention distribution everywhere w appears in the source text.
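A minimal NumPy sketch of this interpolation, assuming an "extended vocabulary" that appends source-only OOV words after the fixed vocabulary. The function and variable names (final_distribution, src_ids, and so on) are illustrative, not the authors' TensorFlow implementation, and p_gen is taken as a given scalar rather than computed from the decoder state.

```python
import numpy as np

def final_distribution(vocab_dist, attn_dist, src_ids, p_gen, ext_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i."""
    final = np.zeros(ext_vocab_size)
    final[: len(vocab_dist)] = p_gen * vocab_dist        # generate from the fixed vocabulary
    np.add.at(final, src_ids, (1 - p_gen) * attn_dist)   # copy: scatter attention mass onto source word ids
    return final

# toy example: fixed vocabulary of 5 words; the source contains two OOV words (extended ids 5 and 6)
vocab_dist = np.array([0.1, 0.4, 0.2, 0.2, 0.1])         # P_vocab
attn_dist = np.array([0.5, 0.2, 0.2, 0.1])               # attention over 4 source positions
src_ids = np.array([5, 1, 6, 1])                         # source tokens as extended-vocabulary ids
dist = final_distribution(vocab_dist, attn_dist, src_ids, p_gen=0.3, ext_vocab_size=7)
print(dist, dist.sum())                                  # a valid distribution over the extended vocabulary
```

Because the copy term puts mass on extended-vocabulary ids, OOV source words can receive non-zero probability even though P_vocab can never generate them.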

Best of both worlds: abstractive (generating) and extractive (copying)

Advantages:
• Faster to train
• Easy to accurately reproduce phrases
• Can copy OOV words (don’t need a large vocabulary)

Eliminating Repetition with Coverage
Problem: Summaries are repetitive.

Solution: Penalize repeatedly attending to the same parts of the source text.

On each decoder timestep t, the coverage vector c^t tells us what has been attended to (thus summarized) so far. It is the sum of the attention distributions over all previous decoder timesteps:

$c^t = \sum_{t'=0}^{t-1} a^{t'}$

We penalize overlap between the coverage vector c^t and the new attention distribution a^t with the coverage loss function

$\mathrm{covloss}_t = \sum_i \min(a^t_i, c^t_i)$,

i.e. the overlap between coverage and current attention.

Result: repetition is reduced to a similar level as the reference summaries.

[Chart: "Coverage eliminates undesirable repetition": % of 1-grams, 2-grams, 3-grams, 4-grams and whole sentences that are duplicates, for the pointer-generator without coverage, the pointer-generator with coverage, and the reference summaries.]
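A minimal sketch of the coverage bookkeeping above, under the same illustrative NumPy setup (names such as attn_history are hypothetical): it accumulates the coverage vector across decoder steps and computes covloss_t at each step.

```python
import numpy as np

def coverage_losses(attn_history):
    """For each decoder step t, covloss_t = sum_i min(a_i^t, c_i^t), where c^t sums all earlier attention."""
    coverage = np.zeros_like(attn_history[0])                    # c^0 = 0: nothing covered yet
    losses = []
    for attn in attn_history:                                    # one attention distribution a^t per decoder step
        losses.append(float(np.minimum(attn, coverage).sum()))   # overlap between coverage and current attention
        coverage = coverage + attn                               # c^{t+1} = c^t + a^t
    return losses

# toy example: the decoder attends to the same source region twice, so step 2 is penalized
steps = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
print(coverage_losses(steps))   # [0.0, 0.9]
```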

Experiments
Dataset: CNN/Daily Mail (news article → multi-sentence summary)
• Our pointer-generator + coverage model beats the best abstractive system.
• Extractive systems and the lead-3 baseline remain difficult to beat.
• The ROUGE metric is not robust to paraphrasing.

[Table: ROUGE results for abstractive and extractive systems.]
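To see why ROUGE is brittle under paraphrase, here is a toy sketch of ROUGE-1 recall (the simplest variant, with clipped unigram counts); the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams matched in the candidate (counts clipped to the candidate's)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())

reference = "germany beat argentina 2-0"
print(rouge1_recall("germany beat argentina 2-0", reference))           # 1.0: copying the reference scores perfectly
print(rouge1_recall("germany defeated argentina two nil", reference))   # 0.5: a faithful paraphrase scores much lower
```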

Example Output

[Figure: example article and output summaries. The baseline model produces factual inaccuracies and UNKs; the pointer-generator model copies accurately and deals with OOVs but repeats itself; the coverage model has no repetition. Green highlighting shows the generation probability p_gen; yellow highlighting shows the final value of the coverage vector (i.e. what has been covered by the final summary).]

How Abstractive Is Our Output?
• Our network uses the pointer more than the generator (average p_gen = 0.17)
• It produces some novel words and phrases, but fewer than the reference summaries

[Chart: % of 1-grams, 2-grams, 3-grams, 4-grams and whole sentences that are novel, for the pointer-generator with coverage, the baseline model, and the reference summaries.]

Open question: How to make the pointer-generator network more abstractive?
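One plausible way to compute the novelty statistic behind the chart above (a sketch, not the authors' evaluation script): count the fraction of summary n-grams that never appear in the source article.

```python
def percent_novel(summary_tokens, article_tokens, n):
    """Percentage of the summary's n-grams that do not occur anywhere in the source article."""
    summary_ngrams = [tuple(summary_tokens[i:i + n]) for i in range(len(summary_tokens) - n + 1)]
    if not summary_ngrams:
        return 0.0
    article_ngrams = {tuple(article_tokens[i:i + n]) for i in range(len(article_tokens) - n + 1)}
    novel = sum(ngram not in article_ngrams for ngram in summary_ngrams)
    return 100.0 * novel / len(summary_ngrams)

article = "germany emerge victorious in 2-0 win against argentina on saturday".split()
summary = "germany beat argentina 2-0".split()
print(percent_novel(summary, article, n=2))   # 100.0: none of the summary's bigrams appear in the article
```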


Conclusion
• Pointer-generator networks enable more accurate copying, are easier to train, and can deal with OOVs
• Coverage drastically reduces repetition
• The ROUGE metric is of limited use for evaluating abstractive systems
• Future work: make the pointer-generator network more abstractive

Acknowledgements: (i) NVIDIA Corporation (ii) DARPA Deep Exploration and Filtering of Text (DEFT) Program under AFRL contract no. FA8750-13-2-0040
