


Get To The Point: Summarization with Pointer-Generator Networks

Abigail See*, Peter J. Liu, Christopher D. Manning
* Work done partially during an internship at Google Brain

The Need for Abstractive Summarization
Automatic text summarization is increasingly vital in the digital age. Two approaches:

Extractive summarization: select and rearrange passages from the original text
• More restrictive
• Most past work has been extractive
• Easier to get reasonable performance

Abstractive summarization: generate novel sentences
• More expressive
• Humans are abstractive summarizers
• More difficult and unpredictable

Abstractive summarization is essential for high-quality output.

Abstractive Summarization with RNNs
• Recurrent Neural Networks (RNNs) provide a potentially powerful solution for abstractive summarization.
• Using the attention mechanism (see below), they can generate new words (Germany beat Argentina 2-0) by attending to relevant words (victorious, win) in the source text.

[Figure: Baseline sequence-to-sequence model with attention. Encoder hidden states are computed over the source text "Germany emerge victorious in 2-0 win against Argentina on Saturday ...". Given the partial summary "<START> Germany", the decoder hidden state produces an attention distribution over the source, a context vector, and a vocabulary distribution from which the next word ("beat") is generated.]
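To give a rough sense of the attention step in the figure above, here is a minimal NumPy sketch. It uses simple dot-product scoring for brevity (the poster's model uses the additive attention of Bahdanau et al.), and all variable names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Compute an attention distribution over source positions and the resulting context vector."""
    scores = encoder_states @ decoder_state       # one relevance score per source token
    weights = np.exp(scores - scores.max())       # softmax (shifted for numerical stability)
    attn = weights / weights.sum()                # attention distribution a (sums to 1)
    context = attn @ encoder_states               # context vector: attention-weighted encoder states
    return attn, context

# toy example: 10 source tokens, hidden size 4
rng = np.random.default_rng(0)
enc_states = rng.standard_normal((10, 4))
dec_state = rng.standard_normal(4)
attn, context = attend(dec_state, enc_states)
print(attn.round(3), context.round(3))
```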

Two common problems:
1. Summaries repeat themselves (e.g. Germany beat Germany beat Germany beat…)
2. Summaries reproduce factual details inaccurately (e.g. Germany beat Argentina 3-2)

Easier Copying with Pointer-Generator Network
Problem: Factual details (especially rare and OOV words) are copied inaccurately.

Solution: A hybrid network that can copy via pointing, or generate from a fixed vocabulary

[Figure: Pointer-generator network. As before, encoder hidden states over the source text "Germany emerge victorious in 2-0 win against Argentina on Saturday ..." and decoder hidden states over the partial summary "<START> Germany beat" yield an attention distribution, a context vector, and a vocabulary distribution. These are combined into a final distribution, from which the next words ("Argentina", then "2-0") can be copied from the source or generated from the vocabulary.]

• For each summary word, the network first calculates the generation probability p_gen.
• p_gen interpolates between copying from the attention distribution a and generating from the vocabulary distribution P_vocab:

$P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a_i$

Here P(w) is the probability that the next word is w, P_vocab(w) is the probability of generating w from the fixed vocabulary, and the sum collects the attention distribution everywhere w appears in the source text.
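A minimal NumPy sketch of this interpolation, assuming an "extended vocabulary" that appends source-only OOV words after the fixed vocabulary. The function and variable names (final_distribution, src_ids, and so on) are illustrative, not the authors' TensorFlow implementation, and p_gen is taken as a given scalar rather than computed from the decoder state.

```python
import numpy as np

def final_distribution(vocab_dist, attn_dist, src_ids, p_gen, ext_vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i."""
    final = np.zeros(ext_vocab_size)
    final[: len(vocab_dist)] = p_gen * vocab_dist        # generate from the fixed vocabulary
    np.add.at(final, src_ids, (1 - p_gen) * attn_dist)   # copy: scatter attention mass onto source word ids
    return final

# toy example: fixed vocabulary of 5 words; the source contains two OOV words (extended ids 5 and 6)
vocab_dist = np.array([0.1, 0.4, 0.2, 0.2, 0.1])         # P_vocab
attn_dist = np.array([0.5, 0.2, 0.2, 0.1])               # attention over 4 source positions
src_ids = np.array([5, 1, 6, 1])                         # source tokens as extended-vocabulary ids
dist = final_distribution(vocab_dist, attn_dist, src_ids, p_gen=0.3, ext_vocab_size=7)
print(dist, dist.sum())                                  # a valid distribution over the extended vocabulary
```

Because the copy term puts mass on extended-vocabulary ids, OOV source words can receive non-zero probability even though P_vocab can never generate them.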

Best of both worlds: abstractive (generating) and extractive (copying)

Advantages:
• Faster to train
• Easy to accurately reproduce phrases
• Can copy OOV words (don’t need a large vocabulary)

Eliminating Repetition with Coverage
Problem: Summaries are repetitive.

Solution: Penalize repeatedly attending to the same parts of the source text.

On each decoder timestep t, the coverage vector c^t tells us what has been attended to (thus summarized) so far. It is the sum of the attention distributions over all previous decoder timesteps:

$c^t = \sum_{t'=0}^{t-1} a^{t'}$

We penalize overlap between the coverage vector c^t and the new attention distribution a^t with the coverage loss function

$\mathrm{covloss}_t = \sum_i \min(a^t_i, c^t_i)$,

i.e. the overlap between coverage and current attention.

Result: repetition is reduced to a similar level as the reference summaries.

[Chart: "Coverage eliminates undesirable repetition": % of 1-grams, 2-grams, 3-grams, 4-grams and whole sentences that are duplicates, for the pointer-generator without coverage, the pointer-generator with coverage, and the reference summaries.]
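A minimal sketch of the coverage bookkeeping above, under the same illustrative NumPy setup (names such as attn_history are hypothetical): it accumulates the coverage vector across decoder steps and computes covloss_t at each step.

```python
import numpy as np

def coverage_losses(attn_history):
    """For each decoder step t, covloss_t = sum_i min(a_i^t, c_i^t), where c^t sums all earlier attention."""
    coverage = np.zeros_like(attn_history[0])                    # c^0 = 0: nothing covered yet
    losses = []
    for attn in attn_history:                                    # one attention distribution a^t per decoder step
        losses.append(float(np.minimum(attn, coverage).sum()))   # overlap between coverage and current attention
        coverage = coverage + attn                               # c^{t+1} = c^t + a^t
    return losses

# toy example: the decoder attends to the same source region twice, so step 2 is penalized
steps = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
print(coverage_losses(steps))   # [0.0, 0.9]
```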

Experiments
Dataset: CNN/Daily Mail (news article → multi-sentence summary)
• Our pointer-generator + coverage model beats the best abstractive system.
• Extractive systems and the lead-3 baseline remain difficult to beat.
• The ROUGE metric is not robust to paraphrasing.

[Table: ROUGE results for abstractive and extractive systems.]
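To see why ROUGE is brittle under paraphrase, here is a toy sketch of ROUGE-1 recall (the simplest variant, with clipped unigram counts); the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams matched in the candidate (counts clipped to the candidate's)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())

reference = "germany beat argentina 2-0"
print(rouge1_recall("germany beat argentina 2-0", reference))           # 1.0: copying the reference scores perfectly
print(rouge1_recall("germany defeated argentina two nil", reference))   # 0.5: a faithful paraphrase scores much lower
```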

Example Output

[Figure: example article and output summaries. The baseline model produces factual inaccuracies and UNKs; the pointer-generator model copies accurately and deals with OOVs but repeats itself; the coverage model has no repetition. Green highlighting shows the generation probability p_gen; yellow highlighting shows the final value of the coverage vector (i.e. what has been covered by the final summary).]

How Abstractive Is Our Output?
• Our network uses the pointer more than the generator (average p_gen = 0.17)
• It produces some novel words and phrases, but fewer than the reference summaries

[Chart: % of 1-grams, 2-grams, 3-grams, 4-grams and whole sentences that are novel, for the pointer-generator with coverage, the baseline model, and the reference summaries.]

Open question: How to make the pointer-generator network more abstractive?
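One plausible way to compute the novelty statistic behind the chart above (a sketch, not the authors' evaluation script): count the fraction of summary n-grams that never appear in the source article.

```python
def percent_novel(summary_tokens, article_tokens, n):
    """Percentage of the summary's n-grams that do not occur anywhere in the source article."""
    summary_ngrams = [tuple(summary_tokens[i:i + n]) for i in range(len(summary_tokens) - n + 1)]
    if not summary_ngrams:
        return 0.0
    article_ngrams = {tuple(article_tokens[i:i + n]) for i in range(len(article_tokens) - n + 1)}
    novel = sum(ngram not in article_ngrams for ngram in summary_ngrams)
    return 100.0 * novel / len(summary_ngrams)

article = "germany emerge victorious in 2-0 win against argentina on saturday".split()
summary = "germany beat argentina 2-0".split()
print(percent_novel(summary, article, n=2))   # 100.0: none of the summary's bigrams appear in the article
```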


Conclusion
• Pointer-generator networks enable more accurate copying, are easier to train, and can deal with OOVs
• Coverage drastically reduces repetition
• The ROUGE metric is of limited use for evaluating abstractive systems
• Future work: make the pointer-generator network more abstractive

Acknowledgements: (i) NVIDIA Corporation (ii) DARPA Deep Exploration and Filtering of Text (DEFT) Program under AFRL contract no. FA8750-13-2-0040
