

arXiv:1906.00072v1 [cs.CL] 31 May 2019

Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization

Sangwoo Cho, Logan Lebanoff, Hassan Foroosh, Fei Liu
Computer Science Department

University of Central Florida, Orlando, FL 32816, USA
{swcho,loganlebanoff}@knight.ucf.edu, {foroosh,feiliu}@cs.ucf.edu

Abstract

The most important obstacles facing multi-document summarization include excessive redundancy in source descriptions and the looming shortage of training data. These obstacles prevent encoder-decoder models from being used directly, but optimization-based methods such as determinantal point processes (DPPs) are known to handle them well. In this paper we seek to strengthen a DPP-based method for extractive multi-document summarization by presenting a novel similarity measure inspired by capsule networks. The approach measures redundancy between a pair of sentences based on surface form and semantic information. We show that our DPP system with improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets. Our findings are particularly meaningful for summarizing documents created by multiple authors containing redundant yet lexically diverse expressions.¹

1 Introduction

Multi-document summarization is arguably one of the most important tools for information aggregation. It seeks to produce a succinct summary from a collection of textual documents created by multiple authors concerning a single topic (Nenkova and McKeown, 2011). The summarization technique has seen growing interest in a broad spectrum of domains that include summarizing product reviews (Gerani et al., 2014; Yang et al., 2018), student survey responses (Luo and Litman, 2015; Luo et al., 2016), forum discussion threads (Ding and Jiang, 2015; Tarnpradab et al., 2017), and news articles about a particular event (Hong et al., 2014). Despite the empirical success, most of the datasets remain small, and the cost of hiring human annotators to create ground-truth summaries for multi-document inputs can be prohibitive.

¹ Our code and data are publicly available at https://github.com/ucfnlp/summarization-dpp-capsnet

Impressive progress has been made on neural abstractive summarization using encoder-decoder models (Rush et al., 2015; See et al., 2017; Paulus et al., 2017; Chen and Bansal, 2018). These models, nonetheless, are data-hungry and learn poorly from small datasets, as is often the case with multi-document summarization. To date, studies have primarily focused on single-document summarization (See et al., 2017; Celikyilmaz et al., 2018; Kryscinski et al., 2018) and sentence summarization (Nallapati et al., 2016; Zhou et al., 2017; Cao et al., 2018; Song et al., 2018), in part because parallel training data are abundant and can be conveniently acquired from the Web. Further, a notable issue with abstractive summarization is reliability. These models are equipped with the capability of generating new words not present in the source. With greater freedom of lexical choice, system summaries can contain inaccurate factual details and falsified content that prevent them from staying "true-to-original."

In this paper we instead focus on an extractive method exploiting the determinantal point process (DPP; Kulesza and Taskar, 2012) for multi-document summarization. DPP can be trained on small data, and because extractive summaries are free from manipulation, they largely remain true to the original. DPP selects a set of the most representative sentences from the given source documents to form a summary, while maintaining high diversity among summary sentences. It is one of a family of optimization-based summarization methods that performed strongest in previous summarization competitions (Gillick and Favre, 2009; Lin and Bilmes, 2010; Kulesza and Taskar, 2011).

Diversity is an integral part of the DPP model. It is modelled by pairwise repulsion between sentences. In this paper we exploit the capsule networks (Hinton et al., 2018) to measure pairwise sentence (dis)similarity, then leverage DPP to obtain a set of diverse summary sentences. Traditionally, the DPP method computes similarity scores based on the bag-of-words representation of sentences (Kulesza and Taskar, 2011) and with kernel methods (Gong et al., 2014). These methods, however, are incapable of capturing lexical and syntactic variations in the sentences (e.g., paraphrases), which are ubiquitous in multi-document summarization data, as the source documents are created by multiple authors with distinct writing styles. We hypothesize that the recently proposed capsule networks, which learn high-level representations based on the orientational and spatial relationships of low-level components, can be a suitable supplement to model pairwise sentence similarity.

Importantly, we argue that predicting sentence similarity within the context of summarization has its uniqueness. It estimates whether two sentences contain redundant information, based on both surface word form and their underlying semantics. For example, the two sentences "Snowstorm slams eastern US on Friday" and "A strong wintry storm was dumping snow in eastern US after creating traffic havoc that claimed at least eight lives" are considered similar because they carry redundant information and cannot both be included in the summary. These sentences are by no means semantically equivalent, nor do they exhibit a clear entailment relationship. The task thus should be distinguished from related tasks such as predicting natural language inference (Bowman et al., 2015; Williams et al., 2018) or semantic textual similarity (Cer et al., 2017). In this work, we describe a novel method to collect a large number of sentence pairs that are deemed similar for summarization purposes. We contrast this new dataset with those used for textual entailment for modeling sentence similarity and demonstrate its effectiveness in discriminating sentences and generating diverse summaries. The contributions of this work can be summarized as follows:

• we present a novel method inspired by the determinantal point process for multi-document summarization. The method includes a diversity measure assessing the redundancy between sentences, and a quality measure that indicates the importance of sentences. DPP extracts a set of summary sentences that are both representative of the document set and remain diverse;

• we present the first study exploiting capsule networks for determining sentence similarity for summarization purposes. It is important to recognize that summarization places particular emphasis on measuring redundancy between sentences, and this notion of similarity is different from that of entailment and semantic textual similarity (STS);

• our findings suggest that effectively modeling pairwise sentence similarity is crucial for increasing summary diversity and boosting summarization performance. Our DPP system with improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets.

2 Related Work

Extractive summarization approaches are the most popular in real-world applications (Carbonell and Goldstein, 1998; Daume III and Marcu, 2006; Galanis and Androutsopoulos, 2010; Hong et al., 2014; Yogatama et al., 2015). These approaches focus on identifying representative sentences from a single document or set of documents to form a summary. The summary sentences can be optionally compressed to remove unimportant constituents, such as prepositional phrases, to yield a succinct summary (Knight and Marcu, 2002; Zajic et al., 2007; Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011; Thadani and McKeown, 2013; Wang et al., 2013; Li et al., 2013, 2014; Filippova et al., 2015; Durrett et al., 2016). Extractive summarization methods are mostly unsupervised or lightly supervised using thousands of training examples. Given its practical importance, we explore an extractive method in this work for multi-document summarization.

It is not uncommon to cast summarization as a discrete optimization problem (Gillick and Favre, 2009; Takamura and Okumura, 2009; Lin and Bilmes, 2010; Hirao et al., 2013). In this formulation, a set of binary variables are used to indicate whether their corresponding source sentences are to be included in the summary. The summary sentences are selected to maximize the coverage of important source content, while minimizing the summary redundancy and subject to a length constraint. The optimization can be performed using an off-the-shelf tool such as Gurobi or IBM CPLEX, or via a greedy approximation algorithm. Notable optimization frameworks include integer linear programming (Gillick and Favre, 2009), determinantal point processes (Kulesza and Taskar, 2012), submodular functions (Lin and Bilmes, 2010), and minimum dominating set (Shen and Li, 2010). In this paper we employ the DPP framework because of its remarkable performance on various summarization problems (Zhang et al., 2016).
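The greedy approximation mentioned above can be sketched as follows. This is our own illustration, not the ICSISumm system: the concept weights, sentence structures, and budget are invented, and a real system would derive concepts (e.g., bigrams) and weights from the source documents.

```python
def greedy_coverage(sentences, concept_weight, budget):
    """Greedily pick sentences that maximize newly covered concept weight,
    subject to a total length budget (a common surrogate for the ILP)."""
    covered, summary, total = set(), [], 0
    while True:
        def gain(s):
            # Weight of concepts this sentence would newly cover.
            return sum(concept_weight[c] for c in s["concepts"] - covered)
        pool = [s for s in sentences
                if s not in summary and total + s["length"] <= budget and gain(s) > 0]
        if not pool:
            return summary
        best = max(pool, key=gain)
        summary.append(best)
        covered |= best["concepts"]
        total += best["length"]

# Toy input: three sentences annotated with the concepts they mention.
concept_weight = {"storm": 3, "deaths": 2, "traffic": 1}
sentences = [
    {"id": 0, "concepts": {"storm", "traffic"}, "length": 8},
    {"id": 1, "concepts": {"storm"}, "length": 5},   # redundant with sentence 0
    {"id": 2, "concepts": {"deaths"}, "length": 6},
]
print([s["id"] for s in greedy_coverage(sentences, concept_weight, budget=15)])  # → [0, 2]
```

Sentence 1 is never selected: once sentence 0 covers "storm", it adds no new concept weight.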

Recent years have also seen considerable interest in neural approaches to summarization. In particular, neural extractive approaches focus on learning vector representations of source sentences; based on these representations, they determine whether a source sentence is to be included in the summary (Cheng and Lapata, 2016; Yasunaga et al., 2017; Nallapati et al., 2017; Narayan et al., 2018). Neural abstractive approaches usually include an encoder that converts the entire source document to a continuous vector, and a decoder that generates an abstract word by word conditioned on the document vector (Paulus et al., 2017; Tan et al., 2017; Guo et al., 2018; Kedzie et al., 2018). These neural models, however, require large training sets containing hundreds of thousands to millions of examples, which are still unavailable for the multi-document summarization task. To date, most neural summarization studies are performed on single-document summarization.

Extracting summary-worthy sentences from the source documents is important even if the ultimate goal is to generate abstracts. Recent abstractive studies recognize the importance of separating "salience estimation" from "text generation" so as to reduce the amount of training data required by encoder-decoder models (Gehrmann et al., 2018; Lebanoff et al., 2018, 2019). An extractive method is often leveraged to identify salient source sentences, then a neural text generator rewrites the selected sentences into an abstract. Our pursuit of the DPP method is especially meaningful in this context. As described in the next section, DPP has an extraordinary ability to distinguish redundant descriptions, thereby avoiding passing redundant content to the abstractor, which can cause an encoder-decoder model to fail.

3 The DPP Framework

Let $\mathcal{Y} = \{1, 2, \cdots, N\}$ be a ground set containing $N$ items, corresponding to all sentences of the source documents. Our goal is to identify a subset of items $Y \subseteq \mathcal{Y}$ that forms an extractive summary of the document set. A determinantal point process (DPP; Kulesza and Taskar, 2012) defines a probability measure over all subsets of $\mathcal{Y}$ such that

$$P(Y; L) = \frac{\det(L_Y)}{\det(L + I)}, \qquad (1)$$

$$\sum_{Y \subseteq \mathcal{Y}} \det(L_Y) = \det(L + I), \qquad (2)$$

where $\det(\cdot)$ is the determinant of a matrix; $I$ is the identity matrix; $L \in \mathbb{R}^{N \times N}$ is a positive semidefinite matrix, known as the L-ensemble; $L_{ij}$ measures the correlation between sentences $i$ and $j$; and $L_Y$ is a submatrix of $L$ containing only entries indexed by elements of $Y$. Finally, the probability of an extractive summary $Y \subseteq \mathcal{Y}$ is proportional to the determinant of the matrix $L_Y$ (Eq. (1)).

Kulesza and Taskar (2012) provide a decomposition of the L-ensemble matrix: $L_{ij} = q_i \cdot S_{ij} \cdot q_j$, where $q_i \in \mathbb{R}_+$ is a positive real number indicating the quality of a sentence, and $S_{ij}$ is a measure of similarity between sentences $i$ and $j$. This formulation separately models the sentence quality and pairwise similarity before combining them into a unified model. Let $Y = \{i, j\}$ be a summary containing only two sentences $i$ and $j$; its probability $P(Y; L)$ can be computed as

$$P(Y = \{i, j\}; L) \propto \det(L_Y) = \begin{vmatrix} q_i S_{ii} q_i & q_i S_{ij} q_j \\ q_j S_{ji} q_i & q_j S_{jj} q_j \end{vmatrix} = q_i^2 \cdot q_j^2 \cdot (1 - S_{ij}^2). \qquad (3)$$

Eq. (3) indicates that, if sentence $i$ is of high quality, denoted by $q_i$, then any summary containing it will have high probability. If two sentences $i$ and $j$ are similar to each other, denoted by $S_{ij}$, then any summary containing both sentences will have low probability. The summary $Y$ achieving the highest probability thus should contain a set of high-quality sentences while maintaining high diversity among the selected sentences (via pairwise repulsion). $\det(L_Y)$ also has a particular geometric interpretation as the squared volume of the space spanned by sentence vectors $i$ and $j$, where the quality measure indicates the length of the vector and the similarity indicates the angle between two vectors (Figure 1).
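This determinant identity is easy to verify numerically. The sketch below is our own illustration (not code from the paper): it builds the 2×2 L-ensemble for a two-sentence summary with made-up quality and similarity values, and checks that det(L_Y) matches the closed form q_i² q_j² (1 − S_ij²).

```python
import numpy as np

def l_ensemble(q, S):
    """Build L_ij = q_i * S_ij * q_j from quality scores q and similarity matrix S."""
    q = np.asarray(q, dtype=float)
    return np.outer(q, q) * S

# Two sentences with qualities q_i, q_j and similarity S_ij (S_ii = S_jj = 1).
q = [2.0, 3.0]
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])

L_Y = l_ensemble(q, S)
det = np.linalg.det(L_Y)
closed_form = q[0]**2 * q[1]**2 * (1 - S[0, 1]**2)
print(det, closed_form)  # both ≈ 27.0
```

Raising S_ij toward 1 shrinks the determinant toward 0: near-duplicate sentences span almost no volume, so summaries containing both become very unlikely.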

Figure 1: The DPP model specifies the probability of a summary $P(Y = \{i, j\}; L)$ to be proportional to the squared volume of the space spanned by sentence vectors $i$ and $j$.

We adopt a feature-based approach to compute sentence quality: $q_i = \exp(\theta^\top x_i)$. In particular, $x_i$ is a feature vector for sentence $i$ and $\theta$ are the feature weights to be learned during training. Kulesza and Taskar (2011) define sentence similarity as $S_{ij} = \phi_i^\top \phi_j$, where $\phi_i$ (with $\|\phi_i\|_2 = 1$ for all $i$) is a sentence TF-IDF vector. The model parameters $\theta$ are optimized by maximizing the log-likelihood of the training data (Eq. (4)); this objective can be optimized efficiently with subgradient descent.²

$$\theta = \arg\max_\theta \sum_{m=1}^{M} \log P\big(Y^{(m)}; L(\mathcal{Y}^{(m)}; \theta)\big) \qquad (4)$$

During training, we create the ground-truth extractive summary ($Y$) for a document set based on human reference summaries (abstracts) using the following procedure. At each iteration we select a source sentence sharing the longest common subsequence with the human reference summaries; the shared words are then removed from the human summaries to avoid duplicates in future selection. Similar methods are exploited by Nallapati et al. (2017) and Narayan et al. (2018) to create ground-truth extractive summaries. At test time, we perform inference using the learned DPP model to obtain a system summary ($Y$). We implement a greedy method (Kulesza and Taskar, 2012) to iteratively add a sentence to the summary so that $P(Y; L)$ yields the highest probability (Eq. (1)), until a summary length limit is reached.
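The greedy inference step can be sketched as below. This is a simplified illustration: the quality scores, similarity values, sentence lengths, and budget are made up, whereas the real system uses learned qualities and learned similarities.

```python
import numpy as np

def greedy_dpp(L, lengths, budget):
    """Greedily grow Y, at each step adding the sentence that maximizes
    det(L_Y), subject to a total summary-length budget."""
    selected, total = [], 0
    candidates = set(range(L.shape[0]))
    while candidates:
        best, best_det = None, -1.0
        for i in candidates:
            if total + lengths[i] > budget:
                continue
            Y = selected + [i]
            d = np.linalg.det(L[np.ix_(Y, Y)])  # det of the submatrix L_Y
            if d > best_det:
                best, best_det = i, d
        if best is None:        # nothing fits the remaining budget
            break
        selected.append(best)
        total += lengths[best]
        candidates.remove(best)
    return selected

# Toy example: sentences 0 and 1 are near-duplicates; sentence 2 is distinct.
q = np.array([3.0, 2.9, 2.0])
S = np.array([[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
L = np.outer(q, q) * S
print(greedy_dpp(L, lengths=[10, 10, 10], budget=20))  # → [0, 2]
```

Despite sentence 1's high quality, the repulsion term (1 − S²) makes any summary containing both 0 and 1 improbable, so the diverse sentence 2 is chosen instead.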

For the DPP framework to be successful, the sentence similarity measure ($S_{ij}$) has to accurately capture whether any two sentences contain redundant information. This is especially important for multi-document summarization, as redundancy is ubiquitous in source documents. The source descriptions frequently contain redundant yet lexically diverse expressions, such as sentential paraphrases where people write about the same event using distinct styles (Hu et al., 2019). Without accurately modelling sentence similarity, redundant content can make its way into the summary and further prevent useful information from being included given the summary length limit. The existing cosine similarity measure between sentence TF-IDF vectors can be inadequate for modeling semantic relatedness. In the following section we exploit the recently introduced capsule networks (Hinton et al., 2018) to measure pairwise sentence similarity; the measure considers whether two sentences share any words in common and, more importantly, the semantic closeness of the sentence descriptions.

² The sentence features include the length and position of a sentence, and the cosine similarity between sentence and document TF-IDF vectors (Kulesza and Taskar, 2011). We refrain from using sophisticated features to avoid model overfitting.

4 An Improved Similarity Measure

Our goal is to develop an advanced similarity measure for pairs of sentences such that semantically similar sentences can receive high scores even when they have very few words in common. For example, "Snowstorm slams eastern US on Friday" and "A strong wintry storm was dumping snow in eastern US after creating traffic havoc that claimed at least eight lives" have only two words in common. Nonetheless, they contain redundant information and cannot both be included in the summary.

Let $\{x^a, x^b\} \in \mathbb{R}^{E \times L}$ denote two sentences $a$ and $b$. Each consists of a sequence of word embeddings, where $E$ is the embedding size and $L$ is the sentence length, with zero-padding to the right for shorter sentences. A convolutional layer with multiple filter sizes is first applied to each sentence to extract local features (Eq. (5)), where $x^a_{i:i+k-1} \in \mathbb{R}^{kE}$ denotes a flattened embedding for position $i$ with a filter size $k$, and $u^a_{i,k} \in \mathbb{R}^d$ is the resulting local feature for position $i$; $f$ is a nonlinear activation function (e.g., ReLU); $\{W^u, b^u\}$ are model parameters.

$$u^a_{i,k} = f(W^u x^a_{i:i+k-1} + b^u) \qquad (5)$$

We use $u^a_i \in \mathbb{R}^D$ to denote the concatenation of local features generated using various filter sizes. Following Kim et al. (2014), we employ filter sizes $k \in \{3, 4, 5, 6, 7\}$ with an equal number of filters ($d$) for each size ($D = 5d$). After applying max-pooling to the local features of all positions, we obtain a representation $u^a = \text{max-pooling}(u^a_i) \in \mathbb{R}^D$ for sentence $a$; similarly, we obtain $u^b \in \mathbb{R}^D$ for sentence $b$. It is not uncommon for state-of-the-art sentence similarity classifiers (Chen et al., 2018) to concatenate the two sentence vectors, their absolute difference, and their element-wise product $[u^a; u^b; |u^a - u^b|; u^a \circ u^b]$, and feed this representation to a fully connected layer to predict if two sentences are similar.
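The matching-feature construction is a one-liner. In the sketch below the pooled CNN outputs u_a and u_b are replaced with random vectors, so only the concatenation logic is illustrated; the dimension D is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                            # stands in for the pooled feature size D = 5d
u_a = rng.standard_normal(D)     # pooled representation of sentence a
u_b = rng.standard_normal(D)     # pooled representation of sentence b

# [u_a; u_b; |u_a - u_b|; u_a ∘ u_b]: the vectors, their absolute
# difference, and their element-wise product, concatenated.
match = np.concatenate([u_a, u_b, np.abs(u_a - u_b), u_a * u_b])
print(match.shape)  # (32,) = 4 * D, fed to a fully connected layer
```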

Nevertheless, we conjecture that such a representation may be insufficient to fully characterize the relationship between components of the sentences in order to model sentence similarity. For example, the term "snowstorm" in sentence $a$ is semantically related to "wintry storm" and "dumping snow" in sentence $b$; this low-level interaction indicates that the two sentences contain redundant information, and it cannot be captured by the above model. Importantly, the capsule networks proposed by Hinton et al. (2018) are designed to characterize the spatial and orientational relationships between low-level components. We thus seek to exploit CapsNet to strengthen the capability of our system for identifying redundant sentences.

Figure 2: The system architecture utilizing CapsNet for predicting sentence similarity.

Let $\{u^a_i, u^b_i\}_{i=1}^{L} \in \mathbb{R}^D$ be low-level representations (i.e., capsules). We seek to transform them to high-level capsules $\{v_j\}_{j=1}^{M} \in \mathbb{R}^B$ that characterize the interaction between low-level components. Each low-level capsule $u_i \in \mathbb{R}^D$ is multiplied by a linear transformation matrix to dedicate a portion of it, denoted by $u_{j|i} \in \mathbb{R}^B$, to the construction of a high-level capsule $j$ (Eq. (6)), where $\{W^v_{ij}\} \in \mathbb{R}^{D \times B}$ are model parameters. To reduce parameters and prevent overfitting, we further encourage sharing parameters over all low-level capsules, yielding $W^v_{1j} = W^v_{2j} = \cdots$; the same parameter sharing is described in (Zhao et al., 2018). By computing the weighted sum of $u_{j|i}$, whose weights $c_{ij}$ indicate the strength of interaction between a low-level capsule $i$ and a high-level capsule $j$, we obtain an (unnormalized) capsule (Eq. (7)); we then apply a nonlinear squash function $g(\cdot)$ to normalize the length of the vector to be less than 1, yielding $v_j \in \mathbb{R}^B$.

$$u_{j|i} = W^v_{ij} u_i \qquad (6)$$

$$v_j = g\Big(\sum_i c_{ij} u_{j|i}\Big) \qquad (7)$$

Routing (Sabour et al., 2017; Zhao et al., 2019) aims to adjust the interaction weights ($c_{ij}$) using an iterative, EM-like method. Initially, we set $\{b_{ij}\}$ to be zero for all $i$ and $j$. Per Eq. (8), $c_i$ becomes a uniform distribution, indicating that a low-level capsule $i$ contributes equally to all its upper-level capsules. After computing $u_{j|i}$ and $v_j$ using Eqs. (6-7), the weights $b_{ij}$ are updated according to the strength of interaction (Eq. (9)). If $u_{j|i}$ agrees with a capsule $v_j$, their interaction weight will be increased, and decreased otherwise. This process is repeated for $r$ iterations to stabilize $c_{ij}$.

$$c_i \leftarrow \mathrm{softmax}(b_i) \qquad (8)$$

$$b_{ij} \leftarrow b_{ij} + u_{j|i} \cdot v_j \qquad (9)$$
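A minimal numpy sketch of the routing loop in Eqs. (6)-(9), assuming the parameter sharing described above (one transformation matrix per high-level capsule $j$, shared over all low-level capsules $i$). The dimensions, the iteration count r = 3, and the random inputs are illustrative, not the trained model's values.

```python
import numpy as np

def squash(v, eps=1e-9):
    """Nonlinear g(.): rescale each vector so its norm is < 1, keeping direction."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return (norm**2 / (1.0 + norm**2)) * v / (norm + eps)

def route(u, W, r=3):
    """u: (L, D) low-level capsules; W: (M, D, B), with W[j] shared across
    all low-level capsules i. Returns (M, B) high-level capsules v_j."""
    u_hat = np.einsum('ld,jdb->ljb', u, W)        # Eq. (6): u_{j|i}, shape (L, M, B)
    b = np.zeros(u_hat.shape[:2])                 # routing logits b_ij, init to zero
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Eq. (8): softmax over j
        v = squash((c[..., None] * u_hat).sum(axis=0))        # Eq. (7): (M, B)
        b = b + np.einsum('ljb,jb->lj', u_hat, v)             # Eq. (9): agreement update
    return v

rng = np.random.default_rng(0)
v = route(rng.standard_normal((6, 4)), rng.standard_normal((5, 4, 3)))
print(v.shape)  # (5, 3); every capsule has norm < 1 after squashing
```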

The high-level capsules $\{v_j\}_{j=1}^{M}$ effectively encode spatial and orientational relationships of low-level capsules. To identify the most prominent interactions, we apply max-pooling to all high-level capsules to produce $v = \text{max-pooling}_j(v_j) \in \mathbb{R}^B$. This representation $v$, aimed to encode interactions between sentences $a$ and $b$, is concatenated with $[u^a; u^b; |u^a - u^b|; u^a \circ u^b]$ and binary vectors $[z^a; z^b]$ that indicate if any word in sentence $a$ appears in sentence $b$ and vice versa; they are used as input to a fully connected layer to predict if a pair of sentences contain redundant information. Our loss function contains two components: a binary cross-entropy loss indicating whether the prediction is correct, and a reconstruction loss for reconstructing sentence $a$ conditioned on $u^a$ by predicting one word at a time using a recurrent neural network (and similarly for sentence $b$). A hyperparameter $\lambda$ is used to balance the contributions of the two. In Figure 2 we present an overview of the system architecture; hyperparameters are described in the supplementary material.

Page 6: arXiv:1906.00072v1 [cs.CL] 31 May 2019 · sentence (dis)similarity, then leverage DPP to ob-tain a set of diverse summary sentences. Tradition-ally, the DPP method computes similarity

5 Datasets

To the best of our knowledge, there is no dataset focusing on determining whether two sentences contain redundant information. It is a nontrivial task in the context of multi-document summarization. Further, we argue that the task should be distinguished from other semantic similarity tasks: semantic textual similarity (STS; Cer et al., 2017) assesses to what degree two sentences are semantically equivalent to each other; natural language inference (NLI; Bowman et al., 2015) determines if one sentence (the "hypothesis") can be semantically inferred from the other sentence (the "premise"). Nonetheless, redundant sentences found in a set of source documents discussing a particular topic are not necessarily semantically equivalent, nor do they necessarily express an entailment relationship. We compare different datasets in §6.

Sentence redundancy dataset. A novel dataset containing over 2 million sentence pairs is introduced in this paper for sentence redundancy prediction. We hypothesize that a summary sentence and its most similar source sentence are likely to contain redundant information. Because humans create summaries using generalization, paraphrasing, and other high-level text operations, a summary sentence and its source sentence can be semantically similar yet contain diverse expressions. Fortunately, such source/summary sentence pairs can be conveniently derived from single-document summarization data. We analyze the CNN/Daily Mail dataset (Hermann et al., 2015), which contains a massive collection of single news articles and their human-written summaries. For each summary sentence, we identify its most similar source sentence by calculating the averaged R-1, R-2, and R-L F-scores (Lin, 2004) between the source and summary sentences. We consider a summary sentence to have no match if the score is lower than a threshold. We obtain negative examples by randomly sampling two sentences from a news article. In total, our training / dev / test sets contain 2,084,798 / 105,936 / 86,144 sentence pairs, and we make the dataset available to advance research on sentence redundancy.
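The pairing step can be approximated as follows. This sketch substitutes simple unigram and bigram F-scores for the ROUGE package (and omits R-L), and the threshold value is illustrative rather than the one used to build the dataset.

```python
def ngram_f1(ref, hyp, n=1):
    """F-score over n-gram overlap between two token lists (a rough R-n proxy)."""
    grams = lambda t: [tuple(t[i:i + n]) for i in range(len(t) - n + 1)]
    r, h = grams(ref), grams(hyp)
    if not r or not h:
        return 0.0
    overlap = sum(min(r.count(g), h.count(g)) for g in set(r))
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def best_source_match(summary_sent, source_sents, threshold=0.2):
    """Pair a summary sentence with its most similar source sentence, scored
    by averaged unigram/bigram F-scores; None if the best score is too low."""
    scored = [((ngram_f1(summary_sent.split(), s.split(), 1)
                + ngram_f1(summary_sent.split(), s.split(), 2)) / 2, s)
              for s in source_sents]
    score, sent = max(scored)
    return sent if score >= threshold else None

src = ["Snowstorm slams eastern US on Friday .",
       "Officials urged residents to stay indoors ."]
print(best_source_match("A snowstorm hit the eastern US on Friday .", src))
# → 'Snowstorm slams eastern US on Friday .'
```

Matched pairs become positive examples; negative examples are drawn by randomly sampling two sentences from the same article.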

Summarization datasets. We evaluate our DPP-based system on benchmark multi-document summarization datasets. The task is to create a succinct summary with up to 100 words from a cluster of 10 news articles discussing a single topic. The DUC and TAC datasets (Over and Yen, 2004; Dang and Owczarzak, 2008) have been used in previous summarization competitions. In this paper we use the DUC-03/04 and TAC-08/09/10/11 datasets, which contain 60/50/48/44/46/44 document clusters, respectively. Four human reference summaries have been created for each document cluster by NIST assessors. System summaries are evaluated against human reference summaries using the ROUGE software (Lin, 2004)³, where R-1, R-2, and R-SU4 respectively measure the overlap of unigrams, bigrams, and unigrams plus skip bigrams with a maximum distance of 4 words. We report results on DUC-04 (trained on DUC-03) and TAC-11 (trained on TAC-08/09/10), which are often used as standard test sets (Hong et al., 2014).

System                                R-1    R-2    R-SU4
Opinosis (Ganesan et al., 2010)       27.07  5.03   8.63
Extract+Rewrite (Song et al., 2018)   28.90  5.33   8.76
Pointer-Gen (See et al., 2017)        31.43  6.03   10.01
SumBasic (Vanderwende et al., 2007)   29.48  4.25   8.64
KLSumm (Haghighi et al., 2009)        31.04  6.03   10.23
LexRank (Erkan and Radev, 2004)       34.44  7.11   11.19
Centroid (Hong et al., 2014)          35.49  7.80   12.02
ICSISumm (Gillick and Favre, 2009)    37.31  9.36   13.12
DPP (Kulesza and Taskar, 2011)†       38.10  9.14   13.40
DPP-Capsnet (this work)               38.25  9.22   13.40
DPP-Combined (this work)              39.35  10.14  14.15

Table 1: ROUGE results on the DUC-04 dataset. † indicates our reimplementation of the Kulesza and Taskar (2011) system.

6 Experimental Results

In this section we discuss the results we obtained for multi-document summarization and for determining redundancy between sentences.

6.1 Summarization Results

We compare our system with a number of strong summarization baselines (Tables 1 and 2). In particular, SumBasic (Vanderwende et al., 2007) is an extractive approach assuming that words occurring frequently in a document cluster are more likely to be included in the summary; KL-Sum (Haghighi and Vanderwende, 2009) is a greedy approach that adds a sentence to the summary to minimize KL divergence; and LexRank (Erkan and Radev, 2004) is a graph-based approach computing sentence importance based on eigenvector centrality.

We additionally consider abstractive baselinesto illustrate how well these systems perform on

3w/ options -n 2 -m -w 1.2 -c 95 -r 1000 -l 100

Page 7: arXiv:1906.00072v1 [cs.CL] 31 May 2019 · sentence (dis)similarity, then leverage DPP to ob-tain a set of diverse summary sentences. Tradition-ally, the DPP method computes similarity

TAC-11System R-1 R-2 R-SU4

Opinosis (Ganesan et al., 2010) 25.15 5.12 8.12Extract+Rewrite (Song et al., 2018) 29.07 6.11 9.20Pointer-Gen (See et al., 2017) 31.44 6.40 10.20SumBasic (Vanderwende et al., 2007) 31.58 6.06 10.06KLSumm (Haghighi et al., 2009) 31.23 7.07 10.56LexRank (Erkan and Radev, 2004) 33.10 7.50 11.13DPP (Kulesza and Taskar, 2011)† 36.95 9.83 13.57DPP-Capsnet (this work) 36.61 9.30 13.09DPP-Combined (this work) 37.30 10.13 13.78

Table 2: ROUGE results on the TAC-11 dataset.

multi-document summarization: Opinosis (Ganesan et al., 2010) focuses on creating a word co-occurrence graph from the source documents and searching for salient graph paths to create an abstract; Extract+Rewrite (Song et al., 2018) selects sentences using LexRank and condenses each sentence to a title-like summary; Pointer-Gen (See et al., 2017) seeks to generate abstracts by copying words from the source documents and generating novel words not present in the source text.

Our DPP-based framework belongs to a strand of optimization-based methods. In particular, ICSISumm (Gillick and Favre, 2009) formulates extractive summarization as integer linear programming; it identifies a globally-optimal set of sentences covering the most important concepts of the source documents. DPP (Kulesza and Taskar, 2011) selects an optimal set of sentences that are representative of the source documents and have maximum diversity, as determined by the determinantal point process. Gong et al. (2014) show that the DPP performs well on summarizing both text and video.

We experiment with several variants of the DPP model: DPP-Capsnet computes the similarity between sentences (S_ij) using the CapsNet described in Sec. §4 and trained on our newly-constructed sentence redundancy dataset, whereas the default DPP framework computes sentence similarity as the cosine similarity of sentence TF-IDF vectors. DPP-Combined linearly combines the cosine similarity with the CapsNet output using an interpolation coefficient determined on the dev set⁴.
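The interpolation in DPP-Combined can be sketched as follows. Here `capsnet_score` is a stand-in for the trained network's output (hypothetical in this sketch), and λ = 0.2 is the dev-set value reported in footnote 4 for DUC-04; the TF-IDF weights below are invented.

```python
import math

def tfidf_cosine(a, b):
    """Cosine similarity of two sparse TF-IDF vectors (dicts: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_sim(a, b, capsnet_score, lam=0.2):
    """Linearly interpolate the CapsNet similarity with TF-IDF cosine."""
    return lam * capsnet_score + (1.0 - lam) * tfidf_cosine(a, b)

# Two sentences with no word overlap but (hypothetically) high semantic similarity:
# the cosine term is 0, so the CapsNet term alone contributes.
a = {"recall": 0.7, "infant": 0.3}
b = {"withdrawal": 0.6, "baby": 0.4}
score = combined_sim(a, b, capsnet_score=0.9, lam=0.2)  # 0.2*0.9 + 0.8*0.0 = 0.18
```

The design point is exactly this complementarity: the cosine term rewards shared surface words, while the learned term can fire on lexically disjoint paraphrases.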

Tables 1 and 2 present the summarization results we have obtained for the DUC-04 and TAC-11 datasets. Our DPP methods outperform both extractive and abstractive baselines, indicating the effectiveness of optimization-based methods for extractive multi-document summarization. The DPP optimizes summary sentence selection to maximize content coverage and diversity, expressed as the squared volume of the space spanned by the selected sentences.

⁴ The CapsNet coefficient λ_c is set to 0.2 and 0.1 for the DUC-04 and TAC-11 datasets, respectively.
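The squared-volume objective can be approximated with the standard greedy MAP procedure for DPPs (exact MAP inference is NP-hard). The sketch below uses an invented toy L-kernel purely for illustration; items 0 and 1 are near-duplicates, item 2 is distinct but slightly lower quality.

```python
def det(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            result = -result
        result *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return result

def greedy_dpp(L, k):
    """Greedily add the item that yields the largest det(L_S) -- the squared
    volume spanned by the selected items (high quality, high diversity)."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(L)) if i not in selected]
        if not candidates:
            break
        gains = {i: det([[L[r][c] for c in selected + [i]]
                         for r in selected + [i]]) for i in candidates}
        best = max(candidates, key=gains.get)
        if gains[best] <= 0:
            break
        selected.append(best)
    return selected

# Toy L-kernel: L_ij = q_i * q_j * S_ij (quality times similarity).
q = [1.0, 1.0, 0.8]
S = [[1.0, 0.95, 0.1],
     [0.95, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
L = [[q[i] * q[j] * S[i][j] for j in range(3)] for i in range(3)]
picked = greedy_dpp(L, 2)  # picks item 0, then skips the near-duplicate 1
```

Because near-duplicate rows make the submatrix nearly singular, the determinant (squared volume) collapses for redundant picks, which is why the DPP trades a little quality for diversity here.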

Further, we observe that the DPP system with combined similarity metrics yields the highest performance, achieving 10.14% and 10.13% F-scores respectively on DUC-04 and TAC-11. This finding suggests that the cosine similarity of sentence TF-IDF vectors and the CapsNet semantic similarity successfully complement each other to provide the best overall estimate of sentence redundancy. A close examination of the system outputs reveals that important topical words (e.g., “$3 million”) that are frequently discussed in the document cluster can be crucial for determining sentence redundancy, because sentences sharing the same topical words are more likely to be considered redundant. While neural models such as the CapsNet rarely explicitly model word frequencies, the TF-IDF sentence representation is highly effective in capturing topical terms.

In Table 3 we show example system summaries and a human-written reference summary. We observe that LexRank tends to extract long and comprehensive sentences that yield high graph centrality; the abstractive pointer-generator networks, despite the promising results, can sometimes fail to generate meaningful summaries (e.g., “a third of all 3-year-olds · · · have been given to a child”). In contrast, our DPP method is able to select a balanced set of representative and diverse summary sentences. We next compare several semantic similarity datasets to gain a better understanding of modeling sentence redundancy for summarization.

6.2 Sentence Similarity

We compare three standard datasets used for semantic similarity tasks: SNLI (Bowman et al., 2015), used for natural language inference; STS-Benchmark (Cer et al., 2017), for semantic equivalence; and our newly-constructed Src-Summ sentence pairs. Details are presented in Table 4.

We observe that CapsNet achieves the highest prediction accuracy of 94.8% on the Src-Summ dataset, and it yields similar performance on SNLI, indicating the effectiveness of CapsNet at characterizing semantic similarity. STS appears to be a more challenging task, where CapsNet yields 64.7% accuracy. Note that we perform two-way


LexRank Summary

• The official, Dr. Charles J. Ganley, director of the office of nonprescription drug products at the Food and Drug Administration, said in an interview that the agency was “revisiting the risks and benefits of the use of these drugs in children” and that “we’re particularly concerned about the use of these drugs in children less than 2 years of age.”

• The Consumer Healthcare Products Association, an industry trade group that has consistently defended the safety of pediatric cough and cold medicines, recommended in its own 156-page safety review, also released Friday, that the FDA consider mandatory warning labels saying that they should not be used in children younger than two.

• Major makers of over-the-counter infant cough and cold medicines announced Thursday that they were voluntarily withdrawing their products from the market for fear that they could be misused by parents.

Pointer-Gen Summary

• Dr. Charles Ganley, a top food and drug administration official, said the agency was “revisiting the risks and benefits of the use of these drugs in children,” the director of the FDA’s office of nonprescription drug products.

• The FDA will formally consider revising labeling at a meeting scheduled for Oct. 18-19.

• The withdrawal comes two weeks after reviewing reports of side effects over the last four decades, a 1994 study found that more than a third of all 3-year-olds in the United States were estimated to have been given to a child.

DPP-Combined Summary

• Johnson & Johnson on Thursday voluntarily recalled certain infant cough and cold products, citing “rare” instances of misuse leading to overdoses.

• Federal drug regulators have started a broad review of the safety of popular cough and cold remedies meant for children, a top official said Thursday.

• Safety experts for the Food and Drug Administration urged the agency on Friday to consider an outright ban on over-the-counter, multi-symptom cough and cold medicines for children under 6.

• Major makers of over-the-counter infant cough and cold medicines announced Thursday that they were voluntarily withdrawing their products from the market for fear that they could be misused by parents.

Human Reference Summary

• On March 1, 2007, the Food/Drug Administration (FDA) started a broad safety review of children’s cough/cold remedies.

• They are particularly concerned about use of these drugs by infants.

• By September 28th, the 356-page FDA review urged an outright ban on all such medicines for children under six.

• Dr. Charles Ganley, a top FDA official, said “We have no data on these agents of what’s a safe and effective dose in children.” The review also stated that between 1969 and 2006, 123 children died from taking decongestants and antihistamines.

• On October 11th, all such infant products were pulled from the markets.

Table 3: Example system summaries and the human reference summary. LexRank extracts long and comprehensive sentences that yield high graph centrality. Pointer-Gen (abstractive) has difficulty generating faithful summaries (see the last bullet: “all 3-year-olds ... have been given to a child”). DPP is able to select a balanced set of representative and diverse sentences.

Dataset                               Train      Dev     Test   Accu.
STS-Benchmark (Cer et al., 2017)      5,749    1,500    1,379   64.7%
SNLI (Bowman et al., 2015)          366,603    6,607    6,605   93.0%
Src-Summ Pairs (this work)        2,084,798  105,936   86,144   94.8%

Table 4: Several sentence similarity datasets, and the performance of CapsNet on them. SNLI discriminates between entailment and contradiction; for STS, CapsNet is pretrained using Src-Summ pairs and fine-tuned on the train split due to the dataset’s small size.

classification on SNLI to discriminate entailment and contradiction. The STS dataset is too small to be used to train CapsNet without overfitting; we thus pre-train the model on Src-Summ pairs and use the train split of STS to fine-tune parameters.

Table 5 shows example positive and negative sentence pairs from the STS, SNLI, and Src-Summ datasets. The STS and SNLI datasets are constructed by human annotators to test a system’s capability of learning sentence representations. The sentences can share very few words in common but still express an entailment relationship (positive); or the sentences can share a lot of words in common yet be semantically distinct (negative). These cases are usually not seen in summarization datasets containing clusters of documents discussing single topics. The Src-Summ dataset successfully strikes a balance between sharing common words yet containing diverse expressions. It is thus a good fit for training classifiers to detect sentence redundancy.

STS-Benchmark (a) Four girls happily walk down a sidewalk. (b) Three young girls walk down a sidewalk. ✗

SNLI (a) 3 young man in hoods standing in the middle of a quiet street facing the camera. (b) Three hood wearing people pose for a picture. ✓

Src-Summ Pairs (a) He ended up killing five girls and wounding five others before killing himself. (b) Nearly four months ago, a milk delivery-truck driver lined up 10 girls in a one-room schoolhouse in this Amish farming community and opened fire, killing five of them and wounding five others before turning the gun on himself. ✓

Table 5: Example positive (✓) and negative (✗) sentence pairs from the semantic similarity datasets.

Figure 3 compares heatmaps generated by computing the cosine similarity of sentence TF-IDF vectors (Cosine), and by training CapsNet on the SNLI and Src-Summ datasets respectively. We find that the Cosine similarity scores are relatively strict, as a vast majority of sentence pairs are assigned zero similarity, because these sentences have no word overlap. At the other extreme, CapsNet+SNLI labels a large quantity of sentence pairs as false positives, because its training data frequently contain sentences that share few words in common but nonetheless are positive, i.e., expressing an entailment relationship. The similarity scores generated by CapsNet+SrcSumm are more moderate compared to CapsNet+SNLI and Cosine, suggesting the appropriateness of using Src-Summ sentence pairs for estimating sentence redundancy.

Figure 3: Heatmaps for topic D31008 of DUC-04 (cropped to 200 sentences), showing the cosine similarity of sentence TF-IDF vectors (Cosine, left), and the CapsNet output trained respectively on Src-Summ (middle) and SNLI (right) datasets. The short off-diagonal lines are near-identical sentences found in the document cluster.

7 Conclusion

We strengthen a DPP-based multi-document summarization system with an improved similarity measure inspired by capsule networks for determining sentence redundancy. We show that redundant sentences not only have common words but can also be semantically similar with little word overlap. Both aspects should be modeled in calculating pairwise sentence similarity. Our system yields competitive results on benchmark datasets, surpassing strong summarization baselines.

Acknowledgments

The authors are grateful to the reviewers for their insightful feedback. We would also like to extend our thanks to Boqing Gong, Xiaodan Zhu and Fei Sha for useful discussions.

References

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL.

Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proceedings of the Text Analysis Conference (TAC).

Hal Daume III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL).

Ying Ding and Jing Jiang. 2015. Towards opinion summarization from online forums. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP).

Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the Association for Computational Linguistics (ACL).

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.

Katja Filippova, Enrique Alfonseca, Carlos Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).


Dimitrios Galanis and Ion Androutsopoulos. 2010. An extractive supervised two-stage method for sentence compression. In Proceedings of NAACL-HLT.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the International Conference on Computational Linguistics (COLING).

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the NAACL Workshop on Integer Linear Programming for Natural Language Processing.

Boqing Gong, Wei-Lun Chao, Kristen Grauman, and Fei Sha. 2014. Diverse sequential subset selection for supervised video summarization. In Proceedings of Neural Information Processing Systems (NIPS).

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Soft, layer-specific multi-task summarization with entailment and question generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Neural Information Processing Systems (NIPS).

Geoffrey Hinton, Sara Sabour, and Nicholas Frosst. 2018. Matrix capsules with EM routing. In Proceedings of the International Conference on Learning Representations (ICLR).

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda, and Masaaki Nagata. 2013. Single-document summarization as a tree knapsack problem. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Kai Hong, John M. Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).

J. Edward Hu, Rachel Rudinger, Matt Post, and Benjamin Van Durme. 2019. ParaBank: Monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Chris Kedzie, Kathleen McKeown, and Hal Daume III. 2018. Content selection in deep learning models of summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Gunhee Kim, Leonid Sigal, and Eric P. Xing. 2014. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In Proceedings of CVPR.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.

Wojciech Kryscinski, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Alex Kulesza and Ben Taskar. 2011. Learning determinantal point processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

Alex Kulesza and Ben Taskar. 2012. Determinantal Point Processes for Machine Learning. Now Publishers Inc.

Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, and Fei Liu. 2019. Scoring sentence singletons and pairs for abstractive summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).


Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang Weng. 2014. Improving multi-document summarization by sentence compression based on expanded constituent parse tree. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of NAACL.

Wencan Luo and Diane Litman. 2015. Summarizing student responses to reflection prompts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Wencan Luo, Fei Liu, Zitao Liu, and Diane Litman. 2016. Automatic summarization of student course feedback. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Andre F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the ACL Workshop on Integer Linear Programming for Natural Language Processing.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI).

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of SIGNLL.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval.

Paul Over and James Yen. 2004. An introduction to DUC-2004. National Institute of Standards and Technology.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization. In Proceedings of EMNLP.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Chao Shen and Tao Li. 2010. Multi-document summarization via the minimum dominating set. In Proceedings of the International Conference on Computational Linguistics (COLING).

Kaiqiang Song, Lin Zhao, and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the International Conference on Computational Linguistics (COLING).

Hiroya Takamura and Manabu Okumura. 2009. Text summarization model based on maximum coverage problem and its variant. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Sansiri Tarnpradab, Fei Liu, and Kien A. Hua. 2017. Toward extractive summarization of online forum discussions via hierarchical attention networks. In Proceedings of the 30th Florida Artificial Intelligence Research Society Conference (FLAIRS).

Kapil Thadani and Kathleen McKeown. 2013. Sentence compression with joint structural inference. In Proceedings of CoNLL.

Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management, 43(6):1606–1618.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of ACL.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).


Min Yang, Qiang Qu, Ying Shen, Qiao Liu, Wei Zhao, and Jia Zhu. 2018. Aspect and sentiment aware abstractive review summarization. In Proceedings of the International Conference on Computational Linguistics (COLING).

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL).

Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

David Zajic, Bonnie J. Dorr, Jimmy Lin, and Richard Schwartz. 2007. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing and Management.

Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. 2016. Video summarization with long short-term memory. In Proceedings of the European Conference on Computer Vision (ECCV).

Wei Zhao, Haiyun Peng, Steffen Eger, Erik Cambria, and Min Yang. 2019. Towards scalable and generalizable capsule network and its NLP applications. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Soufei Zhang, and Zhou Zhao. 2018. Investigating capsule networks with dynamic routing for text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

A Supplemental Material

In this section we summarize the hyperparameters used for the capsule networks. The embedding size E is set to 300 dimensions; the maximum sentence length L is 44 words; in the convolutional layer we use d=100 filters for each filter size, and there are 5 filter sizes in total: k ∈ {3, 4, 5, 6, 7}. The number of high-level capsules M is set to 12, and the dimension of capsules B is set to 30; both are tuned on the development set. The dynamic routing process is repeated for r=3 iterations, following (Sabour et al., 2017). Further, the coefficient λ for the reconstruction loss term is set to 5e-5. We use a vocabulary of 50K words for reconstructing the sentences; they are the most frequently appearing words of the dataset.
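Collecting the reported hyperparameters in one place, and assuming stride-1 “valid” convolution (the padding scheme is not stated above), the per-filter-size feature-map lengths and the total filter count work out as:

```python
# Hyperparameters as reported above.
E, L = 300, 44                  # embedding size, max sentence length
filter_sizes = [3, 4, 5, 6, 7]  # filter sizes k
d = 100                         # filters per size
M, B, r = 12, 30, 3             # high-level capsules, capsule dim, routing iterations

# Assumption: stride-1 valid convolution, so a size-k filter over a
# 44-word sentence produces a feature map of length L - k + 1.
feature_map_lengths = {k: L - k + 1 for k in filter_sizes}
total_filters = d * len(filter_sizes)  # 500 convolutional filters in total
```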