

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3033–3043, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


    Reading Like HER: Human Reading Inspired Extractive Summarization

Ling Luo¹³, Xiang Ao¹³*, Yan Song², Feiyang Pan¹³, Min Yang⁴, Qing He¹³
¹Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
²Sinovation Ventures
³University of Chinese Academy of Sciences
⁴Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
{luoling18s, aoxiang, panfeiyang, heqing}@ict.ac.cn, [email protected]
*Corresponding author.

    Abstract

In this work, we re-examine the problem of extractive text summarization for long documents. We observe that the process by which humans extract summaries can be divided into two stages: 1) a rough reading stage to look for sketched information, and 2) a subsequent careful reading stage to select key sentences to form the summary. By simulating such a two-stage process, we propose a novel approach for extractive summarization. We formulate the problem as a contextual bandit and solve it with policy gradient. We adopt a convolutional neural network to encode the gist of paragraphs for rough reading, and a decision-making policy with an adapted termination mechanism for careful reading. Experiments on the CNN and DailyMail datasets show that our proposed method provides high-quality summaries of varied length and significantly outperforms state-of-the-art extractive methods in terms of ROUGE metrics.

    1 Introduction

Automatic text summarization is widely used in NLP applications such as producing digests, headlines, and reports. Among supervised methods, two main types are usually explored, namely abstractive and extractive summarization (Nenkova et al., 2011). Compared with abstractive approaches, extractive methods are more practical and applicable as they are faster, simpler, and more reliable with respect to grammar and semantics (Yao et al., 2018).

Recent studies (Cheng and Lapata, 2016; Nallapati et al., 2017; Yasunaga et al., 2017; Feng et al., 2018) treat extractive summarization as a sequence labeling task, where each sentence is processed individually and a decision is made on whether it should be extracted. Various neural networks are used to label each sentence and are trained with a cross-entropy loss to maximize the likelihood of the ground-truth label sequences, which can cause a mismatch between the cross-entropy objective and the evaluation criterion. On the other hand, some reinforcement learning based methods (Wu and Hu, 2018; Narayan et al., 2018; Yao et al., 2018) directly optimize the evaluation metric by combining the cross-entropy loss with rewards and training the model with policy gradient reinforcement learning. The rewards usually reflect the quality of the extracted summary as measured by a standard evaluation protocol. However, these methods still process text sequentially and tend to extract earlier sentences over later ones due to the sequential nature of the selection (Dong et al., 2018).

Although great efforts have been devoted to this field, most existing approaches neglect how human beings read and form summaries. Humans are very good at distilling the main idea of a given text through their reading cognitive process. Since the reading habits of native speakers are varied and hard to model, we adopt the three reading phases of second-language readers, whose behavior follows more predictable patterns. The reading process of second-language readers generally includes pre-reading, reading, and post-reading (Avery and Graves, 1997; Saricoban, 2002; Toprak and Almacıoğlu, 2009; Pressley and Afflerbach, 2012). In the pre-reading stage, they roughly preview the whole text to form an initial cognition and, in the meantime, extract general but coarse-grained information. Based on such prior knowledge, the subsequent reading stage is a conscious process that focuses on target-specific purposes, searching for fine-grained details through repeated skimming and scanning. For post-reading, re-reading is performed to check whether there are



Figure 2: The overall framework of HER, formulated as a contextual bandit and divided into a two-stage process of rough reading and careful reading.

As illustrated in Figure 2, the framework can be divided into two stages: rough reading and careful reading. During rough reading, a document with N sentences is encoded into sentence vectors {S_1, S_2, …, S_N} as well as a set of features denoted as F, which includes one global feature S̄ representing the whole documentary information and K local features {L_1, L_2, …, L_K} depicting paragraphical contents. In careful reading, the sentence vectors are decoded into real-valued scores called sentence affinities {µ_1, µ_2, …, µ_N}, which estimate how strongly each sentence correlates with the document context. Then a bandit policy repeatedly chooses distinct sentences until the termination mechanism is triggered.

We detail the preliminaries in Sec. 2.1, the rough reading stage in Sec. 2.2, and the careful reading stage in Sec. 2.3. The training process is described in Sec. 2.4.

    2.1 Summarization as Contextual-Bandit

Contextual bandit (Langford and Zhang, 2008; Li et al., 2010; Pan et al., 2019a) is the multi-armed bandit (Auer et al., 2002) problem with featured contexts. At each step, the agent observes a context, selects an action based on the context, and then receives a reward. The agent's goal is to quickly find a decision-making policy that maximizes its return.

In extractive summarization, the goal of the task is to extract an undetermined number of sentences from the original document as the summary. We show that it can be formulated as a contextual bandit problem if we select the sentences sequentially. Specifically, for each document, the context includes its documentary representation learned through rough reading. At each step t = 1, 2, …, we define the action as the index of the next sentence to select. The agent keeps selecting sentences until it reaches the termination condition (detailed in Sec. 2.3.3). Finally, the selected sentences SUM = (S_{a_1}, …, S_{a_M}) corresponding to the selected actions (a_1, …, a_M) form an extractive summary. The agent then receives a reward R(SUM; G), where G is the manually-labeled gold-standard summary of the document D, and R(SUM; G) measures the correlation between G and the predicted summary SUM.

2.2 Rough Reading

In rough reading, we aim to form a general cognition of a given document: we encode the document into sentence embeddings and produce the feature set F, which includes global and local features.

Specifically, bidirectional LSTMs (BiLSTMs) at the word and sentence levels are first used to encode a document with N sentences into d_s-dimensional sentence embeddings {S_1, S_2, …, S_N}, S_i ∈ R^{d_s}. Second, a global feature S̄ ∈ R^{d_s} is computed as the average of all sentence vectors. Third, we use a convolutional neural network to extract the gist of different paragraphs and generate multiple local features at the sentence level, unlike previous methods (Kim, 2014; Narayan et al., 2017; Yao et al., 2018) that operate at the word level. In detail, a stacking of N sentence vectors is represented as

S_{1:N} = [S_1, S_2, \dots, S_N] \in \mathbb{R}^{N \times d_s}.  (1)

We apply 1-D convolutional neural networks on S_{1:N}, followed by max-over-time pooling, so that a final document-level representation can be


extracted. Specifically, we use K convolutional nets with K different window sizes to summarize the gists of different paragraphs. Finally, by stacking the outputs together, we obtain the final document-level representation of local features L_{1:K} ∈ R^{K×d_s}.
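To make this concrete, below is a minimal PyTorch sketch of the local net: K 1-D convolutions over the stacked sentence vectors with different window sizes, each followed by max-over-time pooling. The class name, hyperparameter defaults, and layout are our assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class LocalNet(nn.Module):
    # Sketch of the rough-reading local net: K 1-D convolutions with
    # different window sizes over the N stacked sentence vectors, each
    # followed by max-over-time pooling, yielding L_{1:K} in R^{K x d_s}.
    def __init__(self, d_s=400, window_sizes=(1, 2, 3)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_s, d_s, kernel_size=h) for h in window_sizes
        )

    def forward(self, S):          # S: (N, d_s) stacked sentence vectors
        x = S.t().unsqueeze(0)     # (1, d_s, N): channels = embedding dims
        feats = [conv(x).max(dim=2).values.squeeze(0)   # max-over-time
                 for conv in self.convs]
        return torch.stack(feats)  # L_{1:K}: (K, d_s)

With no padding, a window of size 3 requires at least three sentences, which is consistent with the remark in Sec. 3 that documents shorter than three sentences are returned whole because the local features cannot be computed.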

2.3 Careful Reading

Now that we have the sentence vectors and document-level features given by rough reading, we can perform careful reading to select the sentences one by one to form the summary.

2.3.1 Sentence Decoder

In order to extract high-quality summaries, we first compute the sentence affinities with a sentence decoder, which has been shown to be effective in Dong et al. (2018). The sentence affinities are calculated according to the following principles: (1) Salience (sentences whose meanings are close to the central idea should be emphasized); (2) Coverage (sentences that match different paragraphical information should be encouraged); (3) Redundancy (unselected sentences that are similar to already extracted ones should be inhibited).

As we need to learn the relations between each sentence and the rest of the document, we update the sentence representations by

S'_t = S_t \oplus \bar{S} \oplus L_{1:K}, \quad t = 1, \dots, N.  (2)

Then, we utilize a decoder Dec_1 to score the Salience and Coverage of each sentence, and a secondary score function Dec_2 to screen the sentences that might introduce Redundancy. Specifically,

\mu_1 = (\mu_{11}, \dots, \mu_{1N})^\top = \mathrm{Dec}_1(S'_{1:N}),  (3)

\mu_2 = \mathrm{Dec}_2\big(S'_{1:N} \circ (1 - \mu_1)\big).  (4)

We implement Dec_1 and Dec_2 as multi-layered perceptrons. Finally, we average the two scores to obtain the final sentence affinities:

\mu = (\mu_1 + \mu_2)/2.  (5)
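The sketch below puts Eqs. (2)–(5) together in PyTorch. The MLP depth, hidden size, and sigmoid output are assumptions; the paper only states that Dec_1 and Dec_2 are multi-layered perceptrons and that the affinities lie in [0, 1].

import torch
import torch.nn as nn

class AffinityDecoder(nn.Module):
    # Sketch of the careful-reading sentence decoder: concatenate each
    # sentence vector with the global and local features (Eq. 2), score
    # with Dec1 (Eq. 3), re-score the down-weighted input with Dec2
    # (Eq. 4), and average the two scores (Eq. 5).
    def __init__(self, d_s=400, K=3, hidden=200):
        super().__init__()
        d_in = d_s * (2 + K)  # S_t concatenated with S_bar and L_{1:K}
        def mlp():
            return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())
        self.dec1, self.dec2 = mlp(), mlp()

    def forward(self, S, S_bar, L):  # S: (N, d_s), S_bar: (d_s,), L: (K, d_s)
        N = S.size(0)
        ctx = torch.cat([S_bar, L.flatten()])           # shared context
        S_prime = torch.cat([S, ctx.expand(N, -1)], 1)  # Eq. (2)
        mu1 = self.dec1(S_prime).squeeze(1)             # Eq. (3)
        mu2 = self.dec2(S_prime * (1 - mu1).unsqueeze(1)).squeeze(1)  # Eq. (4)
        return (mu1 + mu2) / 2                          # Eq. (5)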

2.3.2 Bandit Policy

The overall decision-making process goes as follows. The agent selects one sentence at each step based on the contextual information provided by rough reading and the sentence affinities computed by the sentence decoder. It stops taking actions once the termination mechanism is triggered. After that, all the selected sentences form a summary, and a final reward is calculated by comparison with the labeled gold summary. As there is no intermediate reward before termination, the goal of the agent is to find a policy that maximizes its expected long-term return.

It is an intuitive choice to select the sentences with the highest affinities as the summary, which is similar to training data selection (Song et al., 2012; Liu et al., 2019). However, such an argmax policy tends to learn only the easy patterns, since it lacks exploration; sentences with low affinities may still be important and should be explored to form the summary as well (we show an example in Section 4.3). Since the search space for summarization is extremely large, we must explicitly address the tradeoff between exploration and exploitation for fast learning, which is an active research area in reinforcement learning and its applications (Pan et al., 2019a,b). In our work, we find that an ε-greedy stochastic policy works well enough to encourage exploration. Specifically, with probability 1 − ε, the agent chooses the sentence following the current policy, i.e., it samples an index a_t ∈ [1, N] from the multinomial distribution with the sentence affinities {µ_1, µ_2, …, µ_N} as probabilities. Otherwise, with a small probability ε, the agent randomly picks one available sentence as exploration. Note that such exploration is only used during training.
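A minimal sketch of this sampling rule in PyTorch, with selection restricted to the not-yet-chosen sentences (function and variable names are ours):

import torch

def select_sentence(mu, available, eps=0.1):
    # eps-greedy bandit policy (training time): with probability eps pick
    # a remaining sentence uniformly at random; otherwise sample from the
    # multinomial distribution over the remaining affinities.
    if torch.rand(()) < eps:                 # explore
        i = torch.randint(len(available), ()).item()
    else:                                    # exploit: sample from policy
        probs = mu[available]
        probs = probs / probs.sum()          # renormalize over the remainder
        i = torch.multinomial(probs, 1).item()
    return available[i]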

2.3.3 Termination Mechanism

In HER, we propose a termination mechanism that is independent of future rewards, which makes our model flexible in extracting summaries with various numbers of sentences. The mechanism is defined as

\text{Done} \sim \mathrm{Bernoulli}\Big(\min\Big(\frac{\mu_{\min}}{\mu_{\max}},\; 1 - \mu_{\max}\Big)\Big),  (6)

where µ_max and µ_min denote the maximal and minimal values in µ over all the remaining sentences, respectively. Thus Done = 1 terminates the sentence extraction when there are no key sentences left. With this mechanism, the agent stops extraction with high probability as soon as the differences among the remaining affinities are small enough or the remaining sentence affinities are very low.
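Eq. (6) transcribes directly into code; this sketch assumes the affinities lie in [0, 1] and that at least one sentence remains:

import torch

def should_terminate(mu_remaining):
    # Eq. (6): draw Done ~ Bernoulli(min(mu_min / mu_max, 1 - mu_max))
    # over the affinities of the sentences not yet selected.
    mu_max, mu_min = mu_remaining.max(), mu_remaining.min()
    p = torch.minimum(mu_min / mu_max, 1 - mu_max)
    return bool(torch.bernoulli(p))  # True -> stop extracting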

2.4 Training

After the agent sequentially takes actions a_t until termination, we obtain a summary induced by the action sequence a out of a document D. The agent then receives a reward R(SUM; G), where G is the gold-standard summary of D. R(SUM; G) is computed


by the average of three variants of ROUGE (Lin, 2004). To balance precision and recall, we use the F-score here:

R(\mathrm{SUM}; G) = \frac{1}{3}\big(\text{ROUGE-1}_f(\mathrm{SUM}; G) + \text{ROUGE-2}_f(\mathrm{SUM}; G) + \text{ROUGE-L}_f(\mathrm{SUM}; G)\big).  (7)
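Since Sec. 3 notes that training-time ROUGE is computed with the pltrdy/rouge Python package, Eq. (7) can be sketched as follows, assuming that package's get_scores interface:

from rouge import Rouge  # the Python implementation cited in Sec. 3

_rouge = Rouge()

def reward(summary, reference):
    # Eq. (7): mean of the ROUGE-1, ROUGE-2 and ROUGE-L F-scores between
    # the extracted summary and the gold-standard summary.
    s = _rouge.get_scores(summary, reference)[0]
    return (s["rouge-1"]["f"] + s["rouge-2"]["f"] + s["rouge-l"]["f"]) / 3.0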

We represent the whole extractive neural network as p_θ(·|D), containing the encoder in rough reading and the decoder in careful reading. The goal of our model is to find parameters θ of p_θ that produce high-quality summaries and maximize the reward (cf. Eq. (8)). However, we cannot maximize Eq. (8) with gradient ascent directly, as the actions are sampled discretely. We therefore use the likelihood-ratio gradient estimator, also known as REINFORCE (Williams, 1992; Sutton et al., 2000), to obtain the gradient in Eq. (9).

We use Q(D) in Eq. (10) to construct p_θ(a|D) following Dong et al. (2018), where z(D) = \sum_t \mu_t(D) and ε is the exploration probability of the ε-greedy policy described in Sec. 2.3.2. M is the number of extracted sentences, which is determined jointly by the termination mechanism and the document context. Q(D)^{1/M} is adopted to represent p_θ(a|D) so as to avoid extracting too few or too many sentences when maximizing the objective function. Hence, \nabla_\theta \log p_\theta(a|D) in Eq. (9) can be easily computed.

J(\theta) = \mathbb{E}[R(\mathrm{SUM}; G)]  (8)

\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log p_\theta(a|D)\, R(\mathrm{SUM}; G)]  (9)

Q(D) = \prod_{j=1}^{M} \Big( \frac{\epsilon}{N - j + 1} + \frac{(1 - \epsilon)\, \mu_{a_j}(D)}{z(D) - \sum_{k=1}^{j-1} \mu_{a_k}(D)} \Big)  (10)
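As a concrete reading of Eq. (10), the sketch below computes log p_θ(a|D) = (1/M) log Q(D) for a given action sequence (a direct transcription of the formula; the names are ours):

import torch

def log_policy_prob(mu, actions, eps=0.1):
    # (1/M) log Q(D): at step j (0-indexed), the probability of picking
    # a_j mixes the eps-uniform draw over the N - j remaining sentences
    # with the (1 - eps) renormalized affinity mass of a_j (Eq. 10).
    N, M = mu.numel(), len(actions)
    z = mu.sum()
    log_q, taken = 0.0, 0.0
    for j, a in enumerate(actions):
        p = eps / (N - j) + (1 - eps) * mu[a] / (z - taken)
        log_q = log_q + torch.log(p)
        taken = taken + mu[a]
    return log_q / M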

However, the exact document distribution is unknown, so we cannot evaluate the expected value in Eq. (9); we estimate it by sampling instead. Given a document-summary pair (D, G), HER samples B summaries induced by a^1, …, a^B from p_θ(·|D), obtains all the gradients, and takes their average as the estimate. As the sample-based gradient estimate may have high variance, we use a baseline for variance reduction. The gradient of the objective function is finally represented as

\nabla_\theta J(\theta) \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_\theta \log p_\theta(a^b|D)\, \big( R(a^b; G) - \bar{r} \big)  (11)

where we use self-critical reinforcement learning to obtain the baseline r̄ following Ranzato et al. (2015); Rennie et al. (2017); Paulus et al. (2017); Dong et al. (2018), computed by greedy decoding: r̄ = R(a_greedy; G). More concretely, a_greedy = argmax_a p_θ(a|D), and this baseline ensures that the probability of a sampled sequence is increased when the summary it induces is better than the one obtained by greedy decoding. The procedure of HER is shown in Algorithm 1.

Algorithm 1 HER: training
Input: {(D_i, G_i)}, a dataset of document-summary pairs
Output: θ, the updated parameters of p_θ
1:  for each document-summary pair (D, G) do
2:    S_{1:N}, S̄, L_{1:K} = Encoder(D)                ▹ Rough Reading
3:    µ_{1:N} = Decoder(S_{1:N}, S̄, L_{1:K})          ▹ Careful Reading
4:    for each trial b = 1, …, B do
5:      Initialize the action set A = {1, …, N}
6:      for time step t = 1, …, N do
7:        Sample u ∼ U(0, 1)
8:        if u < ε then
9:          Uniformly sample a_t from A
10:       else
11:         a_t ∼ Categorical(µ_A)
12:       Get termination flag Done by Eq. (6)
13:       Let M = t
14:       if Done then
15:         Break the inner loop
16:       else
17:         Update the action set A = A \ {a_t}
18:     Generate summary SUM^b = (S_{a_1}, …, S_{a_M})
19:     l_b = (1/M) Σ_{t=1}^{M} log( ε/(N−t+1) + (1−ε)µ_{a_t} / (Σ_j µ_j − Σ_{k=1}^{t−1} µ_{a_k}) )
20:     Compute reward r_b = R(SUM^b; G)
21:    r̄ = R(SUM_greedy; G)
22:    l = (1/B) Σ_{b=1}^{B} l_b (r̄ − r_b)            ▹ Surrogate loss
23:    θ ← θ − λ ∇_θ l                                 ▹ Update
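Lines 19–23 of Algorithm 1 amount to the following self-critical surrogate loss (a sketch under our naming; minimizing it with gradient descent performs the update in line 23):

import torch

def surrogate_loss(log_probs, rewards, baseline):
    # log_probs[b]: (1/M) sum_t log-prob of sampled summary b (Alg. line 19);
    # rewards[b]: R(SUM^b; G) (line 20); baseline: reward of the greedily
    # decoded summary (line 21). Returns l = (1/B) sum_b l_b (r_bar - r_b).
    log_probs = torch.stack(log_probs)             # (B,)
    advantages = torch.tensor(rewards) - baseline  # r_b - r_bar
    return -(log_probs * advantages).mean()        # minimize -> ascend Eq. (8)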

    3 Experiment Settings

In this section we present our experimental setup for evaluating the performance of the proposed HER, including the datasets, evaluation protocol, baselines, and implementation details.

Datasets: We evaluate our models on three datasets: CNN, DailyMail, and the combined CNN/DailyMail (Hermann et al., 2015; Nallapati et al., 2016). We use the standard splits of Hermann et al. (2015) for training, validation, and test (90,266/1,220/1,093 documents for CNN and 196,961/12,148/10,397 for DailyMail), with the same setting as See et al. (2017).

Evaluation: We evaluate summarization quality using F1 ROUGE (Lin, 2004), including unigram and bigram overlap (ROUGE-1 and ROUGE-2) to assess informativeness and the longest common subsequence (ROUGE-L) to assess fluency against the reference summaries. We obtain ROUGE scores using a faster Python implementation¹ for training and evaluation, and the standard pyrouge package² for testing, following Dong et al. (2018).

Baselines: We compare our proposed HER against four kinds of extractive methods: (1) the Lead-3 model, which simply selects the first three sentences; (2) NN-SE (Cheng and Lapata, 2016) and SummaRuNNer (Nallapati et al., 2017), which treat extraction as sequence labeling and are trained with a cross-entropy loss; (3) Refresh (Narayan et al., 2018), DQN (Yao et al., 2018), and RNES (Wu and Hu, 2018), which extract summaries via reinforcement learning; and (4) BANDITSUM (Dong et al., 2018), which treats extractive summarization as a contextual bandit but does not simulate the human reading cognition process.

Implementation Details: We initialize word embeddings with 100-dimensional GloVe embeddings (Pennington et al., 2014). In rough reading, the encoder is hierarchical and each layer is a two-stacked BiLSTM with a hidden size of 200; therefore, the sentence vectors and the document representation S̄ have a dimension of 400. For the CNN variant, we adopt filter windows H ∈ {1, 2, 3} with 100 feature maps each and generate K = 3 local representations for each document. In careful reading, we set ε = 0.1 for the bandit policy. We also bound the minimum and maximum numbers of selected sentences to 1 and 10 for the termination mechanism. During training, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 10^{-5}, beta parameters (0, 0.999), and a weight decay of 10^{-6} to maximize the objective function, following Dong et al. (2018). We employ gradient clipping of 1 for regularization and sample B = 20 times for each document. We train our model for two epochs. Note that we choose the whole document as the final summary if the document is shorter than 3 sentences, as the local features cannot be obtained through the CNN-based network.

    4 Experimental Results

4.1 Quantitative Analysis

We first report the ROUGE metrics on the combined CNN/DailyMail test set in Table 1 and the separate results in Table 2. Several observations can be made from these two tables.

¹ https://github.com/pltrdy/rouge
² pyrouge is a Python package; we compute all ROUGE scores with parameters "-a -c 95 -m -n 4 -w 1.2". Refer to https://pypi.python.org/pypi/pyrouge/0.1.3

Model         R1    R2    RL
Lead-3        40.0  17.5  36.2
SummaRuNNer   39.6  16.2  35.3
DQN           39.4  16.1  35.6
Refresh       40.0  18.2  36.6
RNES          41.3  18.9  37.6
BANDITSUM     41.5  18.7  37.6
HER           42.3  18.9  37.9

Table 1: Results on the combined CNN/DailyMail test set. We report F1 scores of ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL). The result of Lead-3 is taken from Dong et al. (2018).

              CNN                DailyMail
Model         R1    R2    RL     R1    R2    RL
Lead-3        28.8  11.0  25.5   41.2  18.2  37.3
NN-SE         28.4  10.0  25.0   36.2  15.2  32.9
Refresh       30.4  11.7  26.9   41.0  18.8  37.7
BANDITSUM     30.7  11.6  27.4   42.1  18.9  38.3
HER           30.7  11.5  27.5   42.7  19.0  38.5

Table 2: Results on the test sets of the CNN and DailyMail datasets separately.

Firstly, our model generally performs best and even surpasses 42 ROUGE-1 points on the combined CNN/DailyMail dataset. It also shows better results on the separate datasets. We argue that the global and local features from rough reading help extract summaries by capturing deep contextual relations, and the designed structure in careful reading makes the model more flexible in selecting sentence sets. Hence a two-stage framework based on human reading cognition is more appropriate for extractive summarization.

Secondly, directly optimizing the evaluation metric by combining the cross-entropy loss with rewards can improve extractive results. The RL-based methods Refresh (Narayan et al., 2018) and RNES (Wu and Hu, 2018) perform better than sequence labeling methods such as SummaRuNNer (Nallapati et al., 2017). BANDITSUM (Dong et al., 2018) generally performs better than the other baselines, suggesting that framing extractive summarization as a contextual bandit is more suitable than the sequence labeling setting and also offers a larger search space than the other RL-based methods (Narayan et al., 2018; Yao et al., 2018; Wu and Hu, 2018).

    4.2 Ablation Test

Next, we conduct an ablation test by removing the modules of the proposed HER step by step. First, we replace the automatic termination mechanism with a fixed extraction strategy that always selects



Model                   R1    R2    RL
HER                     42.3  18.9  37.9
HER-3                   42.0  18.5  37.6
HER-3 w/o policy        41.7  18.3  37.1
HER-3 w/o policy & L    41.2  18.4  37.0
HER-3 w/o policy & F    40.6  18.2  36.9

Table 3: Results of the ablation test on the test split of the combined CNN/DailyMail dataset. L and F are short for local net and rough reading, respectively.

three sentences for every document; we denote this model HER-3. Based on HER-3, we further remove the bandit policy, local net, and general net step by step, denoting the variants HER-3 w/o policy, HER-3 w/o policy & local net, and HER-3 w/o policy & rough reading, respectively. The results are reported in Table 3 and demonstrate the effectiveness of each proposed module. Firstly, HER with its automatic termination mechanism is more flexible and reliable in extracting varying numbers of sentences across different documents. Secondly, HER uses ε-greedy to select sentences, which raises the chances of discovering important but easily ignored information. Thirdly, the general cognition formed in the rough reading process is useful in extractive summarization.

    4.3 A Closer Look

To verify whether our proposed HER can simulate the human reading cognitive process, and whether such simulation is inherently helpful for extractive summarization, we conduct extensive evaluations and try to answer the following three questions.

(1) Can the CNN-based network extract local features of different paragraphs?

In Figure 3, we report the distribution of selected sentences' positions for our proposed model HER, for BANDITSUM, and for HER w/o Local Net. We test each model at 10k, 50k, and 100k training steps. The figure shows that all three models initially focus on different parts of the context to form summaries, and BANDITSUM performs best after 10k training steps. However, as training proceeds, BANDITSUM and HER w/o Local Net begin to prefer earlier sentences. HER, on the other hand, keeps focusing on various paragraphs and extracting information from different parts of the text throughout training. The contextual bandit (CB) based frameworks seem able to attend to various parts of the contexts to some degree in the beginning.

Figure 3: The indexes of selected sentences versus document length for HER, BANDITSUM (Dong et al., 2018), and HER w/o Local Net, each tested at 10k, 50k, and 100k training steps. Reported on the test-split documents whose lengths are less than 80 sentences.

However, with continued training, both BANDITSUM and HER w/o Local Net start to focus on earlier sentences, owing to the fact that sentences similar to the main idea usually lie at the head of the text. As our proposed HER is equipped with a CNN variant to extract local features, our model can focus on the gist of paragraphs rather than only the first several sentences. It also encourages exploration, extracting information from various positions more flexibly.

(2) Can the proposed bandit policy discover low-score but easily ignored information?

To answer this question, we demonstrate a detailed case of sentence selection in Figure 4. We observe that although the fourth sentence has a high affinity, it should not be included in the summary, since its meaning is close to that of the third sentence, which has already been extracted. Instead, the 13th sentence is supposed to be selected even though it has a low affinity. Since HER adopts the ε-greedy policy, it can explore such sentences and extract them correctly.

(3) Can HER extract varied but proper numbers of sentences?

Figure 4: A case of sentence selection by HER and HER w/o policy on an article from the CNN dataset, listing four sentences (indexes 2, 3, 4, and 13) together with their affinities and whether each model extracts them. The highlighted indices indicate the sentences which should be extracted as the summary.

Figure 5: The statistics of the number of sentences extracted by our model against the reference summaries. Frequency is the number of documents.

We answer this question by plotting the frequency distribution of the number of sentences extracted by our model on the test set of the combined CNN/DailyMail dataset; Figure 5 exhibits the results. We observe that the frequency distribution of the extracted sentence number is broadly similar to that of the gold-standard summaries. Compared with BANDITSUM, which extracts a fixed number of sentences, our model shows more flexibility and extensibility in extractive summarization.

    4.4 Human Evaluation

Lastly, we conduct a qualitative evaluation. Following Wu and Hu (2018), we randomly sample 50 documents from the test set of the combined CNN/DailyMail dataset and ask three volunteers to evaluate the summaries extracted by HER w/o Dec2, HER w/o Local Net, BANDITSUM, and HER. HER w/o Dec2 only uses Eq. (3) to compute the sentence affinities, without inhibiting redundant sentences. For each document-summary pair, the volunteers are asked to rank the output of each system on three aspects: overall quality, coverage, and non-redundancy. The best system is ranked 1 and so on, and two systems are ranked the same if their extracted summaries are identical. We report the average results in Table 4, which show that HER leads BANDITSUM on overall quality and coverage.

Model          Overall  Coverage  Non-Redundancy
HER w/o Dec2   2.88     2.81      2.81
HER w/o L      3.02     2.96      2.75
BANDITSUM      2.06     2.07      1.91
HER            1.81     1.75      1.97

Table 4: Average rank in the human evaluation in terms of overall quality, coverage, and non-redundancy. L is short for Local Net. Lower is better.

Additionally, HER w/o Dec2 performs the worst on non-redundancy, as it does not penalize unselected sentences that are similar to already extracted ones. Furthermore, HER w/o Local Net performs poorly on coverage, because the local features focus on paragraphical messages and help capture comprehensive information.

    5 Related Work

Extractive Text Summarization Researchers have developed many statistical methods for automatic extractive summarization. Traditional methods learn to score each sentence dependently (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wong et al., 2008). Recently, neural network based extractive methods (Cheng and Lapata, 2016; Nallapati et al., 2017; Feng et al., 2018; Shi et al., 2018) usually treat extractive summarization as a sequence labeling task and aim to minimize the cross-entropy objective. Narayan et al. (2017) utilize side information to help the sentence classifier, while Yasunaga et al. (2017) compute the salience of each sentence for selection with graph convolutional networks. In addition, reinforcement learning based methods (Wu and Hu, 2018; Narayan et al., 2018; Yao et al., 2018) have been proposed to directly optimize the evaluation metric ROUGE by combining the cross-entropy loss with rewards from policy gradient reinforcement learning. Dong et al. (2018) treat extractive summarization as a contextual bandit, which performs well especially when good summary sentences appear late in the source document. Recently, Nallapati et al. (2017); Chen and Bansal (2018); Hsu et al. (2018) have proposed unified models that combine the advantages of both extractive and abstractive methods.

Human Reading-inspired Strategy in NLP Recently, several pioneering studies have explored how to adapt the human reading cognition process, usually comprising pre-reading, reading, and post-reading (Avery and Graves, 1997; Saricoban, 2002; Toprak and Almacıoğlu, 2009; Pressley and Afflerbach, 2012), to various NLP applications. For example, Li et al. (2018) solved document-based question answering by simulating human reading strategies. Luo et al. (2018, 2019) utilized prior knowledge of human reading to solve sub-tasks in sentiment analysis. Song et al. (2017, 2018) enhanced word embeddings in a similar way. Yang et al. (2019) applied the idea to abstractive summarization, Zheng et al. (2019) simulated human behavior for reading comprehension, and Lei et al. (2019) utilized human-like semantic cognition for aspect-level sentiment classification. In this paper, we attempt to perform extractive summarization under the inspiration of human reading cognition.

    6 Conclusion

Inspired by the reading cognition of human beings, we propose HER, a two-stage method that mimics how people extract summaries. The whole learning process is formulated as a contextual bandit trained with policy gradient reinforcement learning. In rough reading, two neural networks encode coarse-grained information. In careful reading, repeated reading is conducted to select fine-grained sentences as the summary. Experiments on two real-world datasets demonstrate that our proposed model significantly outperforms state-of-the-art extractive methods on summary quality, coverage, and non-redundancy.

    Acknowledgements

This work is partially supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104, the National Natural Science Foundation of China under Grants No. 91846113, U1811461, 61602438, and 61573335, and the CCF-Tencent RhinoBird Young Faculty Open Research Fund No. RAGR20180111. This work is also funded in part by Ant Financial through the Ant Financial Science Funds for Security Research. Xiang Ao is also supported by the Youth Innovation Promotion Association CAS. This work was also partially supported by the SIAT Innovation Program for Excellent Young Researchers (Grant No. Y8G027). Min Yang was sponsored by the CCF-Tencent Open Research Fund.

References

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing.

Patricia G. Avery and Michael F. Graves. 1997. Scaffolding young learners' reading of social studies texts. Social Studies and the Young Learner, 9(4):10–14.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 675–686.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In EMNLP.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Chong Feng, Fei Cai, Honghui Chen, and Maarten de Rijke. 2018. Attentive encoder-based extractive text summarization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824.


Zeyang Lei, Yujiu Yang, Min Yang, Wei Zhao, Jun Guo, and Yi Liu. 2019. A human-like semantic cognition network for aspect-level sentiment classification. In AAAI.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM.

Weikang Li, Wei Li, and Yunfang Wu. 2018. A unified model for document-based question answering based on human-like reading strategy. In Thirty-Second AAAI Conference on Artificial Intelligence.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang. 2019. Reinforced training data selection for domain adaptation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1957–1968, Florence, Italy.

Ling Luo, Xiang Ao, Feiyang Pan, Jin Wang, Tong Zhao, Ningzi Yu, and Qing He. 2018. Beyond polarity: Interpretable financial sentiment analysis with hierarchical query-driven attention. In IJCAI, pages 4244–4250.

Ling Luo, Xiang Ao, Yan Song, Jinyao Li, Xiaopeng Yang, Qing He, and Dong Yu. 2019. Unsupervised neural aspect extraction with sememes. In IJCAI.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Shashi Narayan, Nikos Papasarantopoulos, Shay B. Cohen, and Mirella Lapata. 2017. Neural extractive summarization with side information. arXiv preprint arXiv:1704.04530.

Ani Nenkova, Kathleen McKeown, et al. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.

Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, and Qing He. 2019a. Policy gradients for contextual recommendations. In The World Wide Web Conference, pages 1421–1431. ACM.

Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, and Pingzhong Tang. 2019b. Policy optimization with model-based explorations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4675–4682.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Michael Pressley and Peter Afflerbach. 2012. Verbal Protocols of Reading: The Nature of Constructively Responsive Reading. Routledge.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Arif Saricoban. 2002. Reading strategies of successful readers through the three phase approach. The Reading Matrix, 2(3).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, and Hanwang Zhang. 2018. DeepChannel: Salience estimation by contrastive learning for extractive document summarization. arXiv preprint arXiv:1811.02394.

Yan Song, Prescott Klassen, Fei Xia, and Chunyu Kit. 2012. Entropy-based training data selection for domain adaptation. In Proceedings of COLING 2012, pages 1191–1200, Mumbai, India.

Yan Song, Chia-Jung Lee, and Fei Xia. 2017. Learning word representations with regularization from prior knowledge. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 143–152, Vancouver, Canada.

Yan Song, Shuming Shi, and Jing Li. 2018. Joint learning embeddings for Chinese words and their components via ladder structured networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4375–4381.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Elif Leyla Toprak and Gamze Almacıoğlu. 2009. Three reading phases and their applications in the teaching of English as a foreign language in reading classes with young learners. Journal of Language and Linguistic Studies, 5(1):20–36.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics (Volume 1), pages 985–992. Association for Computational Linguistics.

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Min Yang, Qiang Qu, Wenting Tu, Ying Shen, Zhou Zhao, and Xiaojun Chen. 2019. Exploring human-like reading strategy for abstractive text summarization. In AAAI.

Kaichun Yao, Libo Zhang, Tiejian Luo, and Yanjun Wu. 2018. Deep reinforcement learning for extractive document summarization. Neurocomputing.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning.

Yukun Zheng, Jiaxin Mao, Yiqun Liu, Zixin Ye, Min Zhang, and Shaoping Ma. 2019. Human behavior inspired machine reading comprehension. In SIGIR.