

Unsupervised Sentence Enhancement for Automatic Summarization

Jackie Chi Kit Cheung
University of Toronto
10 King’s College Rd., Room 3302
Toronto, ON, Canada M5S 3G4
[email protected]

Gerald Penn
University of Toronto
10 King’s College Rd., Room 3302
Toronto, ON, Canada M5S 3G4
[email protected]

Abstract

We present sentence enhancement as a novel technique for text-to-text generation in abstractive summarization. Compared to extraction or previous approaches to sentence fusion, sentence enhancement increases the range of possible summary sentences by allowing the combination of dependency subtrees from any sentence in the source text. Our experiments indicate that our approach yields summary sentences that are competitive with a sentence fusion baseline in terms of content quality, but better in terms of grammaticality, and that the benefit of sentence enhancement relies crucially on an event coreference resolution algorithm using distributional semantics. We also consider how text-to-text generation approaches to summarization can be extended beyond the source text by examining how human summary writers incorporate source-text-external elements into their summary sentences.

1 Introduction

Sentence fusion is the technique of merging several input sentences into one output sentence while retaining the important content (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Thadani and McKeown, 2013). For example, the input sentences in Figure 1 may be fused into one output sentence.

As a text-to-text generation technique, sentence fusion is attractive because it provides an avenue for moving beyond sentence extraction in automatic summarization, while not requiring deep semantic analysis beyond, say, a dependency parser and lexical semantic resources.

Input: Bil Mar Foods Co., a meat processor owned by Sara Lee, announced a recall of certain lots of hot dogs and packaged meat.

Input: The outbreak led to the recall on Tuesday of 15 million pounds of hot dogs and cold cuts produced at the Bil Mar Foods plant.

Output: The outbreak led to the recall on Tuesday of lots of hot dogs and packaged meats produced at the Bil Mar Foods plant.

Figure 1: An example of fusing two input sentences into an output sentence. The sections of the input sentences that are retained in the output are shown in bold.

The overall trajectory pursued in the field can be characterized as a move away from local contexts relying heavily on the original source text towards more global contexts involving reformulation of the text. Whereas sentence extraction and sentence compression (Knight and Marcu, 2000, for example) involve taking one sentence and perhaps removing parts of it, traditional sentence fusion involves reformulating a small number of relatively similar sentences in order to take the union or intersection of the information present therein.

In this paper, we move further along this path in the following ways. First, we present sentence enhancement as a novel technique which extends sentence fusion by combining the subtrees of many sentences into the output sentence, rather than just a few. Doing so allows relevant information from sentences that are not similar to the original input sentences to be added during fusion. As shown in Figure 2, the phrase of food-borne illness can be added to the previous output sentence, despite originating in a source text sentence that is quite different overall.

Source text: This fact has been underscored in the last few months by two unexpected outbreaks of food-borne illness.

Output: The outbreak of food-borne illness led to the recall on Tuesday of lots of hot dogs and meats produced at the Bil Mar Foods plant.

Figure 2: An example of sentence enhancement, in which parts of dissimilar sentences are incorporated into the output sentence.

Elsner and Santhanam (2011) proposed a supervised method to fuse disparate sentences, which takes as input a small number of sentences with compatible information that have been manually identified by editors of articles. By contrast, our algorithm is unsupervised, and tackles the problem of identifying compatible event mergers in the entire source text using an event coreference module. Our method outperforms a previous syntax-based sentence fusion baseline on measures of summary content quality and grammaticality.

Second, we analyze how text-to-text generation systems may make use of text that is not in the source text itself, but in articles on a related topic in the same domain. By examining the parts of human-written summaries that are not found in the source text, we find that using in-domain text allows summary writers to more precisely express some target semantic content, but that more sophisticated computational semantic techniques will be required to enable automatic systems to likewise do so.

A more general argument of this paper is that the apparent dichotomy between text-to-text generation and semantics-to-text generation can be resolved by viewing them simply as having different starting points towards the same end goal of precise and wide-coverage NLG. The statistical generation techniques developed by the text-to-text generation community have been successful in many domains. Yet the results of our experiments and studies demonstrate the following: as text-to-text generation techniques move beyond using local contexts towards more dramatic reformulations of the kind that human writers perform, more semantic analysis will be needed in order to ensure that the reformulations preserve the inferences that can be drawn from the input text.

2 Related Work

A relatively large body of work exists in sentence compression (Knight and Marcu, 2000; McDonald, 2006; Galley and McKeown, 2007; Cohn and Lapata, 2008; Clarke and Lapata, 2008, inter alia) and sentence fusion (Barzilay and McKeown, 2005; Marsi and Krahmer, 2005; Filippova and Strube, 2008; Filippova, 2010; Thadani and McKeown, 2013). Unlike this work, our sentence enhancement algorithm considers the entire source text and is not limited to the initial input sentences. Few previous papers focus on combining the content of diverse sentences into one output sentence. Wan et al. (2008) propose sentence augmentation by identifying “seed” words in a single original sentence, then adding information from auxiliary sentences based on word co-occurrence counts. Elsner and Santhanam (2011) investigate the idea of fusing disparate sentences with a supervised algorithm, as discussed above.

Previous studies on cut-and-paste summarization (Jing and McKeown, 2000; Saggion and Lapalme, 2002) investigate the operations that human summarizers perform on the source text in order to produce the summary text. Our previous work argued that current extractive systems rely too heavily on notions of information centrality (Cheung and Penn, 2013). This paper extends that work by identifying specific linguistic factors correlated with the use of source-text-external elements.

3 A Sentence Enhancement Algorithm

The basic steps in our sentence enhancement algorithm are as follows: (1) clustering to identify initial input sentences, (2) sentence graph creation, (3) sentence graph expansion, (4) tree generation, and (5) linearization.

At a high level, our method for sentence enhancement is inspired by the syntactic sentence fusion approach of Filippova and Strube (2008) (henceforth, F&S), originally developed for German, in that it operates over the dependency parses of a small number of input sentences to produce an output sentence which fuses parts of the input sentences. We adopt the same assumption as F&S that these initial core sentences have a high degree of similarity with each other, and should form the core of a new sentence to be generated (Step 1). While fusion from highly disparate input sentences is possible, Elsner and Santhanam (2011) showed how difficult it is to do so correctly, even where such cases are manually identified. We thus aim for a more targeted type of fusion initially. Next, the dependency trees of the core sentences are fused into an intermediate sentence graph (Step 2), a directed acyclic graph from which the final sentence will be generated (Steps 4 and 5). We will compare against our implementation of F&S, adapted to English.

However, unlike F&S or other previous approaches to sentence fusion, the sentence enhancement algorithm may also avail itself of the dependency parses of all of the other sentences in the source text, which expands the range of possible sentences that may be produced. This is accomplished by expanding the sentence graph with parts of these sentences (Step 3). One important issue here is that the expansion must be modulated by an event coreference component to ensure that the merging of information from different points in the source text is valid and does not result in incorrect or nonsensical inferences.

3.1 Core sentence identification

To generate the core sentence clusters, we first identify clusters of similar sentences, then rank the clusters according to their salience. The top cluster in the source text is then selected to be the input to the sentence fusion algorithms.

Sentence alignment is performed by complete-link agglomerative clustering, which requires a measure of similarity between sentences and a stopping criterion. We define the similarity between two sentences to be the standard cosine similarity between the lemmata of the sentences, weighted by IDF and excluding stopwords, and clustering is run until a similarity threshold of 0.5 is reached. Since complete-link clustering prefers small coherent clusters and we select the top-scoring cluster in each document collection, the method is somewhat robust to different choices of the stopping threshold.
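The following is a minimal sketch of this clustering step in Python. The helper names, the dict-based sparse vectors, and the precomputed IDF table are our own illustrative assumptions, not details of the paper's implementation:

```python
from collections import Counter
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors represented as dicts.
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_vector(lemmata, idf):
    # IDF-weighted term-frequency vector over a sentence's
    # stopword-filtered lemmata.
    return {w: c * idf.get(w, 1.0) for w, c in Counter(lemmata).items()}

def complete_link_clusters(sentences, idf, threshold=0.5):
    # Complete-link agglomerative clustering: the similarity of two clusters
    # is the similarity of their *least* similar sentence pair; stop merging
    # once no cluster pair reaches the threshold.
    vecs = [sentence_vector(s, idf) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def link(c1, c2):
        return min(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[i], clusters[j]) < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

The top-scoring cluster under the salience score described next would then serve as the core sentences.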

[Figure 3: An example of the input dependency trees for sentence graph creation and expansion, using the input sentences of Figure 1. (a) Abbreviated dependency trees. (b) Sentence graph after merging the nodes with lemma recall (in bold), and expanding the node outbreak (dashed outgoing edge).]

The clusters are scored according to the signature term method of Lin and Hovy (2000), which assigns an importance score to each term according to how much more often it appears in the source text compared to some irrelevant background text, using a log likelihood ratio. Specifically, the score of a cluster is equal to the sum of the importance scores of the set of lemmata in the cluster.
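To make the scoring concrete, here is one common form of the log likelihood ratio statistic for signature terms, together with the cluster score. Treat this as a sketch of the general recipe rather than the exact implementation of Lin and Hovy (2000):

```python
import math

def llr(k_fg, n_fg, k_bg, n_bg):
    # Log likelihood ratio that a term occurs more often in the source
    # (foreground) text than expected from the background text, given
    # term counts k and total counts n in each.
    def ll(k, n, p):
        # Binomial log likelihood, guarding against log(0).
        if p <= 0 or p >= 1:
            return 0.0
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p_all = (k_fg + k_bg) / (n_fg + n_bg)
    p_fg, p_bg = k_fg / n_fg, k_bg / n_bg
    return 2 * (ll(k_fg, n_fg, p_fg) + ll(k_bg, n_bg, p_bg)
                - ll(k_fg, n_fg, p_all) - ll(k_bg, n_bg, p_all))

def cluster_score(cluster_lemmata, importance):
    # Score of a cluster: sum of importance scores over its *set* of lemmata.
    return sum(importance.get(w, 0.0) for w in set(cluster_lemmata))
```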

3.2 Sentence graph creation

After core sentence identification, the next step is to align the nodes of the dependency trees of the core input sentences in order to create the initial sentence graph. The input to this step is the collapsed dependency tree representations of the core sentences produced by the Stanford parser.1 In this representation, preposition nodes are collapsed into the label of the dependency edge between the functor of the prepositional phrase and the prepositional object. Chains of conjuncts are also split, and each argument is attached to the parent. In addition, auxiliary verbs, negation particles, and noun-phrase-internal elements2 are collapsed into their parent nodes. Figure 3a shows the abbreviated dependency representations of the input sentences from Figure 1.

1 As part of the CoreNLP suite: http://nlp.stanford.edu/software/corenlp.shtml
2 As indicated by the dependency edge label nn.

Then, a sentence graph is created by merging nodes that share a common lemma and part-of-speech tag. In addition, we allow synonyms to be merged, defined as being in the same WordNet synset. Merging is blocked if the word is a stop word, which includes function words as well as a number of very common verbs (e.g., be, have, do). Throughout the sentence graph creation and expansion process, the algorithm disallows the addition of edges that would result in a cycle in the graph.
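A sketch of the node-merging test, using NLTK's WordNet interface; the small stopword set and the (lemma, POS) node representation are illustrative assumptions on our part:

```python
from nltk.corpus import wordnet as wn

# A few function words and light verbs for which merging is blocked;
# the actual stopword list is larger.
STOP_WORDS = {"be", "have", "do", "the", "a", "an", "of", "to", "in"}

def synonyms(lemma, pos):
    # WordNet synset members for a lemma, used to license synonym merges.
    tag = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}.get(pos[:2])
    if tag is None:
        return set()
    return {l.name() for s in wn.synsets(lemma, pos=tag) for l in s.lemmas()}

def mergeable(node_a, node_b):
    # Nodes are (lemma, POS) pairs; merge on identical lemma and POS,
    # or on a shared WordNet synset, unless either word is a stopword.
    (la, pa), (lb, pb) = node_a, node_b
    if la in STOP_WORDS or lb in STOP_WORDS:
        return False
    if pa[:2] != pb[:2]:
        return False
    return la == lb or lb in synonyms(la, pa)
```

A merge that passes this test would still be rejected if adding the resulting edges introduced a cycle into the graph.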

3.3 Sentence graph expansion

The initial sentence graph is expanded by merging in subtrees from dependency parses of non-core sentences drawn from the source text. First, expansion candidates are identified for each node in the sentence graph by finding all of the dependency edges in the source text from non-core sentences in which the governor of the edge shares the same lemma and POS tag as the node in the sentence graph.

Then, these candidate edges are pruned according to two heuristics. The first is to keep only one candidate edge of each dependency relation type, namely the edge with the highest informativeness score (Section 3.4.1), with ties broken in favour of the edge whose subtree has fewer nodes. The second is to perform event coreference in order to prune away those candidate edges which are unlikely to be describing the same event as the core sentences, as explained in the next section. Finally, any remaining candidate edges are fused into the sentence graph, and the subtree rooted at the dependent of the candidate edge is added to the sentence graph as well. See Figure 3b for an example of sentence graph creation and expansion.

3.3.1 Event coreference

One problem of sentence fusion is that the different inputs of the fusion may not refer to the same event, resulting in an incorrect merging of information, as would be the case in the following example:

S1: Officers pled not guilty but risked 25 years to life.
S2: Officers recklessly engaged in conduct which seriously risked the lives of others.

Here, the first usage of risk refers to the potential sentence imposed if the officers are convicted in a trial, whereas the second refers to the potential harm caused by the officers’ conduct.

[Figure 4: Event coreference resolution as a maximum-weight bipartite graph matching problem. All the nodes share the predicate risk. Context 1 (Officers ... risked 25 years to life ...) yields the triples (nsubj, officers) and (dobj, life); Context 2 (... conduct seriously risked the lives ...) yields (nsubj, conduct), (advmod, seriously), and (dobj, life). Example edge weights: sim1((risk, dobj), (risk, dobj)) × sim2(life, life) = 1.0; sim1((risk, nsubj), (risk, nsubj)) × sim2(officer, conduct) = 0.38.]

In order to ensure that sentence enhancement does not lead to the merging of such incompatible events, we designed a simple method to approximate event coreference resolution that does not require event coreference labels. This method is based on the intuition that different mentions of an event should contain many of the same participants. Thus, by measuring the similarity of the arguments and the syntactic contexts between the node in the sentence graph and the candidate edge, we obtain a measure of the likelihood that they refer to the same event.

We would be interested in integrating existing event coreference resolution systems into this step in the future, such as the unsupervised method of Bejan and Harabagiu (2010). Existing event coreference systems tend to focus on cases with different heads (e.g., X kicked Y, then Y was injured), which could increase the possibilities for sentence enhancement, if the event coreference module is sufficiently accurate. However, since our method currently only merges identical heads, we require a more fine-grained method based on distributional measures of similarity.

We measure the similarity of these syntactic contexts by aligning the arguments in the syntactic contexts and computing the similarity of the aligned arguments. These problems can be jointly solved as a maximum-weight bipartite graph matching problem (Figure 4). Formally, let a syntactic context be a list of dependency triples (h, r, a), consisting of a governor or head node h and a dependent argument a in the dependency relation r, where the head node h is fixed across each element of the list. Then, each of the two input syntactic contexts forms one of the two disjoint sets in a complete weighted bipartite graph where each node corresponds to one dependency triple. We define the edge weights according to the similarities of the edge's incident nodes; i.e., between two dependency triples (h1, r1, a1) and (h2, r2, a2). We also decompose the similarity into the similarities between the head and relation types ((h1, r1) and (h2, r2)), and between the arguments (a1 and a2). The edge weight function is thus:

sim((h1, r1, a1), (h2, r2, a2)) = sim1((h1, r1), (h2, r2)) × sim2(a1, a2),    (1)

where sim1 and sim2 are binary functions that represent the similarities between governor-relation pairs and dependents, respectively. We train models of distributional semantics using a large background corpus; namely, the Annotated Gigaword corpus (Napoles et al., 2012). For sim1, we create a vector of counts of the arguments that are seen filling each (h, r) pair, and define the similarity between two such pairs to be the cosine similarity between their argument vectors. For sim2, we create a basic vector-space representation of a word d according to the words that are found in the context of d within a five-word context window, and likewise compute the cosine similarity between the word vectors. These methods of computing distributional similarity are well attested in lexical semantics for measuring the relatedness of words and syntactic structures (Turney and Pantel, 2010), and similar methods have been applied in text-to-text generation by Ganitkevitch et al. (2012), though the focus of that work is to use the paraphrase information thus learned to improve sentence compression.
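A minimal sketch of how the two similarity functions could be estimated from a dependency-parsed background corpus; the data structures here are our own simplifications:

```python
from collections import Counter, defaultdict
import math

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_vectors(triples, tokenized_docs, window=5):
    # arg_vecs[(h, r)] counts the arguments observed filling each
    # governor-relation pair; ctx_vecs[w] counts words within a
    # five-word window of w. Both are estimated from a large corpus.
    arg_vecs = defaultdict(Counter)
    ctx_vecs = defaultdict(Counter)
    for h, r, a in triples:
        arg_vecs[(h, r)][a] += 1
    for doc in tokenized_docs:
        for i, w in enumerate(doc):
            for c in doc[max(0, i - window):i] + doc[i + 1:i + 1 + window]:
                ctx_vecs[w][c] += 1
    return arg_vecs, ctx_vecs

def sim1(hr1, hr2, arg_vecs):
    # Similarity between governor-relation pairs via their argument vectors.
    return cosine(arg_vecs[hr1], arg_vecs[hr2])

def sim2(a1, a2, ctx_vecs):
    # Similarity between dependents via their context-window vectors.
    return cosine(ctx_vecs[a1], ctx_vecs[a2])
```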

The resulting graph matching problem is solved using the NetworkX package for Python.3 The final similarity score is an average of the similarity scores from Equation 1 that participate in the selected matching, weighted by the product of the IDF scores of the dependent nodes of each edge. This final score is compared against a threshold that candidate contexts from the source text must meet in order to be eligible to be merged into the sentence graph. This threshold was tuned by cross-validation, and can remain constant, although retuning it for different domains (a weakly supervised alternative) is likely to be beneficial.

3 http://networkx.github.io/

3.4 Tree generation

The next major step of the algorithm is to extract an output dependency tree from the expanded sentence graph. We formulate this as an integer linear program, in which variables correspond to edges of the sentence graph, and a solution to the linear program determines the structure of an output dependency tree. We use ILOG CPLEX to solve all of the integer linear programs in our experiments.

A good dependency tree must at once express the salient or important information present in the input text as well as be grammatically correct and of a manageable length. These desiderata are encoded into the linear program as constraints or as part of the objective function.

3.4.1 Objective function

We designed an objective function that considers the importance of the words and syntactic relations that are selected, as well as accounting for redundancy in the output sentence. Let X be the set of variables in the program, and let each variable in X take the form x_{h,r,a}, a binary variable that represents whether an edge in the sentence graph from a head node with lemma h to an argument with lemma a in relation r is selected. For a lexicon Σ, our objective function is:

max ∑_{w ∈ Σ} max_{x_{h,r,a} ∈ X s.t. a = w} ( x_{h,r,w} · P(r|h) · I(w) ),    (2)

where P(r|h) is the probability that head h projects the dependency relation r, and I(w) is the informativeness score for word w as defined by Clarke and Lapata (2008). This formulation encourages the selection of words that are informative according to I(w) and syntactic relations that are probable. The inner max function for each w in the lexicon encourages non-redundancy, as each word may only contribute once to the objective value. This function can be rewritten into a form compatible with a standard linear program by the addition of auxiliary variables and constraints. For more details of how this and other aspects of the linear program are implemented, see the supplementary document.
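To make the auxiliary-variable trick concrete, here is one standard linearization of the inner max, sketched with the open-source PuLP solver interface rather than CPLEX. The variable names and data structures are ours, and the well-formedness constraints of Section 3.4.2 are omitted:

```python
import pulp

def extract_tree_ilp(edges, P, I):
    # edges: candidate sentence-graph edges as (h, r, a) lemma triples.
    # P[(h, r)]: probability that head h projects relation r.
    # I[w]: informativeness score of word w.
    prob = pulp.LpProblem("tree_extraction", pulp.LpMaximize)
    x = {e: pulp.LpVariable(f"x_{k}", cat="Binary") for k, e in enumerate(edges)}

    # Linearize the inner max of Equation 2: z[(w, e)] = 1 iff edge e is the
    # single edge allowed to contribute word w's score to the objective.
    z = {}
    for w in {a for (_, _, a) in edges}:
        cand = [e for e in edges if e[2] == w]
        for k, e in enumerate(cand):
            z[(w, e)] = pulp.LpVariable(f"z_{w}_{k}", cat="Binary")
            prob += z[(w, e)] <= x[e]        # only a selected edge can count
        prob += pulp.lpSum(z[(w, e)] for e in cand) <= 1  # w contributes once

    # Objective: sum over words of the single contributing edge's score.
    prob += pulp.lpSum(
        z[(w, e)] * P.get((e[0], e[1]), 0.0) * I.get(w, 0.0)
        for (w, e) in z
    )
    # Tree well-formedness and the content-node limit (Section 3.4.2)
    # would be added here as further linear constraints.
    prob.solve()
    return [e for e in edges if x[e].value() == 1]
```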

3.4.2 Constraints

Well-formedness constraints, taken directly from F&S, ensure that the set of selected edges produces a tree. Another constraint limits the number of content nodes in the tree to 11, which corresponds to the average number of content nodes in human-written summary sentences in the data set. Syntactic constraints aim to ensure grammaticality of the output sentence. In addition to the constraint proposed by F&S regarding subordinating conjunctions, we propose two other ones. The first ensures that a nominal or adjectival predicate must be selected with a copular construction at the top level of a non-finite clause. The second ensures that transitive verbs retain both of their complements in the output.4 Semantic constraints ensure that only noun phrases of sufficiently high similarity which are not in a hyperonym-hyponym or holonym-meronym relation with each other may be joined by coordination.

4 We did not experiment with changing the grammatical voice in the output tree, such as introducing a passive construction if only a direct object is selected, but this is one possible extension of the algorithm.

3.5 Linearization

The final step of our method is to linearize the dependency tree from the previous step into the final sequence of words. We implemented our own linearization method to take advantage of the ordering information that can be inferred from the original source text sentences.

Our linearization algorithm proceeds top-down from the root of the dependency tree to the leaves. At each node of the tree, linearization consists of realizing the previously collapsed elements such as prepositions, determiners, and noun compound elements, then ordering the dependent nodes with respect to the root node and each other. Restoring the collapsed elements is accomplished by simple heuristics. For example, prepositions and determiners precede their accompanying noun phrase.

The dependent nodes are ordered by a sorting algorithm, where the order between two syntactic relations and dependent nodes (r1, a1) and (r2, a2) is determined as follows. First, if a1 and a2 originated from the same source text sentence, then they are ordered according to their order of appearance in the source text. Otherwise, we consider the probability P(r1 precedes r2), and order a1 before a2 iff P(r1 precedes r2) > 0.5. This distribution, P(r1 precedes r2), is estimated by counting and normalizing the order of the relation types in the source text corpus. For the purposes of ordering, the governor node is treated as if it were a dependent node with a special syntactic relation label self. This algorithm always produces an output ordering with a projective dependency tree, which is a reasonable assumption for English.
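A sketch of the ordering comparator; the position bookkeeping is an assumption on our part, and such a pairwise comparator is not guaranteed to be transitive, so this is illustrative only:

```python
from functools import cmp_to_key

def order_dependents(deps, precedes_prob, source_pos):
    # deps: (relation, node) pairs, with the governor included as a
    # pseudo-dependent under the special relation "self".
    # precedes_prob[(r1, r2)]: estimated P(r1 precedes r2) from the corpus.
    # source_pos[node]: (sentence_id, token_index) in the source text.
    def cmp(d1, d2):
        (r1, a1), (r2, a2) = d1, d2
        s1, s2 = source_pos.get(a1), source_pos.get(a2)
        if s1 and s2 and s1[0] == s2[0]:
            # Same source sentence: preserve the original surface order.
            return -1 if s1[1] < s2[1] else 1
        # Otherwise fall back on relation-order statistics.
        return -1 if precedes_prob.get((r1, r2), 0.5) > 0.5 else 1
    return sorted(deps, key=cmp_to_key(cmp))
```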

4 Experiments

4.1 Method

Recent approaches to sentence fusion have often been evaluated as isolated components. For example, F&S evaluate the output sentences by asking human judges to rate the sentences’ informativeness and grammaticality on a 1–5 Likert scale. Thadani and McKeown (2013) combine grammaticality ratings with an automatic evaluation which compares the system output against gold-standard sentences drawn from summarization data sets. However, this evaluation setting still does not reflect the utility of sentence fusion in summarization, because the input sentences come from human-written summaries rather than the original source text.

We adopt a more realistic setting of using sentence fusion in automatic summarization by drawing the input or core sentences automatically from the source text, then evaluating the output of the fusion and expansion algorithm directly as one-sentence summaries according to standard summarization evaluation measures of content quality.

Data preparation. Our experiments are conducted on the TAC 2010 and TAC 2011 Guided Summarization corpora (Owczarzak and Dang, 2010), on the initial summarization task. Each document cluster is summarized by one sentence, generated from an initial cluster of core sentences as described in Section 3.1.

Evaluation measures. We evaluate summary content quality using the word-overlap measures ROUGE-1 and ROUGE-2, as is standard in the summarization community. We also measure the quality of sentences at a syntactic or shallow semantic level, operating over dependency triples, with a measure that we call Pyramid BE. Specifically, we extract all of the dependency triples of the form t = (h, r, a) from the sentence under evaluation and the gold-standard summaries, where h and a are the lemmata of the head and the argument, and r is the syntactic relation, normalized for grammatical voice and excluding the collapsed edges, which are mostly noun-phrase-internal elements and grammatical particles.

Page 7: Unsupervised Sentence Enhancement for Automatic Summarizationjcheung/papers/sent_exp_emnlp... · 2015-01-14 · Unsupervised Sentence Enhancement for Automatic Summarization Jackie

Method         Pyramid BE   ROUGE-1   ROUGE-2   Log Likelihood   Oracle Pyramid BE
Fusion (F&S)   10.61        10.07     2.15      -159.31          28.00
Expansion      8.82         9.41      1.82      -157.46          52.97
+Event coref   11.00        9.76      1.93      -156.20          40.30

Table 1: Results of the sentence enhancement and fusion experiments.

Then, we perform a matching between the set of triples in the sentence under evaluation and in a reference summary, following the Transformed BE method of Tratz and Hovy (2008) with the total weighting scheme. This matching is performed between the sentence and every gold-standard summary, and the maximum of these scores is taken. This score is then divided by the maximum score that is achievable using the number of triples present in the input sentence, as inspired by the Pyramid method. This denominator is more appropriate than the one used in Transformed BE, which is designed for the case where the evaluated summary and the reference summaries are of comparable length.
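The following sketch shows the overall shape of the score, substituting simple exact-triple overlap for the Transformed BE matching of Tratz and Hovy (2008); the weighting details here are a plausible reading, not the exact formulation:

```python
from collections import Counter

def pyramid_be(candidate, references):
    # candidate: set of (h, r, a) triples from the sentence under evaluation.
    # references: list of triple sets, one per gold-standard summary.
    best = max(len(candidate & ref) for ref in references)
    # Pyramid-style denominator: the best score achievable with the same
    # number of triples, where a triple's weight is the number of
    # gold-standard summaries it appears in.
    weights = Counter(t for ref in references for t in ref)
    ideal = sum(sorted(weights.values(), reverse=True)[:len(candidate)])
    return best / ideal if ideal else 0.0
```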

For grammaticality, we parse the output sentences using the Stanford parser5 and use the log likelihood of the most likely parse of the sentence as a coarse estimate of grammaticality. Parse log likelihoods have been shown to be useful in determining grammaticality (Wagner et al., 2009), and many of the problems associated with using them do not apply in our evaluation, because our sentences have a fixed number of content nodes and contain similar content. While we could have conducted a user study to elicit Likert-scale grammaticality judgements, such results are difficult to interpret, and the scores depend heavily on the set of judges and the precise evaluation setting, as is the case for sentence compression (Napoles et al., 2011).

4.2 Results and discussion

As shown in Table 1, sentence enhancement with event coreference outperforms both the sentence fusion algorithm of F&S in terms of the Pyramid BE measure and the baseline expansion algorithm, though only the latter difference is statistically significant (p = 0.019).6 In terms of the ROUGE word-overlap measures, fusion achieves better performance, but it only outperforms the expansion baseline significantly (ROUGE-1: p = 0.021, ROUGE-2: p = 0.012). Note that the ROUGE scores are low because they involve comparing a one-sentence summary against a paragraph-long gold standard. The average log likelihood result suggests that sentence enhancement with event coreference produces sentences that are more grammatical than traditional fusion does, and this difference is statistically significant (p = 0.044). These results show that sentence enhancement with event coreference is competitive with a strong previous sentence fusion method in terms of content, despite having to combine information from more diverse sentences. This does not come at the expense of grammaticality; in fact, it seems that having a greater possible range of output sentences may even improve the grammaticality of the output sentences.

5 The likelihoods are obtained by the PCFG model of CoreNLP version 1.3.2. We experimented with the Berkeley parser (Petrov et al., 2006) as well, with similar results that favour the sentence enhancement with event coreference method, but because the parser failed to parse a number of cases, we do not report those results here.
6 All statistical significance results in this section are for Wilcoxon signed-rank tests.

Oracle score. To examine the potential of sentence enhancement, we computed an oracle score that provides an upper bound on the best possible sentence that may be extracted from the sentence graph. First, we ranked all of the dependency triples found in each gold-standard summary by their score (i.e., the number of gold-standard summaries they appear in). Then, we took the highest-scoring triples from this ranking that are found in the sentence graph until the length limit was reached, and divided by the Pyramid-based denominator as above.7 The oracle score is the maximum of these scores over the gold-standard summaries. The resulting oracle scores are shown in the rightmost column of Table 1. While it is no surprise that the oracle score improves after the sentence graph is expanded, the large increase in the oracle score indicates the potential of sentence enhancement for generating high-quality summary sentences.

7 There is no guarantee that these dependency triples form a tree structure. Hence, this is an upper bound.

Page 8: Unsupervised Sentence Enhancement for Automatic Summarizationjcheung/papers/sent_exp_emnlp... · 2015-01-14 · Unsupervised Sentence Enhancement for Automatic Summarization Jackie

Grammaticality. There is still room for improvement in the grammaticality of the generated sentences, which will require modelling contexts larger than individual predicates and their arguments. Consider the following output of the sentence enhancement with event coreference system:

(3) The government has launched an investigation into Soeharto’s wealth by the Attorney General’s office on the wealth of former government officials.

This sentence suffers from coherence problems because two pieces of information are duplicated. The first is the subject of the investigation, which is expressed by two prepositional objects of investigation, with the prepositions into and on. The second, more subtle incoherence concerns the body that is responsible for the investigation, which is expressed both by the subject of launch (The government has launched an investigation) and by the by-prepositional object of investigation (an investigation ... by the Attorney General’s office). Clearly, a model that makes fewer independence assumptions about the relation between different edges in the sentence graph is needed.

5 A Study of Source-External Elements

The sentence enhancement algorithm presented above demonstrates that it is possible to use the entire source text to produce an informative sentence. Yet it is still limited by the particular predicates and dependency relations that are found in the source. The next step towards developing abstractive systems that exhibit human-like behaviour is to try to incorporate elements into the summary that are not found in the source text at all.

Despite its apparent difficulty, there is reason to be hopeful for text-to-text generation techniques even in such a scenario. In particular, we showed in earlier work that almost all of the caseframes, or pairs of governors and relations, in human-written summaries can be found in the source text or in a small set of additional related articles that belong to the same domain as the source text (e.g., natural disasters) (Cheung and Penn, 2013). What that study lacks, however, is a detailed analysis of the factors surrounding why human summary writers use non-source-text elements in their summaries, and how these may be automatically identified in the in-domain text. In this section, we supply such an analysis and provide evidence that human summary writers actually do incorporate elements external to the source text for a reason, namely, that these elements are more specific to the semantic content that they wish to convey. We also identify a number of features that may be useful for automatically determining the appropriateness of these in-domain elements in a summary.

5.1 Method

We performed our analysis on the predicates present in text, such as kill and computer. We also analyzed predicate-relation pairs (PR pairs) such as (kill, nsubj) or (computer, amod). This choice is similar to the caseframes used by Cheung and Penn (2013), and we similarly apply transformations to normalize for grammatical voice and other syntactic alternations, but we consider PR pairs of all relation types, unlike caseframes, which only consider verb complements and prepositional objects. PR pairs are extracted from the preprocessed corpus. We use the TAC 2010 Guided Summarization data set for our analyses, which we organize into two sub-studies. In the provenance study, we divide the PR pairs in human-written summaries according to whether they are found in the source text (source-internal) or not (source-external). In the domain study, we divide in-domain but source-external predicate-relation pairs according to whether they are used in a human-written summary (gold-standard) or not (non-gold-standard).

5.2 Provenance Study

In the first study, we compare the characteristics of gold-standard predicates and PR pairs according to their provenance; that is, are they found in the source text itself? The question that we try to answer is why human summarizers need to look beyond the source text at all when writing their summaries. We will provide evidence that they do so because they can find predicates that are more appropriate to the content that is being expressed, according to two quantitative measures.

Predicate provenance. Source-external PR pairs may be external to the source text for two reasons. Either the predicate (i.e., the actual word) is found in the source text, but the dependency relation (i.e., the semantic predication that holds between the predicate and its arguments) is not found with that particular predicate, or the predicate itself may be external to the source text altogether.

Page 9: Unsupervised Sentence Enhancement for Automatic Summarizationjcheung/papers/sent_exp_emnlp... · 2015-01-14 · Unsupervised Sentence Enhancement for Automatic Summarization Jackie

                  Average freq (millions)
Source-internal   1.77 (1.57, 2.08)
Source-external   1.15 (0.99, 1.50)

(a) The average predicate frequency of source-internal vs. source-external gold-standard predicates in an external corpus.

                  Arg entropy
Source-internal   7.94 (7.90, 7.97)
Source-external   7.42 (7.37, 7.48)

(b) The average argument entropy of source-internal vs. source-external PR pairs in bits.

Table 2: Results of the provenance study. 95% confidence intervals are estimated by the bootstrap method and indicated in parentheses.

If the former is true, then a generalized version of the sentence enhancement algorithm presented in this paper could in principle capture these PR pairs. We thus compute the proportion of source-external PR pairs whose predicate already exists in the source text.

We find that 2413 of the 4745 source-external PR pairs, or 51%, have a predicate that can be found in the source text. This indicates that an extension of the sentence enhancement with event coreference approach presented in this paper could already capture a substantial portion of the source-external PR pairs in its hypothesis space.

Predicate frequency. What factors, then, can account for the remaining predicates that are not found in the source text at all? The first such factor we identify is the frequency of the predicates. Here, we take frequency to be the number of occurrences of the predicate in an external corpus, namely the Annotated Gigaword, which gives us a proxy for the specificity or informativeness of a word. In this comparison, we take the set of predicates in human-written summaries, divide them according to whether they are found in the source text or not, and then look up their frequency of appearance in the Annotated Gigaword corpus.

As Table 2a shows, the predicates that are not found in the source text consist of significantly less frequent words on average (Wilcoxon rank-sum test, p < 10^-17). This suggests that human summary writers are motivated to use source-external predicates because they are able to find a more informative or apposite predicate than the ones that are available in the source text.

Entropy of argument distribution. Another measure of the informativeness or appropriateness of a predicate is to examine the range of arguments that it tends to take. A more generic word would be expected to take a wider range of arguments, whereas a more particular word would take a narrower range of arguments, for example those of a specific entity type. We formalize this notion by measuring the entropy of the distribution of arguments that a predicate-relation pair takes, as observed in Annotated Gigaword. Given frequency statistics f(h, r, a) of predicate head h taking an argument word a in relation r, we define the argument distribution of predicate-relation pair (h, r) as:

P(a|h, r) = f(h, r, a) / ∑_{a′} f(h, r, a′)    (4)

We then compute the entropy of P(a|h, r) for the gold-standard predicate-relation pairs, and compare the average argument entropies of the source-internal and the source-external subsets.
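Computed from corpus triples, the entropy of Equation 4 might be sketched as follows (the triple list format is our assumption):

```python
import math
from collections import Counter

def argument_entropy(triples, h, r):
    # Entropy (in bits) of the argument distribution P(a | h, r) of
    # Equation 4, estimated from (head, relation, argument) counts.
    counts = Counter(a for (h2, r2, a) in triples if (h2, r2) == (h, r))
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```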

Table 2b shows the result of this comparison. Source-external PR pairs exhibit a lower average argument entropy, taking a narrower range of possible arguments. Together, these two findings indicate that human summary writers look beyond the source text not just for the sake of diversity or to avoid copying the source text; they do so because they can find predicates that more specifically convey some desired semantic content.

5.3 Domain study

The second study examines how to distinguish those source-external predicates and PR pairs in in-domain articles that are used in a summary from those that are not. For this study, we rely on the topic category divisions in the TAC 2010 data set, and define the in-domain text to be the documents that belong to the same topic category as the target document cluster (but not including the target document cluster itself). This study demonstrates the importance of better semantic understanding for developing a text-to-text generation system that uses in-domain text, and identifies potentially useful features for training such a system.

Nearest neighbour similarity. In the event-coreference step of the sentence enhancement algorithm, we relied on distributional semantics to measure the similarity of arguments.

Page 10: Unsupervised Sentence Enhancement for Automatic Summarizationjcheung/papers/sent_exp_emnlp... · 2015-01-14 · Unsupervised Sentence Enhancement for Automatic Summarization Jackie

          N      NN sim
GS        2202   0.493 (0.486, 0.501)
Non-GS    789K   0.443 (0.442, 0.443)

(a) Average similarity of gold-standard (GS) and non-gold-standard (non-GS) PR pairs to the nearest neighbour in the source text.

          N      Freq. (millions)    Fecundity
GS        1568   2.44 (2.05, 2.94)   21.6 (20.8, 22.5)
Non-GS    268K   0.85 (0.83, 0.87)   6.43 (6.41, 6.47)

(b) Average frequency and fecundity of GS and non-GS predicates in an external corpus. The differences are statistically significant (p < 10^-10).

Table 3: Results of the domain study. 95% confidence intervals are given in parentheses.

Here, we examine how well distributional similarity determines the appropriateness of a source-external PR pair in a summary. Specifically, we measure its similarity to the nearest PR pair in the source text. To determine the similarity between two PR pairs, we compute the cosine similarity between their vector representations. The vector representation of a PR pair is the concatenation of a context vector for the predicate itself and a selectional preferences vector for the PR pair; that is, the vector of counts with elements f(h, r, a) for fixed h and r. These vectors are trained from the Annotated Gigaword corpus.

The average nearest-neighbour similarities of PR pairs are shown in Table 3a. While the difference between the gold-standard and non-gold-standard PR pairs is indeed statistically significant, the magnitude of the difference is not large. This illustrates the challenge of mining source-external text for abstractive summarization, and demonstrates the need for a more structured or detailed semantic representation in order to determine the PR pairs that would be appropriate. In other words, the kind of simple event coreference method based solely on distributional semantics that we used in Section 3.3.1 is unlikely to be sufficient when moving beyond the source text.
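A minimal sketch of the nearest-neighbour computation just described, assuming the context and selectional-preference vectors have already been extracted over a fixed, shared vocabulary (an assumption on our part):

```python
import numpy as np

def pr_vector(pred_context_vec, sel_pref_vec):
    # Concatenate the predicate's context vector with the PR pair's
    # selectional-preference vector (counts f(h, r, a) for fixed h and r).
    return np.concatenate([pred_context_vec, sel_pref_vec])

def nearest_neighbour_similarity(query, source_prs):
    # Cosine similarity of a source-external PR pair to its nearest
    # neighbour among the source-text PR pairs.
    def cos(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        return float(u @ v) / (nu * nv) if nu and nv else 0.0
    return max(cos(query, v) for v in source_prs)
```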

Frequency and fecundity. We also explore several features that would be relevant to identifying predicates in in-domain text that are used in the automatic summary. This is a difficult problem, as less than 0.6% of such predicates are actually used in the source text. As a first step, we consider several simple measures of the frequency and characteristics of the predicates.

The first measure that we compute is the average predicate frequency of the gold-standard and non-gold-standard predicates in an external corpus, as in Section 5.2. A second, related measure is the number of possible relations that may occur with a given predicate. We call this measure the fecundity of a predicate. Both of these are computed with respect to the external Annotated Gigaword corpus, as before.
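Both measures reduce to simple counts over the background corpus; a sketch (the triple format is assumed):

```python
from collections import Counter, defaultdict

def frequency_and_fecundity(triples):
    # triples: (head, relation, argument) tuples from the external corpus.
    # Frequency: number of occurrences of each predicate head.
    # Fecundity: number of distinct relation types observed with it.
    freq = Counter(h for (h, _, _) in triples)
    rels = defaultdict(set)
    for h, r, _ in triples:
        rels[h].add(r)
    return freq, {h: len(rs) for h, rs in rels.items()}
```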

As shown in Table 3b, there is a dramatic difference in both measures between gold-standard and non-gold-standard predicates in in-domain articles. Gold-standard predicates tend to be more common words compared to non-gold-standard ones. This result is not in conflict with the result in the provenance study that source-external predicates are less common words. Rather, it is a reminder that the background frequencies of the predicates matter, and must be considered together with the semantic appropriateness of the candidate word.

6 Conclusions

This paper introduced sentence enhancement as a method to incorporate information from multiple points in the source text into one output sentence in a fashion that is more flexible than previous sentence fusion algorithms. Our results show that sentence enhancement improves the content and grammaticality of summary sentences compared to previous syntax-based sentence fusion approaches. Then, we presented studies on the components of human-written summaries that are external to the source text. Our analyses suggest that human summary writers look beyond the source text to find predicates and relations that more precisely express some target semantic content, and that more sophisticated semantic techniques are needed in order to exploit in-domain articles for text-to-text generation in summarization.

Acknowledgments

We would like to thank the anonymous reviewers for valuable suggestions. The first author was supported by a Facebook PhD Fellowship during the completion of this research.


References

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328.

Cosmin A. Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422. Association for Computational Linguistics.

Jackie Chi Kit Cheung and Gerald Penn. 2013. Towards robust abstractive multi-document summarization: A caseframe analysis of centrality and domain. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1233–1242, August.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research (JAIR), 31:399–429.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 137–144, Manchester, UK, August. Coling 2008 Organizing Committee.

Micha Elsner and Deepak Santhanam. 2011. Learning to fuse disparate sentences. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, pages 54–63. Association for Computational Linguistics.

Katja Filippova and Michael Strube. 2008. Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 177–185, Honolulu, Hawaii, October. Association for Computational Linguistics.

Katja Filippova. 2010. Multi-sentence compression: Finding shortest paths in word graphs. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 322–330, Beijing, China, August. Coling 2010 Organizing Committee.

Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 180–187.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2012. Monolingual distributional similarity for text-to-text generation. In Proceedings of *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, pages 256–264, Montreal, Canada, June. Association for Computational Linguistics.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 178–185.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization—step one: Sentence compression. In Proceedings of the National Conference on Artificial Intelligence, pages 703–710.

Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, pages 495–501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Erwin Marsi and Emiel Krahmer. 2005. Explorations in sentence fusion. In Proceedings of the European Workshop on Natural Language Generation, pages 109–117.

Ryan T. McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Courtney Napoles, Benjamin Van Durme, and Chris Callison-Burch. 2011. Evaluating sentence compression: Pitfalls and suggested remedies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, pages 91–97. Association for Computational Linguistics.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the NAACL-HLT Joint Workshop on Automatic Knowledge Base Construction & Web-scale Knowledge Extraction (AKBC-WEKEX), pages 95–100.

Karolina Owczarzak and Hoa T. Dang. 2010. TAC 2010 guided summarization task guidelines.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.

Horacio Saggion and Guy Lapalme. 2002. Generating indicative-informative summaries with SumUM. Computational Linguistics, 28(4):497–526.

Kapil Thadani and Kathleen McKeown. 2013. Supervised sentence fusion with single-stage inference. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1410–1418, Nagoya, Japan, October. Asian Federation of Natural Language Processing.

Stephen Tratz and Eduard Hovy. 2008. Summarization evaluation using transformed Basic Elements. In Proceedings of the First Text Analysis Conference (TAC).


Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2009. Judging grammaticality: Experiments in sentence classification. CALICO Journal, 26(3):474–490.

Stephen Wan, Robert Dale, Mark Dras, and Cecile Paris. 2008. Seed and grow: Augmenting statistically generated summary sentences using schematic word patterns. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 543–552, Honolulu, Hawaii, October. Association for Computational Linguistics.