Summary Generation
Keith Trnka
The approach
● Apply Marcu's basic summarizer (1999) to perform content selection
● Re-generate the selected content so that it's more natural
RST Refresher
● A text is composed of elementary discourse units (EDUs)
– What constitutes an EDU varies from author to author
– There is common consensus that they are no larger than sentences
● Text spans
– An EDU is a text span
– A sequence of adjacent text spans in some rhetorical relation is a text span
RST Refresher (cont'd)
● A rhetorical relation is the relationship between text spans
– Some relations have the notion of nuclearity: one sub-span (the nucleus) is the one to which all other sub-spans (satellites) relate
● These relations are called mononuclear
● Example: [When I got home,] circumstance-for [I was tired]
– Other relations are called multinuclear
● There is no most-important sub-span
● Example: [Cats scratch] contrast-with [, but dogs bite.]
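The span/EDU distinction above can be sketched as a small data structure. This is an illustrative encoding only, not the treebank's format; all field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    relation: Optional[str] = None       # e.g. "circumstance", "contrast"
    text: Optional[str] = None           # set only for EDUs (the leaves)
    nuclei: List["Span"] = field(default_factory=list)
    satellites: List["Span"] = field(default_factory=list)

    def is_edu(self) -> bool:
        # An EDU is itself a text span, with no sub-spans
        return self.text is not None

# Mononuclear: one nucleus, to which the satellite relates
mono = Span(relation="circumstance",
            nuclei=[Span(text="I was tired")],
            satellites=[Span(text="When I got home,")])

# Multinuclear: no most-important sub-span, so both are nuclei
multi = Span(relation="contrast",
             nuclei=[Span(text="Cats scratch"),
                     Span(text=", but dogs bite.")])
```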
RST Discourse Treebank
● RST analyses of 385 WSJ articles from Penn Treebank
● Available from the LDC (http://www.ldc.upenn.edu)
● An overview can be found in (Carlson et al. 2001)
● The annotation manual is (Carlson and Marcu 2001)
● Thanks to the department for buying it
● Notes about the annotation
– EDUs are clause-like
– Mononuclear relations were forced to be binary
– Relative clauses and appositives can be embedded relations
RST Discourse Treebank (cont'd)
● Statistical analysis of 335 training documents
– 98% of spans are binary (two children)
– For binary mononuclear relations, nucleus-satellite order can be predicted with 87% accuracy, given the relation, using predict-majority

Relation                        Frequency   N-S Order   S-N Order
Elaboration-additional          20.44%      99.79%       0.17%
Attribution                     17.19%      32.34%      67.42%
Elaboration-object-attribute-e  16.13%      99.96%       0.04%
Elaboration-additional-e         5.22%      99.06%       0.94%
Circumstance                     3.95%      55.26%      44.56%
Explanation-argumentative        3.61%      96.88%       2.34%
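As a sketch of how predict-majority works on a table like this: for each relation, always guess its more frequent nucleus-satellite order. The numbers below are illustrative, loosely echoing the rows above:

```python
# Observed order distributions per relation (illustrative values)
orders = {
    "elaboration-additional": {"N-S": 0.9979, "S-N": 0.0017},
    "attribution":            {"N-S": 0.3234, "S-N": 0.6742},
}

def predict_order(relation: str) -> str:
    """Predict-majority: return the more frequent order for a relation."""
    dist = orders[relation]
    return max(dist, key=dist.get)

assert predict_order("attribution") == "S-N"
```

Averaging each relation's majority share, weighted by relation frequency, is what yields the 87% accuracy figure cited above.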
Marcu's Content Selection Algorithm
● Described in (Marcu 1999)
● Promotion sets
– The promotion set of an EDU is the EDU itself
– The promotion set of a larger span is the union of the promotion sets of its nuclear sub-spans
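The recursive definition of promotion sets can be sketched as follows, assuming a simple dict encoding of the tree (a leaf is `{"edu": id}`, an internal span lists its children and the indices of the nuclear ones). The encoding is an illustration, not Marcu's:

```python
def promotion_set(span):
    """Union of the promotion sets of a span's nuclear sub-spans;
    an EDU's promotion set is the EDU itself."""
    if "edu" in span:
        return {span["edu"]}
    return set().union(*(promotion_set(span["children"][i])
                         for i in span["nuclei"]))

# A span whose first child is the nucleus: only EDU 1 is promoted
tree = {"children": [{"edu": 1}, {"edu": 2}], "nuclei": [0]}
assert promotion_set(tree) == {1}
```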
Marcu's Content Selection Algorithm (cont'd)
● Build a partial ordering of EDUs*
– For each EDU, find the topmost span whose promotion set contains it. Let d be the tree depth of this span.
– The rank of each EDU is
● d + 1, if the EDU is in an embedded relation
● d, otherwise
– Example of the partial ordering
*re-worded from Marcu's description
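The ranking can be sketched roughly like this, under the same assumed dict encoding of the tree (`{"edu": id}` leaves, `"children"`/`"nuclei"` internal spans); the `embedded` flag on a leaf is an illustrative stand-in for membership in an embedded relation:

```python
def promotion_set(span):
    """An EDU promotes itself; a span promotes its nuclear sub-spans."""
    if "edu" in span:
        return {span["edu"]}
    return set().union(*(promotion_set(span["children"][i])
                         for i in span["nuclei"]))

def rank_edus(span, depth=0, ranks=None):
    """Walk top-down: the topmost (shallowest) span whose promotion set
    contains an EDU fixes that EDU's rank at the span's depth d,
    plus 1 if the EDU is in an embedded relation."""
    if ranks is None:
        ranks = {}
    for edu in promotion_set(span):
        ranks.setdefault(edu, depth)          # topmost span wins
    if "edu" in span and span.get("embedded"):
        ranks[span["edu"]] += 1               # embedded: d + 1
    for child in span.get("children", []):
        rank_edus(child, depth + 1, ranks)
    return ranks

# EDU 1 is promoted to the root (rank 0); EDU 2 to depth 1; EDU 3 stays at 2
tree = {"children": [{"edu": 1},
                     {"children": [{"edu": 2}, {"edu": 3}], "nuclei": [0]}],
        "nuclei": [0]}
```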
Marcu's Content Selection Algorithm (cont'd)
● Given a summary length requirement
– Select the topmost EDU groups until it isn't possible to select more and still honor the length requirement
– Effect: the summary can't always be generated as close to the desired length as possible
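A minimal sketch of this group-at-a-time selection, which also shows why summaries can fall short of the target length: a whole rank group either fits or is skipped. The function names and the length model are assumptions:

```python
def select(ranked_groups, lengths, budget):
    """ranked_groups: lists of EDU ids, best rank first.
    lengths: length of each EDU. budget: summary length requirement."""
    chosen, used = [], 0
    for group in ranked_groups:
        cost = sum(lengths[e] for e in group)
        if used + cost > budget:
            break          # can't take a partial group, so stop short
        chosen.extend(group)
        used += cost
    return chosen

# Group [2, 3] would overshoot the budget, so only [1] is selected
assert select([[1], [2, 3]], {1: 5, 2: 4, 3: 4}, budget=10) == [1]
```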
Generation desiderata
● Removal of problems
– Dangling references
– Dangling discourse markers
● Introduction of coherence
– Generate smaller referring expressions
– Generate discourse markers when appropriate
Example
Claude Bebear, chairman and chief executive officer, of Axa-Midi Assurances, pledged to retain employees and management of Farmers Group Inc.. Mr. Bebear made his remarks at a breakfast meeting with reporters here yesterday as part of a tour. Farmers was quick yesterday to point out the many negative aspects. For one, Axa plans to do away with certain tax credits.
The theoretical approach
● Content selection– Marcu's summarization algorithm
● Paragraph generation– Organize sentences into paragraphs
● Sentence generation– Construct complete sentences from EDUs
The theoretical approach (cont'd)
● Discourse marker generation
– Remove discourse markers that refer to removed text spans
– Generate discourse markers when none exists and one is appropriate
● Referring expression generation
– Generate the best unambiguous referring expressions
● Shorter is better
● Faster to interpret is better
The implemented approach
● Content selection– Marcu's algorithm as stated
● Paragraph generation– Not implemented
Implementation: Sentence “generation”
● If a selected group of EDUs is an entire text span
– Select them all as-is, uppercase the first character, and make sure the result ends with punctuation
● If a selected group of EDUs is an entire text span, except for some embedded relations
– Remove punctuation associated with embeddings, add sentence terminators from embeddings
● If a selected group of EDUs is a sentence
– Select as-is
● If a selected EDU isn't part of such a group
– Uppercase the first character and end with punctuation
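The clean-up applied to an isolated EDU might be sketched like this; the exact punctuation handling is a guess, not the talk's implementation:

```python
def finish_sentence(edu_text: str) -> str:
    """Capitalize the first character and ensure sentence-final
    punctuation, dropping a trailing comma or semicolon if present."""
    text = edu_text.strip()
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text = text.rstrip(",;") + "."
    return text

assert finish_sentence("when I got home,") == "When I got home."
```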
Implementation: Discourse marker generation
● Train to see which discourse markers go with which relations
● In generation, select discourse markers with a probability > 80%
Training on discourse markers
● Discourse markers identified by string matching at the beginning and end of each EDU
● List of markers taken from (Knott 1994)
Training on discourse markers (cont'd)
● Three statistics trained on binary, atomic spans with zero or one markers
– Inclusion: P(include a marker | relation)
– Usage: P(marker = m | include, relation)
– Position: P(position ∈ {start, end} | marker, include, N-S order)
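A toy sketch of the inclusion and usage statistics as simple maximum-likelihood counts, combined with the >80% generation rule from the previous slide. The training data, the placement of the threshold on the inclusion probability, and all names are illustrative assumptions; the position statistic is omitted:

```python
from collections import Counter

# Toy training data: one (relation, marker-or-None) pair per span
pairs = [("circumstance", "when"), ("circumstance", "when"),
         ("circumstance", None), ("contrast", "but")]

rel_counts = Counter(rel for rel, _ in pairs)
marked = Counter(rel for rel, m in pairs if m is not None)
marker_counts = Counter((rel, m) for rel, m in pairs if m is not None)

def p_include(rel):
    """P(include a marker | relation)"""
    return marked[rel] / rel_counts[rel]

def p_marker(m, rel):
    """P(marker = m | include, relation)"""
    return marker_counts[(rel, m)] / marked[rel]

def choose_marker(rel, threshold=0.8):
    """Emit the most likely marker only when confidence clears 80%."""
    if p_include(rel) <= threshold:
        return None
    candidates = {m: p_marker(m, rel)
                  for (r, m) in marker_counts if r == rel}
    return max(candidates, key=candidates.get)
```

Under this toy data, "contrast" always carries a marker, so one is generated; "circumstance" carries one only two times in three, which falls below the threshold.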
Rough evaluation
● Sentence “generation” isn't much different from not changing it at all
– Except for embedded relation removal
● Out of 347 summaries, a discourse marker was only generated once
– Ms. Johnson is awed by the earthquake's destructive force. "It really brings you down to a human level," Though "It's hard to accept all the suffering but you have to.
Desired approach: Content selection
● Marcu's algorithm can only select groups of EDUs
– Sometimes produces overly short summaries, or nothing at all
– If a preferential ordering could be defined within equivalence classes, summaries could meet the desired length better
● EDUs tied to more salient EDUs have their score boosted
Desired approach: Paragraph generation
● Paragraphs in the source document are marked
– Leave paragraph boundaries intact if they form large enough paragraphs
– A shallow method, but it has potential
● Correlate paragraph boundaries with something
– RS-tree structure
– Co-reference chain beginnings/endings
– Topical text segments, by an extension of Hearst's text segmentation algorithm (Hearst 1994)
Desired approach: Sentence generation
● Apply shallow parsing to understand the rough syntactic structure of an EDU
● Relative clauses can be attached and full sentences generated as in (Siddharthan 2004)
Desired approach: Discourse marker generation
● The probabilities computed in DM training aren't the best
– Need to attach discourse markers, recompute, and repeat until stable
– The attachment algorithm involves a constraint-satisfaction problem
● DM attachment is needed to perform DM removal
● A DM generator should understand syntax better
– When should commas be included, and where?
Desired approach: Referring expression generation
● Requires good co-reference resolution
– A reference resolver requires (at least) a base noun phrase chunker
– EDUs might be used in conjunction with a shallow parse to approximate Hobbs' naïve approach
● Mitkov (2002) describes Hobbs' naïve approach
● The generation algorithm only adds the creation of a list of referring expressions, ordered by preference
Conclusions
● Document length is poorly defined
– Quite a bit of variation between EDU length, word length, and character length
● Attaching discourse markers to the relation they realize is tough
● Representing natural language in programs can be tough
● Summarization of quotations requires special treatment
References
● Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski (2001). Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001.
● Lynn Carlson and Daniel Marcu (2001). Discourse Tagging Manual. ISI Tech Report ISI-TR-545, July 2001.
● Marti Hearst (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
● Alistair Knott and Robert Dale (1994). Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes 18(1): 35-62.
● William Mann and Sandra Thompson (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.
References (cont'd)
● Daniel Marcu (1999). Discourse trees are good indicators of importance in text. In I. Mani and M. Maybury, editors, Advances in Automatic Text Summarization, pages 123-136, The MIT Press.
– I think this is a cleanup of his earlier work from 1997.
● Ruslan Mitkov (2002). Anaphora Resolution. Pearson Education.
● Advaith Siddharthan (2004). Syntactic Simplification and Text Cohesion. To appear in the Journal of Language and Computation, Kluwer Academic Publishers, the Netherlands.