Vinayak Gagrani Neeraj Toshniwal Abhishek Kabra
Guide
Pushpak Bhattacharya
Document Summarization
2
Introduction
Single Document Summarization
Multiple Document Summarization
Application
Evaluation
Conclusion
Outline
3
What is a Summary?
A text produced from one or more texts
It conveys the important information in the original texts
and is no longer than half of the original texts.
Three important aspects of a summary:
Summaries should be short
Summaries should preserve important information
Summaries may be produced from single or multiple documents
Introduction
4
Extraction: procedure of identifying important sections of the text and reproducing them verbatim
Abstraction: aims to produce the important material in a new way
Fusion: combines extracted parts coherently
Compression: aims at throwing out unimportant sections of the text
Common terms in summarization dialect
5
Early Works
Machine Learning Methods
Naïve-Bayes Methods
Rich Features and Decision Trees
Deep Natural Language Analysis Methods
Lexical Chaining
Rhetorical Structure Theory (RST)
Single Document Summarization
6
Luhn, 1958
Summarization based on measuring the significance of words from their frequency
A significance factor is derived for each sentence, based on the number of significant words in that sentence
Edmundson, 1969
Word frequency and positional importance were incorporated
Presence of cue words and the skeleton of the document were also incorporated
Early Works
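A minimal sketch of Luhn-style frequency scoring. The stopword list, the tokenizer, the min_count threshold and the simplified significance factor (significant-word count squared, divided by sentence length) are assumptions for illustration; Luhn's original method brackets clusters of significant words within a sentence.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "on"}

def tokenize(text):
    # Assumed, simplistic tokenizer: split on whitespace, strip punctuation, lowercase.
    return [w.strip(".,;:!?").lower() for w in text.split() if w.strip(".,;:!?")]

def significant_words(sentences, min_count=2):
    # A word counts as significant if it is not a stopword and occurs
    # at least min_count times in the whole document.
    counts = Counter(w for s in sentences for w in tokenize(s) if w not in STOPWORDS)
    return {w for w, c in counts.items() if c >= min_count}

def significance_factor(sentence, significant):
    # Simplified factor: (number of significant words)^2 / sentence length.
    words = tokenize(sentence)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in significant)
    return hits * hits / len(words)

def luhn_rank(sentences, min_count=2):
    # Sentences ordered from most to least significant.
    sig = significant_words(sentences, min_count)
    return sorted(sentences, key=lambda s: significance_factor(s, sig), reverse=True)
```

Taking the top few sentences of luhn_rank and restoring their original order would give an extractive summary.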
7
Classifier based on applying Bayes' theorem with a strong independence assumption
s: a particular sentence
S: the set of sentences that make up the summary
F1, ..., Fk: the features
Assuming independence of the features:
$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid s \in S) \cdot P(s \in S)}{\prod_{i=1}^{k} P(F_i)}$
Evaluation is done by analyzing how well the output matches the human-extracted document summary
Naïve Bayes Method
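A small sketch of the Naive Bayes sentence score, P(s in S | F1..Fk) = prod P(Fi | s in S) * P(s in S) / prod P(Fi). Binary features, add-one smoothing and the function names are assumptions of this sketch. Because of the independence assumption the value is best read as a ranking score, not a calibrated probability.

```python
def train(feature_rows, labels):
    # feature_rows: one tuple of binary feature values (0/1) per sentence.
    # labels: True if the sentence appears in the human-made summary.
    n = len(feature_rows)
    k = len(feature_rows[0])
    n_pos = sum(labels)
    prior = n_pos / n  # P(s in S)
    # Add-one smoothing (an assumption of this sketch) avoids zero probabilities.
    p_f = [(sum(r[i] for r in feature_rows) + 1) / (n + 2) for i in range(k)]
    p_f_given_s = [
        (sum(r[i] for r, y in zip(feature_rows, labels) if y) + 1) / (n_pos + 2)
        for i in range(k)
    ]
    return prior, p_f, p_f_given_s

def score(row, model):
    # prod_i P(F_i | s in S) * P(s in S) / prod_i P(F_i),
    # using the probability of the observed value of each feature.
    prior, p_f, p_f_given_s = model
    num, den = prior, 1.0
    for i, v in enumerate(row):
        num *= p_f_given_s[i] if v else 1 - p_f_given_s[i]
        den *= p_f[i] if v else 1 - p_f[i]
    return num / den
```

Sentences are then ranked by score and the top ones are extracted.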
8
Term frequency-inverse document frequency (TF-IDF)
Increases proportionally to the number of times a word appears in the document
Offset by the frequency of the word in the corpus
Takes into account that certain words are more common than others, e.g. "the", "is", etc.
$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
$|D|$: total number of documents in the corpus
$|\{d \in D : t \in d\}|$: number of documents where the term t appears, i.e. $\mathrm{tf}(t, d) \neq 0$
Naïve Bayes Method
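A short sketch of TF-IDF with raw counts as term frequency and idf(t, D) = log(|D| / number of documents containing t). Returning 0 for terms that appear in no document is an assumed convention, not part of the slide.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Raw count of the term in one document.
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    # corpus: list of token lists. df counts documents where tf(t, d) != 0.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumed convention for terms unseen in the corpus
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)
```

A word such as "the" that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the sentence score.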
9
Weighing sentences based on their position
Arises from the idea that texts generally follow a predictable discourse structure
Sentence position yield was calculated against the topic keywords
Sentence positions were then ranked by average yield to produce the Optimal Position Policy for topic positions for the genre
Later, the sentence extraction problem was modeled using decision trees
The assumption that features are independent was dropped
Rich Features and Decision Trees
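A rough sketch of ranking sentence positions by average yield. The yield definition used here (fraction of a document's topic keywords that a sentence covers) and the function names are assumptions for illustration; the slide does not give the exact formula.

```python
from collections import defaultdict

def position_yield(sentence_tokens, keywords):
    # Assumed yield: fraction of the topic keywords covered by the sentence.
    if not keywords:
        return 0.0
    return len(set(sentence_tokens) & set(keywords)) / len(keywords)

def optimal_position_policy(documents, keywords_per_doc):
    # documents: list of documents, each a list of tokenized sentences.
    # Returns sentence positions (0-based) ordered by average yield over the corpus.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc, keywords in zip(documents, keywords_per_doc):
        for pos, sentence in enumerate(doc):
            totals[pos] += position_yield(sentence, keywords)
            counts[pos] += 1
    average = {pos: totals[pos] / counts[pos] for pos in totals}
    return sorted(average, key=lambda pos: average[pos], reverse=True)
```

For a genre where lead sentences carry the topic, position 0 comes first in the returned policy.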
10
Techniques aimed at modeling the text's discourse structure
Use of heuristics to create document extracts
Lexical Chaining
Independent of the grammatical structure of the text
A list of words that captures a portion of the cohesive structure of the text
A sequence of related words in the text, spanning short or long distances
A technique used to identify the central theme of a document
Deep Natural Language Analysis Methods
11
Ellipsis
Words are omitted when the phrase would otherwise need to be repeated
Example:
A: Where are you going?
B: To town.
Substitution
A word is not omitted but replaced by another
Example:
A: Which ice-cream would you like?
B: I would like the pink one.
Forms of Cohesion
12
Conjunction
Relationship between two clauses
A few of them are: "and", "then", "however", etc.
Repetition
Mentioning the same word again
Reference
Anaphoric reference
Refers to someone or something that has been previously identified
Cataphoric reference
Forward referencing. Example: Here he comes... It's Brad Pitt
Forms of Cohesion
13
Example: John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.
Steps involved in lexical chaining:
a) Selecting a set of candidate words.
b) For each candidate word, finding an appropriate chain, relying on a relatedness criterion among members of the chain.
c) If one is found, inserting the word in the chain and updating it accordingly.
Lexical chaining
14
Relatedness measure: WordNet distance.
Weights are assigned to chains based on their length and homogeneity
The strength of a lexical chain is determined by taking into consideration the distribution of its elements throughout the text
The strength corresponds to the significance of the textual context the chain embodies.
Provides a basis for identifying the topical units in a document, which are of great importance in document summarization.
Lexical Chaining
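A toy sketch of the three chaining steps and of a length-and-homogeneity weight. The RELATED table stands in for a WordNet-based relatedness measure and is purely hypothetical. The score (length times homogeneity, with homogeneity = 1 - distinct words / length) is one common formulation and is assumed here, not taken from the slides.

```python
# Hypothetical relatedness table standing in for a WordNet distance threshold.
RELATED = {frozenset({"pie", "dessert"}), frozenset({"pie", "chocolate"})}

def related(a, b):
    return a == b or frozenset({a, b}) in RELATED

def build_chains(candidates, related_fn=related):
    # Steps (b) and (c): put each candidate word into the first chain that
    # has a related member; otherwise start a new chain.
    chains = []
    for word in candidates:
        for chain in chains:
            if any(related_fn(word, member) for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains

def chain_score(chain):
    # Assumed weight: length * homogeneity.
    length = len(chain)
    homogeneity = 1 - len(set(chain)) / length
    return length * homogeneity

# Step (a): candidate words (here, the nouns of the mud pie example).
candidates = ["john", "pie", "dessert", "pie", "chocolate", "john"]
chains = build_chains(candidates)
```

On the mud pie example this yields one chain about John and one about the pie, dessert and chocolate.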
15
A relation holds between two non-overlapping text spans: the nucleus and the satellite
The nucleus expresses what is more essential to the writer's purpose than the satellite
Example: a claim followed by evidence for the claim. RST posits an "Evidence" relation between the two spans.
The claim is more essential to the text than the particular evidence
The claim span is a nucleus and the evidence span is a satellite
The nucleus is independent of the satellite, but not vice versa
Rhetorical Structure Theory(RST)
16
Rhetorical Structure Theory(RST)
17
Need and Encouragement
Extraction of a single summary from multiple documents started in the mid-1990s
Most of the applications are in news articles
Google News (news.google.com)
Columbia Newsblaster (newsblaster.cs.columbia.edu)
News in Essence (NewsInEssence.com)
Multiple sources of information, which are:
supplementary to each other
overlapping in content
even contradictory at times
Multiple Document Summarization
18
Extended template-driven message understanding systems
Abstractive systems rely heavily on internal NLP tools
Earlier seen as requiring knowledge of language interpretation and generation
Extractive techniques have been applied:
Similarity measures between sentences
Identify common themes through clustering
Select one sentence to represent each cluster
Generate a composite sentence from each cluster
Summarization systems differ in what the final goal is
MEAD: works based on extraction techniques on general domains
SUMMONS: builds a briefing highlighting differences and updates across news reports
Early Work
19
SUMMONS is the first example of multi-document summarization
Considers events in a narrow domain: news articles about terrorism
It produces a briefing that merges relevant information about each event and its evolution over time
It reads a database built by a template-based message understanding system
Concatenation of two systems : Content Planner and Linguistic Generator
Abstractions and Information Fusion
20
Content Planner: selects information to include in the summary through a combination of input templates
It uses summary operators, a set of heuristics that perform operations like change of perspective, contradiction, refinement
Linguistic Generator: selects the right words to express the information in grammatical and coherent text.
Uses connective phrases to synthesize summary, adapting language generation tools like FUF/SURGE
SUMMONS - processing the text (Content Planner)
21
Themes - sets of similar text units (paragraphs) - a clustering problem
Text is mapped to a vector of features, including single words weighted by their TF-IDF scores, nouns, pronouns, and semantic classes of verbs
For each pair of paragraphs, a vector is computed which represents matches on the different features.
Decision rules learnt from data classify each pair as similar or dissimilar. An algorithm then places the most related paragraphs in the same theme
Information Fusion - decides which sentences of the theme should be included in the final summary.
Theme based approach - McKeown et al., Barzilay et al.
22
Algorithm - compares and intersects the predicate-argument structures of the phrases within each theme to find which are repeated often enough to be included in the summary
Sentences are parsed using Collins' statistical parser and converted into dependency trees, which capture predicate-argument structure and identify functional roles.
The comparison algorithm traverses the trees recursively, adding identical nodes to the output tree.
Once full phrases are found, they are marked to be included in the summary.
Once the summary content is decided, a grammatical text is generated using the FUF/SURGE language generation system.
Information Fusion
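A simplified sketch of the recursive tree comparison. Trees are represented as (label, children) tuples and nodes match only on identical labels; both choices are assumptions of this sketch, and no parser is involved. The second tree is a hypothetical paraphrase invented for illustration.

```python
def intersect(tree_a, tree_b):
    # Returns the common subtree rooted at the two roots, or None if the roots differ.
    label_a, kids_a = tree_a
    label_b, kids_b = tree_b
    if label_a != label_b:
        return None
    common = []
    used = set()  # indices of children of tree_b that are already matched
    for kid_a in kids_a:
        for j, kid_b in enumerate(kids_b):
            if j in used:
                continue
            sub = intersect(kid_a, kid_b)
            if sub is not None:
                common.append(sub)
                used.add(j)
                break
    return (label_a, common)

# Dependency-like tree for "McVeigh, 27, was charged with the bombing".
tree_1 = ("charged", [("McVeigh", [("27", [])]), ("bombing", [])])
# Hypothetical tree of a paraphrase from the same theme.
tree_2 = ("charged", [("McVeigh", []), ("bombing", [("federal", [])])])

shared = intersect(tree_1, tree_2)
```

The shared tree keeps only the nodes present in both sentences, which is the material a generator would then verbalize.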
23
Decision Tree
“McVeigh, 27,was charged with the bombing”
24
MMR (Maximal Marginal Relevance), introduced by Carbonell and Goldstein
Rewards relevant sentences and penalizes redundant ones by considering a linear combination of two similarity measures.
Q: query or user profile; R: ranked list of documents; S: already selected documents.
Documents are selected one at a time and added to S. For each document Di in R\S:
$MR(D_i) = a \cdot Sim_1(D_i, Q) - (1 - a) \cdot \max_{D_j \in S} Sim_2(D_i, D_j)$, where a lies in [0, 1]
The document with the maximum MR(Di) is selected, until a maximum number of documents or a threshold is reached.
a controls the relative importance of relevance versus redundancy. Sim1 and Sim2 are similarity measures (e.g. cosine similarity).
Topic-Driven Summarization
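A compact sketch of greedy MMR selection, MR(Di) = a * Sim1(Di, Q) - (1 - a) * max over Dj in S of Sim2(Di, Dj). It assumes bag-of-words cosine similarity for both Sim1 and Sim2 and a fixed number of items to select; the function names are illustrative.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    # Cosine similarity between two bags of words.
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query, docs, a=0.7, max_items=2):
    # Returns the indices of the selected documents, in selection order.
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < max_items:
        def mr(i):
            relevance = cosine(docs[i], query)
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return a * relevance - (1 - a) * redundancy
        best = max(remaining, key=mr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a close to 1 the selection follows relevance alone; lowering a pushes near-duplicates of already selected items down the ranking.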
25
Content is represented as entities and relations, which form the nodes and edges of a graph.
Rather than extracting sentences, they detect salient regions of the graph.
Topic driven: the topic is denoted by entry nodes in the graph.
Graph: each node is a single occurrence of a word.
Different kinds of links: Adjacency links, Same links, Alpha links, Phrase links, Name links and Coref links
Graph Spreading Activation
26
Topic nodes are identified through stem comparison and marked as entry nodes.
Spreading activation: the search for semantically related text is propagated from the entry nodes to other nodes of the graph.
The weight of a neighboring node depends on the links traveled and is an exponentially decaying function of the distance.
Pair of document graphs: identify common nodes and difference nodes. Highlight sentences having higher common and difference scores.
User is able to specify the maximal number to control the output.
Graph Spreading Activation
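A minimal sketch of spreading activation from entry nodes with exponentially decaying weights. It ignores link types (all links are treated alike) and uses weight = decay ** distance; both are simplifying assumptions, as are the decay and max_depth parameters.

```python
from collections import deque

def spread_activation(graph, entry_nodes, decay=0.5, max_depth=3):
    # graph: dict mapping a node to a list of its neighbors.
    # Entry nodes get weight 1.0; a node at distance d gets decay ** d.
    weights = {node: 1.0 for node in entry_nodes}
    queue = deque((node, 0) for node in entry_nodes)
    while queue:
        node, dist = queue.popleft()
        if dist == max_depth:
            continue
        for neighbor in graph.get(node, ()):
            # Breadth-first order guarantees the first visit is the shortest path.
            if neighbor not in weights:
                weights[neighbor] = decay ** (dist + 1)
                queue.append((neighbor, dist + 1))
    return weights
```

Sentences can then be scored by summing the weights of the nodes they contain.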
27
It does not use any language generation module. Easily scalable and domain-independent
Topic Detection - Group together news articles that describe the same event.
An agglomerative clustering algorithm is used. It operates on TF-IDF vector representations, successively adding documents to clusters and recomputing the centroids according to
$c_j = \frac{\sum_{d \in C_j} \tilde{d}}{|C_j|}$
where cj is the centroid of the j-th cluster, Cj is the set of documents that belong to that cluster, and $\tilde{d}$ is the TF-IDF vector of document d
Centroids can thus be considered as pseudo-documents that include those words whose TF-IDF scores are above a threshold in their cluster.
Centroid-based Summarization
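A short sketch of computing a cluster centroid as the mean of its documents' TF-IDF vectors and pruning it to a pseudo-document. Vectors are sparse dicts from term to weight; the threshold value and function names are assumptions.

```python
def centroid(doc_vectors):
    # Mean of the TF-IDF vectors of the documents in one cluster.
    total = {}
    for vec in doc_vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(doc_vectors)
    return {term: weight / n for term, weight in total.items()}

def prune(centroid_vec, threshold):
    # Keep only the words whose centroid score is above the threshold.
    return {t: w for t, w in centroid_vec.items() if w > threshold}
```

When a document joins a cluster, its vector is appended to the cluster's list and centroid is called again.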
28
Second Stage - Identify sentences that are central to the topic of the entire cluster.
Two metrics similar to MMR (but not query dependent) are defined by Radev et al., 2000:
Cluster-based relative utility (CBRU) - how relevant a particular sentence is to the general topic of the cluster
Cross-sentence informational subsumption (CSIS) - a measure of redundancy among sentences
Given a cluster segmented into n sentences and a compression rate R, we select nR sentences in order of appearance in the chronologically arranged documents
The final score for each sentence is the sum of three scores, minus a redundancy penalty (Rs) for a sentence that overlaps a highly ranked sentence:
Centroid value (Ci) - sum of the centroid values of all the words in the sentence
Positional value (Pi) - makes leading sentences more important
First-sentence overlap (Fi) - inner product of the word occurrence vector of sentence i and that of the first sentence of the document
Centroid-based Summarization
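A sketch of the combined sentence score Ci + Pi + Fi - Rs. Equal weights for the three scores, the positional formula (n - i) / n with 0-based i, and redundancy penalties passed in as a precomputed list are all assumptions of this sketch.

```python
from collections import Counter

def centroid_value(tokens, centroid_vec):
    # C_i: sum of the centroid values of all the words in the sentence.
    return sum(centroid_vec.get(t, 0.0) for t in tokens)

def positional_value(index, n):
    # P_i: assumed linear decay so that leading sentences score higher.
    return (n - index) / n

def first_sentence_overlap(tokens, first_tokens):
    # F_i: inner product of the word occurrence vectors.
    a, b = Counter(tokens), Counter(first_tokens)
    return sum(a[t] * b[t] for t in a)

def score_sentences(doc, centroid_vec, redundancy=None):
    # doc: list of tokenized sentences; redundancy: optional per-sentence penalties R_s.
    n = len(doc)
    redundancy = redundancy or [0.0] * n
    return [
        centroid_value(s, centroid_vec)
        + positional_value(i, n)
        + first_sentence_overlap(s, doc[0])
        - redundancy[i]
        for i, s in enumerate(doc)
    ]
```

The nR highest-scoring sentences are then emitted in their original order.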
29
Google News:
News aggregator, selecting the most up-to-date (within the past 30 days) information from thousands of publications by an automatic aggregation algorithm
Different versions available for more than 60 regions in 28 languages
Ultimate Research Assistant:
Performs text mining on Internet search results
Makes it easier for the user to perform online research by organizing the output
Type the name of a topic and it will search the web for highly relevant resources and organize the search results
Application
30
Shablast
Universal search engine
Produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords.
iResearch Reporter
Commercial text extraction and text summarization system
Produces categorized, easily readable natural language summary reports covering multiple documents retrieved by entering a user query in the Google search engine
Application
31
Application
32
A difficult task
Absence of a standard human or automatic evaluation metric
Makes it difficult to compare different systems and establish a baseline
Manual evaluation is not feasible
Need for an evaluation metric having high correlation with human scores
Human and automatic evaluation:
Comparison of automatically generated summaries with manually written "ideal" summaries
Decomposition of text into sentences
A rating between 1 and 4 is given to each system unit (SU) that shares content with a model unit (MU) from the ideal summaries
Evaluation
33
ROUGE
Based only on content overlap
Can determine if the same general concepts are discussed between an automatic summary and a reference summary
Cannot determine if the result is coherent or the sentences flow together in a sensible manner
Better in the case of single-document summarization
Information-theoretic Evaluation of Summaries
Central idea is to use a divergence measure between a pair of probability distributions
The first distribution is derived from the automatic summary
The second from a set of reference summaries
Suits both the single-document and multi-document summarization scenarios
Evaluation
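A small sketch of both evaluation ideas: an n-gram overlap recall in the spirit of ROUGE-N, and a divergence between the word distributions of an automatic summary and a reference. Using Jensen-Shannon divergence over unsmoothed unigram distributions is an assumed choice; the slide only says "a divergence measure". Both inputs are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    # Fraction of reference n-grams that also occur in the candidate (clipped counts).
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def js_divergence(p_tokens, q_tokens):
    # Jensen-Shannon divergence (base 2) between two unigram distributions.
    p, q = Counter(p_tokens), Counter(q_tokens)
    n_p, n_q = sum(p.values()), sum(q.values())
    div = 0.0
    for term in set(p) | set(q):
        pt, qt = p[term] / n_p, q[term] / n_q
        mean = (pt + qt) / 2
        if pt:
            div += 0.5 * pt * math.log2(pt / mean)
        if qt:
            div += 0.5 * qt * math.log2(qt / mean)
    return div
```

Higher recall and lower divergence both indicate a summary closer to the reference; neither says anything about coherence.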
34
Need to develop efficient and accurate summarization systems due to the enormous rate of information growth
A lot of research is still going on in this field, especially in evaluation techniques
Multi-document summarization is more in use than single-document summarization
Extractive techniques are usually employed rather than abstractive techniques, as they are easy to employ and have produced satisfactory results
Conclusion
35
A survey on Automatic Summarization - Dipanjan Das and Andre F.T. Martins (http://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf)
Wikipedia
Relevance of cluster size in MMR Based summarizer (http://www.cs.cmu.edu/~madhavi/publications/Ganapathiraju_11-742Report.pdf)
References