
Page 1:

Vinayak Gagrani Neeraj Toshniwal Abhishek Kabra

Guide

Pushpak Bhattacharya

Document Summarization

Page 2:

Introduction

Single Document Summarization

Multiple Document Summarization

Application

Evaluation

Conclusion

Outline

Page 3:

What is a Summary?

A text produced from one or more texts, conveying the important information of the originals, and no longer than half of the original texts.

Three important aspects of a summary:

Summaries should be short

Summaries should preserve important information

Summaries may be produced from single or multiple documents

Introduction

Page 4:

Extraction: the procedure of identifying important sections of the text and reproducing them verbatim

Abstraction: aims to produce the material in a new way

Fusion: combining extracted parts coherently

Compression: aims at throwing out unimportant sections of the text

Common terms in summarization parlance

Page 5:

Early Works

Machine Learning Methods

Naïve-Bayes Methods

Rich Features and Decision Trees

Deep Natural Language Analysis Methods

Lexical Chaining

Rhetorical Structure Theory (RST)

Single Document Summarization

Page 6:

Luhn, 1958

Summarization based on measuring the significance of words according to their frequency

A significance factor for each sentence is derived from the number of significant words in that sentence (a sketch follows below)

Edmundson, 1969

Word frequency and positional importance were incorporated

The presence of cue words and the skeleton of the document were also incorporated

Early Works
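
Below is a minimal Python sketch of Luhn-style scoring; it is an illustration, not Luhn's exact procedure. The stopword list, the cutoff fraction, and the exact significance-factor formula (significant-word count squared over sentence length) are simplifications chosen for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "it", "that"}

def luhn_scores(text, top_fraction=0.2):
    """Score each sentence by its density of 'significant' (frequent) words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    # Treat the most frequent content words as significant.
    cutoff = max(1, int(len(freq) * top_fraction))
    significant = {w for w, _ in freq.most_common(cutoff)}
    scored = []
    for sent in sentences:
        sent_words = re.findall(r"[a-z']+", sent.lower())
        if not sent_words:
            continue
        n_sig = sum(1 for w in sent_words if w in significant)
        # Significance factor: significant-word count squared over length.
        scored.append((n_sig ** 2 / len(sent_words), sent))
    return sorted(scored, reverse=True)

doc = ("Summarization selects important sentences. Important sentences "
       "contain frequent words. Frequent words signal the topic.")
for score, sent in luhn_scores(doc):
    print(f"{score:.2f}  {sent}")
```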

Page 7:

A classifier based on applying Bayes' theorem with a strong independence assumption

s: a particular sentence

S: the set of sentences that make up the summary

F1, ..., Fk: the features

Assuming independence of the features:

P(s ∈ S | F1, F2, ..., Fk) = P(s ∈ S) · ∏_{j=1..k} P(Fj | s ∈ S) / ∏_{j=1..k} P(Fj)

Evaluation is done by analyzing the match with a human-extracted document summary (a toy numeric illustration follows below)

Naïve Bayes Method
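
To make the formula concrete, here is a toy numeric illustration; the three boolean features and every probability below are invented stand-ins for values that a real system would estimate from a labelled corpus.

```python
# Toy illustration of P(s in S | F1..Fk) under the independence assumption.
# The features and all probabilities are invented for this example; a real
# system estimates them from documents paired with reference summaries.
P_S = 0.2  # prior P(s in S): fraction of sentences that end up in summaries

P_F_GIVEN_S = {"has_cue_word": 0.7, "in_first_para": 0.6, "long_sentence": 0.5}
P_F = {"has_cue_word": 0.5, "in_first_para": 0.4, "long_sentence": 0.5}

def summary_probability(active_features):
    """P(s in S | features) = P(s in S) * prod P(Fj|s in S) / prod P(Fj)."""
    numerator, denominator = P_S, 1.0
    for f in active_features:
        numerator *= P_F_GIVEN_S[f]
        denominator *= P_F[f]
    return numerator / denominator

# A sentence containing a cue word, located in the first paragraph:
print(summary_probability(["has_cue_word", "in_first_para"]))  # 0.42
```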

Page 8:

Term frequency-inverse document frequency (TF-IDF)

Increases proportionally with the number of times a word appears in the document, offset by the frequency of the word in the corpus

Takes into account that certain words are more common than others, e.g. "the", "is"

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

|D|: total number of documents in the corpus

|{d ∈ D : t ∈ d}|: number of documents where the term t appears, i.e. tf(t, d) ≠ 0

(a sketch follows below)

Naïve Bayes Method
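
A small self-contained sketch of the computation just defined, using normalized raw counts for tf (one of several common tf variants):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf of `term` in `doc`; `corpus` is a list of token lists."""
    tf = Counter(doc)[term] / len(doc)            # normalized term frequency
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)    # assumes term occurs somewhere
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(tf_idf("cat", corpus[0], corpus))  # informative word: positive score
print(tf_idf("the", corpus[0], corpus))  # appears everywhere: idf = 0
```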

Page 9:

Weighing sentences based on their position: arises from the idea that texts generally follow a predictable discourse structure

The yield of each sentence position was calculated against the topic keywords

Sentence positions were then ranked by average yield to produce an Optimal Position Policy for topic positions for the genre

Later, the sentence-extraction problem was modeled using decision trees, breaking away from the assumption that features are independent

Rich Features and Decision Trees

Page 10:

Techniques aimed at modeling the text's discourse structure

Use of heuristics to create document extracts

Lexical Chaining:

independent of the grammatical structure of the text

a list of words that captures a portion of the cohesive structure of the text

a sequence of related words in the text, spanning short or long distances

a technique used to identify the central theme of a document

Deep Natural Language Analysis Methods

Page 11:

Ellipsis: words are omitted when a phrase would otherwise need to be repeated

Example: A: Where are you going? B: To town.

Substitution: a word is not omitted but replaced by another

Example: A: Which ice-cream would you like? B: I would like the pink one.

Forms of Cohesion

Page 12:

Conjunction: a relationship between two clauses; a few of them are "and", "then", "however", etc.

Repetition: mentioning the same word again

Reference:

Anaphoric reference: refers to someone/something that has been previously identified

Cataphoric reference: forward referencing. Example: Here he comes... It's Brad Pitt.

Forms of Cohesion

Page 13:

Example: John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.

Steps involved in lexical chaining:

a) Selecting a set of candidate words

b) For each candidate word, finding an appropriate chain, relying on a relatedness criterion among members of the chain

c) If one is found, inserting the word in the chain and updating the chain accordingly

Lexical Chaining

Page 14:

Relatedness measure: WordNet distance

Weights are assigned to chains based on their length and homogeneity

The strength of a lexical chain is determined by taking into consideration the distribution of its elements throughout the text

The strength corresponds to the significance of the textual context the chain embodies

Provides a basis for identifying the topical units in a document, which are of great importance in document summarization (a chaining sketch follows below)

Lexical Chaining
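
Below is a greedy chaining sketch using NLTK's WordNet interface. The relatedness criterion (any pair of noun senses with path similarity above a threshold) and the threshold value are simple illustrative choices; real chainers add word-sense disambiguation and the length/homogeneity scoring described above.

```python
# Greedy lexical-chaining sketch. Requires: pip install nltk, then
# nltk.download("wordnet"). Relatedness here is WordNet path similarity
# over noun senses; the 0.2 threshold is an arbitrary illustrative choice.
from nltk.corpus import wordnet as wn

THRESHOLD = 0.2

def related(word_a, word_b):
    """True if any pair of noun senses of the two words is close in WordNet."""
    for sa in wn.synsets(word_a, pos=wn.NOUN):
        for sb in wn.synsets(word_b, pos=wn.NOUN):
            sim = sa.path_similarity(sb)
            if sim is not None and sim >= THRESHOLD:
                return True
    return False

def build_chains(candidate_words):
    chains = []                          # each chain is a list of words
    for word in candidate_words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)       # insert into the first related chain
                break
        else:
            chains.append([word])        # no related chain found: start one
    return chains

print(build_chains(["dessert", "pie", "chocolate", "parser", "grammar"]))
```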

Page 15:

Two non-overlapping text spans: the nucleus and the satellite

The nucleus expresses what is more essential to the writer's purpose than the satellite

Example: a claim followed by evidence for the claim; RST posits an "Evidence" relation between the two spans

The claim is more essential to the text than the particular evidence, so the claim span is a nucleus and the evidence span a satellite

The nucleus is independent of the satellite, but not vice versa

Rhetorical Structure Theory (RST)

Page 16:

Rhetorical Structure Theory (RST)

Page 17:

Need and encouragement

Extraction of a single summary from multiple documents started in the mid 1990s

Most of the applications are for news articles:

Google News (news.google.com)

Columbia Newsblaster (newsblaster.cs.columbia.edu)

News in Essence (NewsInEssence.com)

Multiple sources of information, which are: supplementary to each other, overlapping in content, even contradictory at times

Multiple Document Summarization

Page 18:

Extended template-driven message understanding systems: abstractive systems that rely heavily on internal NLP tools

Earlier considered to require knowledge of language interpretation and generation

Extractive techniques have been applied: similarity measures between sentences identify common themes through clustering; either select one sentence to represent each cluster or generate a composite sentence from each cluster

Summarization differs on what the final goal is:

MEAD: based on extraction techniques over general domains

SUMMONS: builds a briefing highlighting differences and updates across news reports

Early Work

Page 19:

SUMMONS is the first example of multi-document summarization

Considers events in a narrow domain: news articles about terrorism

It produces a briefing merging relevant information about events and their evolution over time

It reads a database built by a template-based message understanding system

Concatenation of two systems: a Content Planner and a Linguistic Generator

Abstractions and Information Fusion

Page 20:

Content Planner: selects the information to include in the summary through combination of the input templates

It uses summary operators: a set of heuristics that perform operations like change of perspective, contradiction, and refinement

Linguistic Generator: selects the right words to express the information in grammatical and coherent text

Uses connective phrases to synthesize the summary, adapting language-generation tools like FUF/SURGE

SUMMONS - processing the text (Content Planner)

Page 21:

Themes: sets of similar text units (paragraphs); finding them is a clustering problem

Text is mapped to a vector of features, including single words weighted by their TF-IDF scores, nouns, pronouns, and semantic classes of verbs

For each pair of paragraphs, a vector is computed which represents the matches on the different features

Decision rules learnt from data classify each pair as similar or dissimilar; an algorithm then places the most related paragraphs in the same theme

Information fusion: deciding which sentences of a theme should be included in the final summary

Theme based approach - McKeown et al., Barzilay et al.

Page 22:

The algorithm compares and intersects the predicate-argument structures of the phrases within each theme to find those repeated often enough to be included in the summary (a schematic sketch follows below)

Sentences are parsed using Collins' statistical parser and converted into a dependency tree, which captures the predicate-argument structure and identifies functional roles

The comparison algorithm traverses the trees recursively, adding identical nodes to an output tree

Once full phrases are found, they are marked for inclusion in the summary

Once the summary content is decided, a grammatical text is generated using the FUF/SURGE language-generation system

Information Fusion
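
A schematic sketch of the recursive intersection step, on a toy tree representation (label plus children tuples); the actual system operates on full dependency parses with functional roles.

```python
# Schematic intersection of two dependency-like trees. Each tree is a
# (label, [children]) tuple; the real system compares full parse trees.
def intersect(tree_a, tree_b):
    label_a, children_a = tree_a
    label_b, children_b = tree_b
    if label_a != label_b:
        return None                       # nodes differ: nothing shared here
    shared = []
    for child_a in children_a:
        for child_b in children_b:
            common = intersect(child_a, child_b)
            if common is not None:        # identical node: add to output tree
                shared.append(common)
                break
    return (label_a, shared)

t1 = ("charged", [("McVeigh", [("27", [])]), ("bombing", [])])
t2 = ("charged", [("McVeigh", []), ("bombing", [("the", [])])])
print(intersect(t1, t2))  # ('charged', [('McVeigh', []), ('bombing', [])])
```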

Page 23:

Dependency tree example:

"McVeigh, 27, was charged with the bombing"

Page 24:

MMR (Maximal Marginal Relevance), introduced by Carbonell and Goldstein

Rewards relevant sentences and penalizes redundant ones by considering a linear combination of two similarity measures

Q: query or user profile; R: ranked list of documents; S: already-selected documents

Documents are selected one at a time and added to S. For each document Di in R\S:

MR(Di) = a · Sim1(Di, Q) − (1 − a) · max_{Dj ∈ S} Sim2(Di, Dj), where a ∈ [0, 1]

The document with the maximum MR(Di) is selected, until the maximum number of documents or a threshold is reached

a controls the relative importance of relevance versus redundancy; Sim1 and Sim2 are similarity measures (e.g. cosine similarity)

(a runnable sketch follows below)

Topic-Driven Summarization
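
A runnable sketch of greedy MMR selection over bag-of-words vectors, with cosine similarity used for both Sim1 and Sim2 (as the slide suggests); tokenization and weighting are deliberately minimal.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr(query, docs, k, alpha=0.7):
    """Greedily pick k documents that are relevant to the query but novel
    with respect to the documents already selected."""
    vecs = [Counter(d.lower().split()) for d in docs]
    q = Counter(query.lower().split())
    selected = []
    while len(selected) < min(k, len(docs)):
        best, best_score = None, -float("inf")
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            redundancy = max((cosine(v, vecs[j]) for j in selected), default=0.0)
            score = alpha * cosine(v, q) - (1 - alpha) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [docs[i] for i in selected]

docs = ["summarization of documents", "document summarization methods",
        "cricket world cup", "summarization evaluation metrics"]
print(mmr("document summarization", docs, k=2))
```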

Page 25:

Content is denoted as entities and relations: the nodes and edges of a graph

Rather than extracting sentences, salient regions of the graph are detected

Topic-driven: the topic is denoted by entry nodes in the graph

Graph: each node is a single occurrence of a word

Different kinds of links: adjacency links, SAME links, ALPHA links, PHRASE links, NAME and COREF links

Graph Spreading Activation

Page 26:

Topic nodes are identified through stem comparison and marked as entry nodes

Spreading activation: the search for semantically related text is propagated from these entry nodes to the other nodes of the graph

The weight of a neighboring node depends on the links traveled and is an exponentially decaying function of the distance (a toy sketch follows below)

For a pair of document graphs: identify common nodes and difference nodes; highlight sentences with higher common and difference scores

The user can specify a maximal number of sentences to control the output

Graph Spreading Activation
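
A toy sketch of the propagation step: breadth-first spreading from the entry nodes, with a weight that decays exponentially in the number of links traveled. The graph, the (untyped) links, and the decay constant are invented for this example.

```python
# Toy spreading activation over a word graph: activation decays
# exponentially with the number of links traveled from the entry nodes.
from collections import deque

def spread(graph, entry_nodes, decay=0.5):
    weight = {n: 1.0 for n in entry_nodes}
    queue = deque((n, 0) for n in entry_nodes)
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            w = decay ** (dist + 1)
            if w > weight.get(neighbor, 0.0):   # keep the strongest activation
                weight[neighbor] = w
                queue.append((neighbor, dist + 1))
    return weight

# Hypothetical graph of word occurrences with links between them.
graph = {"bombing": ["McVeigh", "charged"], "McVeigh": ["charged"],
         "charged": ["court"], "court": []}
print(spread(graph, ["bombing"]))  # weights fall off with distance
```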

Page 27:

It does not use any language-generation module; easily scalable and domain-independent

Topic detection: group together news articles that describe the same event

An agglomerative clustering algorithm is used; it operates on TF-IDF vector representations, successively adding documents to clusters and recomputing the centroids according to

c_j = (1 / |C_j|) · Σ_{d ∈ C_j} d

where c_j is the centroid of the j-th cluster and C_j is the set of documents that belong to that cluster

Centroids can thus be considered pseudo-documents that include those words whose TF-IDF scores are above a threshold in their cluster (a sketch follows below)

Centroid-based Summarization
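
A sketch of the centroid computation just described, on TF-IDF vectors represented as plain dicts; the thresholding step turns the centroid into the pseudo-document mentioned above. The vectors and threshold are toy values.

```python
# Centroid of a cluster = average of its documents' TF-IDF vectors (dicts);
# terms scoring above the threshold form the centroid pseudo-document.
def centroid(cluster, threshold=0.4):
    totals = {}
    for vec in cluster:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    return {t: s / len(cluster) for t, s in totals.items()
            if s / len(cluster) > threshold}

docs = [{"quake": 2.0, "rescue": 1.0},   # toy TF-IDF vectors
        {"quake": 1.0, "damage": 1.0}]
print(centroid(docs))  # {'quake': 1.5, 'rescue': 0.5, 'damage': 0.5}
```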

Page 28:

Second stage: identify sentences that are central to the topic of the entire cluster

Two metrics similar to MMR (but not query-dependent) are defined by Radev et al., 2000:

Cluster-based relative utility (CBRU): how relevant a particular sentence is to the general topic of the cluster

Cross-sentence informational subsumption (CSIS): a measure of redundancy among sentences

Given a cluster segmented into n sentences and a compression rate R, we select nR sentences, in order of appearance in the chronologically arranged documents

The final score for each sentence is the sum of three scores, minus a redundancy penalty (Rs) for sentences that overlap highly ranked sentences (a sketch follows below):

Centroid value (Ci): the sum of the centroid values of all the words in the sentence

Positional value (Pi): makes leading sentences more important

First-sentence overlap (Fi): the inner product of the word-occurrence vector of sentence i and that of the first sentence of the document

Centroid-based Summarization
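
A sketch of the per-sentence score Ci + Pi + Fi; the per-component weights and the redundancy penalty Rs are omitted, and the positional value below is one simple decreasing function among the possibilities.

```python
from collections import Counter

def sentence_scores(sentences, centroid):
    """Score each sentence as Ci + Pi + Fi (component weights and the
    redundancy penalty Rs are left out of this sketch)."""
    first_vec = Counter(sentences[0].lower().split())
    n = len(sentences)
    scored = []
    for i, sent in enumerate(sentences):
        words = sent.lower().split()
        vec = Counter(words)
        c_i = sum(centroid.get(w, 0.0) for w in words)  # centroid value
        p_i = (n - i) / n                               # favors leading sentences
        f_i = sum(vec[w] * first_vec[w] for w in vec)   # overlap with sentence 1
        scored.append((c_i + p_i + f_i, sent))
    return sorted(scored, reverse=True)

sents = ["An earthquake hit the city.", "Rescue teams arrived.", "Markets fell."]
print(sentence_scores(sents, {"earthquake": 2.0, "rescue": 1.0}))
```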

Page 29:

Google News: a news aggregator selecting the most up-to-date (within the past 30 days) information from thousands of publications via an automatic aggregation algorithm

Different versions are available for more than 60 regions in 28 languages

Ultimate Research Assistant: performs text mining on Internet search results, making it easier for the user to do online research by organizing the output

Type the name of a topic and it will search the web for highly relevant resources and organize the search results

Application

Page 30:

Shablast: a universal search engine; produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords

iResearch Reporter: a commercial text-extraction and text-summarization system; produces categorized, easily readable natural-language summary reports covering multiple documents retrieved by entering the user's query into the Google search engine

Application

Page 31:

Application

Page 32:

A difficult task: the absence of a standard human or automatic evaluation metric makes it difficult to compare different systems and establish a baseline

Manual evaluation is not feasible

Need for an evaluation metric having high correlation with human scores

Human and automatic evaluation:

Comparison of automatically generated summaries with manually written "ideal" summaries

Decomposition of the text into sentences

A rating between 1 and 4 is given to each system unit (SU) that shares content with a model unit (MU) of the ideal summaries

Evaluation

Page 33:

ROUGE: based only on content overlap

Can determine whether the same general concepts are discussed in an automatic summary and a reference summary

Cannot determine whether the result is coherent or the sentences flow together in a sensible manner

Better suited to single-document summarization (a minimal sketch follows below)

Information-theoretic evaluation of summaries:

The central idea is to use a divergence measure between a pair of probability distributions

The first distribution is derived from the automatic summary, the second from a set of reference summaries

Suits both the single-document and multi-document summarization scenarios

Evaluation
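
A minimal ROUGE-1 recall sketch, just to make "content overlap" concrete; real ROUGE covers higher-order n-grams, stemming, multiple references, and more.

```python
from collections import Counter

def rouge1_recall(system_summary, reference_summary):
    """Fraction of reference unigrams covered by the system summary."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(count, sys_counts[w]) for w, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("the cat sat on the mat", "the cat lay on a mat"))  # ~0.67
```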

Page 34:

Need to develop efficient and accurate summarization systems due to the enormous rate of information growth

There is still a lot of research going on in this field, especially in evaluation techniques

Multi-document summarization is more widely used than single-document summarization

Extractive techniques are usually employed rather than abstractive techniques, as they are easier to implement and have produced satisfactory results

Conclusion

Page 35:

A Survey on Automatic Summarization, Dipanjan Das and Andre F.T. Martins (http://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf)

Wikipedia

Relevance of Cluster Size in MMR Based Summarizer (http://www.cs.cmu.edu/~madhavi/publications/Ganapathiraju_11-742Report.pdf)

References