Automatic Summarization of News using WordNet Concept Graphs
Laura Plaza, Alberto Díaz, Pablo Gervás
Departamento de Ingeniería del Software e Inteligencia Artificial, Universidad Complutense de Madrid
IADIS Informatics 2009
Introduction
• Digital newspapers have grown explosively
• Text summarization can help tackle the resulting information overload
• Increasingly rich linguistic and ontological resources can benefit NLP systems
• We present an ontology-based method for automatic summarization of news items
  • It maps the text to concepts in WordNet and represents the document and its sentences as concept graphs
Outline: INTRODUCTION · RELATED WORK · WORDNET · SUMMARIZATION METHOD · EXPERIMENTAL RESULTS · CONCLUSIONS & FUTURE WORK
Related Work
• Text summarization: extractive methods
  • Segments of text (usually sentences) that contain the most relevant information are extracted
  • Classic methods compute the score of a sentence as a linear combination of weights derived from a set of features: positional, linguistic, statistical
  • Graph-based methods generally represent sentences by their tf*idf vectors and compute sentence connectivity using cosine similarity
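The tf*idf-plus-cosine representation mentioned above can be sketched as follows. This is a minimal illustration of the standard technique, not the authors' method; the toy sentences and the smoothed idf form are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build sparse tf*idf vectors for a list of tokenized sentences."""
    n = len(sentences)
    df = Counter(term for s in sentences for term in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        # idf smoothed with +1 so terms occurring in every sentence keep a small weight
        vectors.append({t: tf[t] * math.log(n / df[t] + 1) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Sentence connectivity is then the cosine between the vectors of each sentence pair; note that, as the next slide argues, this measure sees only surface terms, not their meanings.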
Related Work
• Most approaches do not capture the semantic relations between terms (synonymy, hypernymy, homonymy, co-occurrence, …)
1. “Hurricanes are useful to the climate machine. Their primary role is to transport heat from the lower to the upper atmosphere,'' he said.
2. He explained that cyclones are part of the atmospheric circulation mechanism, as they move heat from the superior to the inferior atmosphere.
→ Both sentences convey the same meaning, which a concept graph-based approach can capture
Related Work
• Graph-based document clustering for extractive multi-document summarization (CSUGAR, Yoo et al., 2007)
  • The entire collection of documents is represented as an extended graph of MeSH concepts
  • This graph is clustered using an algorithm based on Scale-Free Network Theory, which generates, for each cluster, a subgraph and a set of centroid vertices (the Hub Vertices Set, HVS)
  • Each document is assigned to a cluster using a vote mechanism based on whether the vertices in the document belong to the HVS or not
  • All documents assigned to the same cluster contribute to a single summary
WordNet
• An electronic lexical database developed at Princeton University (Miller et al., 1993)
• Words with the same meaning are grouped into a synset
• Each synset has a gloss that defines it
• Synsets are connected via semantic relations:
  • Hypernymy and Hyponymy: feline is a hypernym of cat; cat is a hyponym of feline
  • Holonymy and Meronymy: vehicle is a holonym of wheel; wheel is a meronym of vehicle
  • Troponymy: to lisp is a troponym of to talk
  • Entailment: to sleep is entailed by to snore
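Each relation pair above is an inverse pair, so a lexical database only needs to store one direction. A minimal sketch with a hand-built toy lexicon (the two dicts below are illustrative stand-ins, not real WordNet data):

```python
# Toy lexicon: store only hypernymy and holonymy; derive the inverses.
hypernym_of = {"cat": "feline", "feline": "carnivore"}
holonym_of = {"wheel": "vehicle"}

def hyponyms(synset):
    """Hyponymy is the inverse of hypernymy."""
    return [s for s, h in hypernym_of.items() if h == synset]

def meronyms(synset):
    """Meronymy is the inverse of holonymy."""
    return [s for s, h in holonym_of.items() if h == synset]
```

The real WordNet stores on the order of a hundred thousand synsets connected by exactly these relation types.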
Summarization Method
• An extractive single-document method that selects sentences based on their similarity to concept clusters
• Steps:
  • Preprocessing
  • Graph-based Document Representation
  • Concept Clustering and Theme Recognition
  • Sentence Selection
• A worked example document from the DUC 2002 collection illustrates the method
Preprocessing
• Extracting the body of the news items
• Splitting the text into sentences using GATE (http://gate.ac.uk)
• Removing generic words and very frequent terms
  • They are not useful for discriminating between relevant and non-relevant sentences
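The filtering step can be sketched as follows. The stopword list and the document-frequency cutoff below are illustrative assumptions; the slides do not specify which lists or thresholds were used.

```python
from collections import Counter

STOPWORDS = {"the", "a", "and", "to", "for", "its", "of"}  # illustrative subset

def filter_terms(sentences, max_df=0.8):
    """Drop generic words (stopwords) and terms appearing in more than
    max_df of the sentences; such terms do not help discriminate between
    relevant and non-relevant sentences."""
    n = len(sentences)
    df = Counter(t for s in sentences for t in set(s))
    return [[t for t in s
             if t not in STOPWORDS and df[t] / n <= max_df]
            for s in sentences]
```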
Graph-based Document Representation
• Representing the news item body as a graph
  • The vertices are the WordNet concepts associated with the terms
  • The edges indicate the relations between them
• Steps:
  1. Mapping between terms and concepts
  2. Graph-based sentence representation
  3. Graph-based document representation
Graph-based Document Representation
1. Mapping between terms and concepts
WordNet::SenseRelate is used for word sense disambiguation.

Example sentence: "Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas."

Concept         WN Sense   Concept    WN Sense   Concept   WN Sense
hurricane       1          defense    9          prepare   4
Gilbert         2          alert      1          high      2
sweep           1          heavily    2          wind      1
Dominican Rep   1          populate   2          heavy     1
sunday          1          south      1          rain      1
civil           1          coast      1          sea       1
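WordNet::SenseRelate picks, for each term, the sense most related to its context. A gloss-overlap (Lesk-style) toy of the same idea; the glosses below are paraphrased for illustration, not real WordNet glosses:

```python
def gloss_overlap_sense(word, context, glosses):
    """Pick the sense whose gloss shares the most words with the context.
    `glosses` maps (word, sense_number) -> gloss tokens."""
    candidates = [(s, g) for (w, s), g in glosses.items() if w == word]
    return max(candidates, key=lambda sg: len(set(sg[1]) & set(context)))[0]

# Made-up glosses for two senses of "coast"
glosses = {
    ("coast", 1): ["the", "shore", "of", "a", "sea", "or", "ocean"],
    ("coast", 2): ["the", "act", "of", "moving", "without", "effort"],
}
```

For the example sentence, the geographic sense of "coast" wins because its gloss overlaps with context words such as "shore".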
Graph-based Document Representation
2. Graph-based sentence representation
  • WordNet concepts derived from common nouns are extended with their hypernyms
[Figure: hypernym hierarchy for the example sentence, rooted at entity (physical entity / abstract entity); e.g. hurricane → cyclone → windstorm → atmospheric phenomenon → physical phenomenon → natural phenomenon → phenomenon → process; coast → shore → geological formation; sea → body of water → thing; sunday → rest day → day of the week → calendar day; defense → organization → social group → group → abstraction]
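The extension step walks from each concept up its hypernym chain to the root. A sketch over a toy taxonomy transcribed from the figure (not a live WordNet query):

```python
# Toy fragment of the WordNet noun hierarchy, following the figure.
hypernym = {
    "hurricane": "cyclone",
    "cyclone": "windstorm",
    "windstorm": "atmospheric phenomenon",
    "atmospheric phenomenon": "physical phenomenon",
    "physical phenomenon": "natural phenomenon",
    "natural phenomenon": "phenomenon",
    "phenomenon": "process",
    "process": "entity",
}

def extend_with_hypernyms(concept):
    """Return the concept together with its full hypernym chain."""
    chain = [concept]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain
```

The sentence graph then contains not just "hurricane" but the whole chain, which is what lets "hurricane" and "cyclone" (from the earlier example pair) share vertices.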
Graph-based Document Representation
3. Graph-based Document Representation
  • The sentence graphs are merged into a document graph
  • The graph is enriched with similarity relations between leaf concepts (computed with the WordNet Similarity package) when similarity > threshold
  • Each edge is assigned a weight directly proportional to the depth of the concepts it connects (Yoo et al., 2007)
[Figure: fragment of the resulting document graph (location / territory / country and coast / sea branches), with similarity edges between leaf concepts and edge weights increasing with depth: 1/2, 2/3, 3/4, 4/5, 5/6, 6/7]
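The weights shown in the figure (1/2, 2/3, 3/4, …) are consistent with giving an edge between depth d and depth d+1 the weight d/(d+1); whether this is exactly the Yoo et al. formula is an assumption here, but it captures the stated idea that deeper (more specific) links weigh more.

```python
def edge_weight(depth):
    """Weight of an edge linking a concept at `depth` to its child at
    depth + 1. Assumed form d/(d+1), matching the 1/2, 2/3, 3/4, ...
    weights shown in the figure; approaches 1 as concepts get deeper."""
    return depth / (depth + 1)
```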
Concept Clustering and Theme Recognition
• The document graph is clustered into concept subgraphs, using a clustering algorithm similar to that of CSUGAR (Yoo et al., 2007)
  • A subgraph and an HVS are obtained for each cluster
• Each cluster is composed of a set of concepts that are closely related in meaning
  • It can be seen as a theme in the document
Sentence Selection
• Significant sentences are selected for the summary, based on the similarity between sentences and clusters
• The similarity between a cluster Ci and a sentence Oj is measured with a vote mechanism based on whether the vertices of the sentence graph belong to the HVS or not:
similarity(Ci, Oj) = Σ_{vk ∈ Oj} w_{k,j}

where

w_{k,j} = 1.0 if vk ∈ HVS(Ci)
w_{k,j} = 0.5 if vk ∈ Ci and vk ∉ HVS(Ci)
w_{k,j} = 0   otherwise
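The vote mechanism can be sketched as follows, with clusters and HVS represented as plain sets:

```python
def similarity(cluster, hvs, sentence_concepts):
    """Vote mechanism: each concept of the sentence votes 1.0 if it is a
    hub vertex of the cluster, 0.5 if it belongs to the cluster but not
    to its HVS, and 0 otherwise."""
    score = 0.0
    for v in sentence_concepts:
        if v in hvs:
            score += 1.0
        elif v in cluster:
            score += 0.5
    return score
```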
Sentence Selection
• Three heuristics are tested:
  • Heuristic 1: for each cluster, the top-scoring sentences (in a number proportional to the cluster's size) are selected
  • Heuristic 2: all sentences are selected from the cluster with the most concepts
  • Heuristic 3: a single score is computed for each sentence, as the sum of its similarity to each cluster adjusted by the cluster sizes, and the sentences with the highest scores are selected
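Heuristic 3 can be sketched as follows. The slide does not give the exact size adjustment, so weighting each cluster's vote by its share of the document's concepts is an assumption:

```python
def vote(cluster, hvs, sentence):
    # 1.0 for a hub vertex, 0.5 for a non-hub cluster concept, 0 otherwise
    return sum(1.0 if v in hvs else 0.5 if v in cluster else 0.0
               for v in sentence)

def heuristic3_scores(clusters, sentences):
    """clusters: list of (concepts, hvs) set pairs. Each sentence scores
    the sum over clusters of its vote similarity, weighted by the
    cluster's share of all concepts (assumed form of the adjustment)."""
    total = sum(len(c) for c, _ in clusters)
    return [sum(vote(c, h, s) * len(c) / total for c, h in clusters)
            for s in sentences]
```

Sentences are then ranked by score and taken until the compression rate is reached.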
Experimental Results
• Two separate groups of experiments:
  • Parameter setting
  • Comparison with a positional lead baseline
• Evaluation collection:
  • 10 news items from the DUC 2002 collection
  • Each item comes with at least one model abstract written by a human
• Compression rate of 30%
• Evaluation metric:
  • ROUGE recall measures: R-1, R-2, R-L, R-W-1.2
Experimental Results
• Algorithm parametrization:
  • Parameter 1: Which heuristic produces the best summaries?
  • Parameter 2: What percentage of vertices should be considered hub vertices (HV) by the clustering method?
  • Parameter 3: Is it better to consider both types of relations between concepts (hypernymy and similarity), or just hypernymy?
  • Parameter 4: If the similarity relation is taken into account, what similarity threshold should be used?
Experimental Results
• Parameters 1 & 2: Heuristic and hub vertices percentage
             Average R-1   Average R-2   Average R-L   Average R-W-1.2
Heuristic 1
  1%         0.69324       0.33723       0.65202       0.24847
  2%         0.71474       0.34520       0.69115       0.26225
  5%         0.67814       0.28308       0.63836       0.23646
  10%        0.67201       0.27093       0.62717       0.22283
Heuristic 2
  1%         0.71446       0.33185       0.67367       0.25384
  2%         0.72487       0.34438       0.68040       0.25810
  5%         0.70358       0.30756       0.65924       0.24488
  10%        0.72449       0.32887       0.67966       0.25547
Heuristic 3
  1%         0.72056       0.34105       0.68058       0.25727
  2%         0.72755       0.34438       0.68308       0.25886
  5%         0.70560       0.31164       0.66377       0.24612
  10%        0.71273       0.31980       0.66888       0.25162
Experimental Results
• Parameter 3: Semantic relations
• Parameter 4: Similarity threshold
Parameter 3:
                           Average R-1   Average R-2   Average R-L   Average R-W-1.2
Heuristic 1  Hypernymy     0.71470       0.32736       0.67565       0.25250
             Hyp. & Sim.   0.72736       0.34230       0.68558       0.25681
Heuristic 2  Hypernymy     0.72487       0.34438       0.68040       0.25810
             Hyp. & Sim.   0.72920       0.33949       0.68463       0.25664
Heuristic 3  Hypernymy     0.72755       0.34438       0.68308       0.25886
             Hyp. & Sim.   0.73118       0.32941       0.67838       0.25323

Parameter 4:
Threshold    Average R-1   Average R-2   Average R-L   Average R-W-1.2
0.01         0.71145       0.31381       0.66314       0.24872
0.05         0.71470       0.32736       0.67565       0.25250
0.1          0.71953       0.32573       0.67578       0.25477
0.2          0.73118       0.32941       0.67838       0.25323
0.5          0.71058       0.31786       0.66690       0.24892
Experimental Results
• Best parametrization:
  • Heuristic: undetermined
  • Hub vertices percentage: 2%
  • Relations: hypernymy and similarity
  • Similarity threshold: 0.2
• Comparison with a positional lead baseline:

                     Average R-1   Average R-2   Average R-L   Average R-W-1.2
Best configuration   0.73118       0.32941       0.67838       0.25323
Lead baseline        0.59436       0.18826       0.55522       0.20488
Conclusions
• Representing the text using ontology concepts allows a richer representation than a vector space model
• The proposed method can be applied to documents from different domains
  • It was previously tested on biomedical scientific literature
• Preliminary evaluation:
  • There are no significant differences between the heuristics
  • Results are clearly better than the baseline
Future Work
• Long sentences have a higher probability of being selected
  • Solution: normalizing the sentence scores over the number of concepts
• Performing a large-scale evaluation on the DUC 2002 corpus (ROUGE)
• Comparing with similar systems (e.g. LexRank)
• Extending the method to multi-document summarization
Thank you!
Clustering Algorithm
1. We compute the salience of all vertices in the document graph:

   salience(vi) = Σ_{ej : ej connects (vk, vi)} weight(ej)

2. The n vertices with the highest salience (hub vertices) are iteratively grouped into Hub Vertex Sets (centroids)
3. The remaining vertices (non-HVS) are assigned to the cluster to which they are most connected
4. This process repeats iteratively, progressively adjusting both the HVS and the remaining vertices
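Steps 1 and 2 can be sketched over a weighted adjacency-list graph; the tiny example graph is illustrative only:

```python
def salience(graph, v):
    """salience(v) = sum of the weights of all edges incident to v.
    `graph` maps each vertex to its {neighbor: edge_weight} dict."""
    return sum(graph[v].values())

def hub_vertices(graph, n):
    """The n most salient vertices become the hub vertices (step 2)."""
    return sorted(graph, key=lambda v: salience(graph, v), reverse=True)[:n]
```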
ROUGE
• Recall-Oriented Understudy for Gisting Evaluation
• Compares automatic summaries with model (ideal) summaries created by humans
• Different recall metrics:
  • ROUGE-N (N = 1…4): number of n-grams co-occurring in a candidate summary and the reference summaries
  • ROUGE-L (Longest Common Subsequence): the length of the LCS is used to estimate similarity between the candidate and the reference summaries
  • ROUGE-W-1.2 (Weighted Longest Common Subsequence): improves ROUGE-L by taking into account the presence of consecutive matches
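ROUGE-N recall is the fraction of the reference's n-grams that also appear in the candidate, with counts clipped so a candidate cannot be rewarded for repeating an n-gram more often than the reference contains it. A minimal sketch for a single reference:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap between a candidate and one reference,
    divided by the number of n-grams in the reference."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n])
                       for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    return overlap / sum(ref.values())
```

The official toolkit additionally handles multiple references, stemming, and stopword options; this sketch shows only the core count.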