A New Multi-document Summarization System

Yi Guo and George Stylios

Heriot-Watt University, Scotland, U.K. (DUC2003)

2/25

Abstract

A document understanding system, AMDS_hw, has been developed; it is based on the synergy of English grammar parsing with sentence clustering and reduction.

The system is capable of producing a summary of single or multiple documents, but the present study only focuses on multi-document summary.

After a thorough and objective evaluation on Task 2 (the non-question-related task), the system has been shown to perform better in Mean Coverage, Mean Length-Adjusted Coverage and Quality Question Score in comparison with other systems.

3/25

Introduction

This paper describes the structure and algorithms of a newly developed multi-document summarization system, which is a hybrid of a number of related techniques.

The system is designed to produce a generic summary of a set of multiple documents, rather than a summary biased towards some topic of special interest or purpose, whether or not the documents are closely related to each other.

4/25

System Structure & Algorithms

5/25

System Structure & Algorithms: Content Reconstruction

The module takes the original documents as input and uses basic text-processing techniques to divide them into paragraphs or segments and finally into sentences.

Each sentence-unit carries not only the original content of the sentence but also some important initial information, such as which document and which sentence it belongs to, and its position and relative position in the paragraph and document.

6/25

System Structure & Algorithms: Content Reconstruction

Each sentence-unit is composed of two parts, the content-section and the information-section.

The content-section saves the original content of the sentence.

The information-section stores other important information about the sentence.

The content-section of each sentence-unit is fixed, but the information-section is very flexible and extensible.
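A minimal Python sketch of a sentence-unit as described above, assuming a frozen dataclass for the fixed content-section and a plain dictionary for the open-ended information-section. The class name `SentenceUnit`, the field names and the dictionary keys are hypothetical and are not taken from AMDS_hw.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SentenceUnit:
    # Content-section: the original sentence text. frozen=True prevents
    # the attribute from being reassigned after creation.
    content: str
    # Information-section: an open dictionary, so later modules can keep
    # adding entries (hypothetical keys, chosen for illustration only).
    info: Dict[str, Any] = field(default_factory=dict)


unit = SentenceUnit(
    content="The storm hit the coast on Monday.",
    info={"document": "d01", "paragraph": 2, "position_in_paragraph": 1},
)

# A later stage can extend the information-section without touching the content.
unit.info["constituent_tree"] = "(S ...)"
```

The dictionary keeps the information-section open for the parser output and indices that later modules attach, while the content itself cannot be overwritten.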

7/25

System Structure & Algorithms: Syntactic Parsing

The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax [Sleator et al. 2000].

The Link Grammar Parser assigns a syntactic structure, which consists of a set of labeled links connecting pairs of words, and produces a postscript and a constituent tree for the content-section of each sentence-unit in the pool of sentence-units.

All these postscripts and constituent trees are added to the information-section of the corresponding sentence-units and managed in XML format files.

8/25

System Structure & Algorithms: Indices Extraction

From the results of syntactic parsing (the postscripts and constituent trees), indices are extracted for clustering sentences in the next step; they include subjects, time, spaces or locations, and actions.

The ‘Subjects’ and ‘Actions’ indices of a sentence cannot be empty, whether it is a simple or a complex sentence.

If a sentence has more than one subject or verb phrase, all the subjects are saved in the ‘Subjects’ index and the verb phrases are saved with their referential subjects in the ‘Actions’ index.
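The index record described above could be sketched as follows. This is an illustration only: the names `Action` and `SentenceIndices`, the field layout and the use of a string for the time index are assumptions, and the extraction from the parser output is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    verb_phrase: str
    subject: str  # the referential subject this verb phrase is saved with


@dataclass
class SentenceIndices:
    subjects: List[str]
    actions: List[Action]
    time: Optional[str] = None                           # may be absent
    locations: List[str] = field(default_factory=list)   # may be empty

    def __post_init__(self) -> None:
        # The 'Subjects' and 'Actions' indices may not be empty.
        if not self.subjects:
            raise ValueError("'Subjects' index must not be empty")
        if not self.actions:
            raise ValueError("'Actions' index must not be empty")


# A sentence with two subjects and two verb phrases.
indices = SentenceIndices(
    subjects=["the storm", "residents"],
    actions=[
        Action("hit the coast", subject="the storm"),
        Action("left their homes", subject="residents"),
    ],
    time="Monday",
    locations=["the coast"],
)
```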

9/25

System Structure & Algorithms: Clustering Sentences

Since modern English is a ‘subject-prominent’ language, the ‘Subject’ is used as the first index dimension.

The ‘Time’ dimension is appointed as the second one, as most events happen and develop in some temporal order.

The ‘Space/Location’ takes the third position among the index dimensions.

Finally, the ‘Actions’ is considered as the fourth, supplementary index.

10/25

System Structure & Algorithms: Clustering Sentences

After the index information for each sentence has been established and the index priorities have been set up, all sentences that have the same or closest ‘Subjects’ index are put in a cluster.

They are sorted according to their temporal sequence, from the earliest to the latest; then the sentences in the cluster that have the same ‘Space/Location’ index value are marked out.
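The clustering step can be sketched in Python under some simplifying assumptions that are not from the source: each sentence is a dictionary with hypothetical keys `id`, `subjects`, `time` and `location`; "same or closest" subject is reduced to an exact match on the lower-cased first subject; times are ISO date strings, so string order equals temporal order; and sentences without a time are placed last.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_sentences(units: List[dict]) -> Dict[str, List[dict]]:
    """Group sentences by subject, then sort each cluster by time."""
    clusters: Dict[str, List[dict]] = defaultdict(list)
    for unit in units:
        # Simplification: exact match on the first subject only.
        clusters[unit["subjects"][0].lower()].append(unit)
    for members in clusters.values():
        # Earliest first; sentences with no time index go to the end.
        members.sort(key=lambda u: (u["time"] is None, u["time"] or ""))
    return dict(clusters)


def mark_shared_locations(cluster: List[dict]) -> Dict[str, List[int]]:
    """Return the locations shared by two or more sentences in a cluster."""
    by_location: Dict[str, List[int]] = defaultdict(list)
    for unit in cluster:
        if unit["location"] is not None:
            by_location[unit["location"]].append(unit["id"])
    return {loc: ids for loc, ids in by_location.items() if len(ids) > 1}


units = [
    {"id": 1, "subjects": ["Storm"], "time": "2003-02-01", "location": "Florida"},
    {"id": 2, "subjects": ["storm"], "time": "2003-01-15", "location": "Florida"},
    {"id": 3, "subjects": ["Government"], "time": None, "location": None},
    {"id": 4, "subjects": ["storm"], "time": None, "location": "Texas"},
]
clusters = cluster_sentences(units)
shared = mark_shared_locations(clusters["storm"])
```

A fuller version would replace the exact subject match with some similarity measure, which the slides do not specify.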

11/25

System Structure & Algorithms: Cluster-Filtering

From the defined index priorities, many clusters focused on different ‘Subjects’ indices are established.

But how can the outstanding clusters be separated from the others?

A new method has been devised to pick out the most outstanding clusters by computing the dispersion of their sizes.

First, we rank these clusters by size, from largest to smallest.

Second, we start from the largest cluster, the one with the largest number of words among all clusters, because the largest cluster has to be chosen for its importance.

12/25

System Structure & Algorithms: Cluster-Filtering

The next question is which cluster should be included last; in other words, how many clusters have to be selected.

Following the list of ranked clusters, we find the cluster whose size is the largest among the clusters whose sizes are below 20% of the size of the largest cluster, and we call this cluster the ‘end-cluster’.

Any cluster below the end-cluster in the list of ranked clusters is discarded.

If there are more than 10 clusters from the largest cluster to the end-cluster, only the first 10 clusters are selected and the 10th cluster becomes the new ‘end-cluster’.
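The filtering rule above can be sketched as a short Python function. Cluster size is taken to be a word count, and the function name `filter_clusters` and its parameters are hypothetical. One case is not covered by the description and is handled here by assumption: if no cluster falls below the 20% threshold, every cluster is kept, subject to the cap of 10.

```python
from typing import Dict, List


def filter_clusters(sizes: Dict[str, int],
                    ratio: float = 0.2,
                    max_clusters: int = 10) -> List[str]:
    """Return the names of the selected clusters, largest first."""
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    threshold = ratio * ranked[0][1]
    # Assumption: with no cluster below the threshold, the last one ends the list.
    end = len(ranked) - 1
    for i, (_, size) in enumerate(ranked):
        if size < threshold:
            end = i  # first (hence largest) cluster below the threshold
            break
    # Keep the largest cluster through the end-cluster, capped at max_clusters.
    kept = ranked[: min(end + 1, max_clusters)]
    return [name for name, _ in kept]


sizes = {"storm": 500, "government": 300, "rescue": 150,
         "insurance": 90, "weather": 60, "misc": 20}
selected = filter_clusters(sizes)
```

With these sizes the threshold is 100 words, so "insurance" (90 words) is the end-cluster and the two smaller clusters are discarded.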

13/25

System Structure & Algorithms: Cluster-Reduction

WordNet is applied to process synonymy, antonymy, hypernymy and hyponymy in the selected clusters.

Sentences are compared at the phrase level to get rid of ‘redundant’ information, sentences or clauses.

In order to facilitate the reduction of the sizes of the chosen clusters, the positions of sentences have also been taken into consideration.

14/25

System Structure & Algorithms: Size-Control

The words in the output of Cluster-Reduction are counted.

If the word count is over the required size of 100 words, the procedure loops back to Cluster-Reduction, taking the output as the new input of Cluster-Reduction, until the word count drops into the zone of 100 ± 20 words.
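A minimal sketch of the Size-Control loop, with several assumptions: the loop stops once the count is at or below the upper edge of the zone (120 words), `reduce_once` is a hypothetical stand-in for one pass of Cluster-Reduction, and a guard is added so the loop ends if a pass removes nothing. The slides do not say what happens if a pass drops the count below 80 words, so that case is not handled.

```python
from typing import Callable, List


def word_count(sentences: List[str]) -> int:
    return sum(len(s.split()) for s in sentences)


def size_control(sentences: List[str],
                 reduce_once: Callable[[List[str]], List[str]],
                 target: int = 100,
                 tolerance: int = 20,
                 max_rounds: int = 100) -> List[str]:
    """Feed the output back into the reduction step until it is short enough."""
    current = list(sentences)
    for _ in range(max_rounds):
        if word_count(current) <= target + tolerance:
            break
        reduced = reduce_once(current)
        if word_count(reduced) >= word_count(current):
            break  # the reduction step made no progress; avoid looping forever
        current = reduced
    return current


def drop_last_sentence(sentences: List[str]) -> List[str]:
    # Placeholder reduction, NOT the system's Cluster-Reduction.
    return sentences[:-1]


draft = [" ".join(["word"] * 30) for _ in range(6)]  # 6 sentences, 180 words
summary = size_control(draft, drop_last_sentence)
```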

15/25

Evaluation

There were 18 multi-document summarization systems, including two systems (systems 2 and 3) used as guidelines, involved in the evaluation for Task 2 in DUC2003.

Three human summarizers and one model summarizer were also involved for each document set.

To distinguish the four types of summarizers in the following analysis:

all 18 multi-document summarization systems are called systems or peers;

systems 2 and 3 are called guidelines;

the three human summarizers, listed in the Peer ID column together with the 18 systems, are called human-summarizers;

the one model summarizer is called the model-summarizer.

AMDS_hw was marked as system 6.

16/25

Evaluation

Evaluation on Mean Coverage and Mean Length-Adjusted Coverage

17/25

Evaluation

18/25

Evaluation

19/25

Evaluation

20/25

Evaluation

Evaluations on Counts of Quality Questions with Non-zero answers (CQQN) and Mean of the Quality Question Scores (MQQS)

12 Summary Quality Questions were asked for the counting of errors in each system/peer-produced summary:

About how many gross capitalization errors are there?

About how many sentences have incorrect word order?

About how many times does the subject fail to agree in number with the verb?

About how many of the sentences are missing important components (e.g. the subject, main verb, direct object, modifier), causing the sentence to be ungrammatical, unclear, or misleading?

About how many times are unrelated fragments joined into one sentence?

About how many times are articles (a, an, the) missing or used incorrectly?

About how many pronouns are there whose antecedents are incorrect, unclear, missing, or come only later?

For about how many nouns is it impossible to determine clearly who or what they refer to?

About how many times should a noun or noun phrase have been replaced with a pronoun?

About how many dangling conjunctions are there?

About how many instances of unnecessarily repeated information are there?

About how many sentences strike you as being in the wrong place because they indicate a strange time sequence, suggest a wrong cause-effect relationship, or just don’t fit in topically with neighboring sentences?

21/25

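Evaluation

The score Qn for each quality question was calculated from the number of errors (NoE) as below:

$$
Q_n =
\begin{cases}
0, & \mathrm{NoE} = 0 \\
1, & 1 \le \mathrm{NoE} \le 5 \\
2, & 6 \le \mathrm{NoE} \le 10 \\
3, & \mathrm{NoE} \ge 11
\end{cases}
$$

Count of Quality Questions with Non-zero answers (CQQN) and Mean of the Quality Question Scores (MQQS):

$$
\mathrm{TQQS} = \sum_{n=1}^{12} Q_n, \qquad
\mathrm{MQQS} = \frac{\mathrm{TQQS}}{\mathrm{CQQN}}, \qquad
\mathrm{Mean\_MQQS} = \frac{\sum_{30} (\mathrm{CQQN} \times \mathrm{MQQS})}{30}
$$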
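As a worked illustration of the per-summary scores, the sketch below computes Qn, CQQN, TQQS and MQQS from twelve error counts. It assumes the error bands are inclusive (0, 1 to 5, 6 to 10, 11 or more) and, since the case is not defined above, returns an MQQS of 0.0 when no question has a non-zero answer. The function names and the sample error counts are hypothetical, not evaluation data.

```python
from typing import List, Tuple


def quality_score(noe: int) -> int:
    """Map a number of errors (NoE) to a question score Qn in 0..3."""
    if noe < 0:
        raise ValueError("number of errors cannot be negative")
    if noe == 0:
        return 0
    if noe <= 5:
        return 1
    if noe <= 10:
        return 2
    return 3


def summary_scores(error_counts: List[int]) -> Tuple[int, int, float]:
    """Return (CQQN, TQQS, MQQS) for one summary."""
    scores = [quality_score(n) for n in error_counts]
    cqqn = sum(1 for q in scores if q > 0)   # questions with non-zero answers
    tqqs = sum(scores)                       # total of the question scores
    mqqs = tqqs / cqqn if cqqn else 0.0      # assumption: 0.0 when CQQN is 0
    return cqqn, tqqs, mqqs


# Made-up error counts for the 12 questions of one summary.
cqqn, tqqs, mqqs = summary_scores([0, 2, 7, 12, 0, 0, 1, 0, 0, 0, 0, 0])
```

Here four questions have non-zero answers with scores 1, 2, 3 and 1, so CQQN is 4, TQQS is 7 and MQQS is 1.75.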

22/25

Evaluation

23/25

Evaluation

24/25

Conclusions

From the above analysis, the newly devised system, system 6, showed good performance.

The human-summarizers and model-summarizers are still better than system 6.

The newly proposed system is still under development, and this exercise was extremely useful for revealing the need for improvement in the phrase-level comparison and Cluster-Reduction modules.

25/25

Future Research

The exercise has helped in identifying several areas for improving the performance of system 6:

Further analysis of why the performance of system 6 differs so much among the given document sets: are the reasons related to the content or style of the texts in each document set?

How to increase the number of units, reduce the content redundancy and increase the coverage of each unit in every summary?
