
Textual Summarization of Scientific Publications and Usage Patterns

Aybüke Öztürk

October 21, 2012

Master’s Thesis

Under the supervision of:

Dr. Jerry Eriksson, Umeå University, Sweden

Examined by:

Prof. Frank Drewes, Umeå University, Sweden

Umeå University

Department of Computing Science

SE-901 87 UMEÅ, SWEDEN

Abstract

In this study, we propose textual summarization for scientific publications and mobile phone usage patterns. Textual summarization is a process that takes a source document or a set of related documents, identifies the most salient information, and conveys it in less space than the original text. The increasing availability of information has necessitated intensive research on textual summarization within Information Retrieval and Natural Language Processing (NLP), because textual summaries are easier to read and provide efficient access to large repositories of content. For example, snippets in web search serve users as textual summaries. While summarization tools exist, they are either not adapted to collections of scientific documents or they summarize short texts such as news. In the first part of this study, we adapt the MEAD 3.11 summarization tool [19] to propose a method for building summaries of a set of related scientific articles by exploiting the structure of scientific publications in order to focus on the parts that are known to be the most informative in such documents. In the second part, we generate a natural language statement that gives a more readable form of a given symbolic pattern extracted from the Nokia Challenge data. The motivation is that the availability of mobile phone usage details enables new opportunities to better understand the interest of user populations in mobile phone applications. To evaluate the first part of the study, we use Amazon Mechanical Turk (MTurk) to validate the summarization output.

Acknowledgements

This research project would not have been possible without the support of many people. I would like to express my greatest gratitude to the people who have helped and supported me throughout my project.

I am grateful to Dr. Sihem Amer Yahia and Prof. Marie Christine Rousset, who gave me the chance to work on such an interesting project in their research team. They were abundantly helpful and offered invaluable assistance, support and guidance. I thank my internal supervisor Dr. Jerry Eriksson, who helped me a lot. Special thanks go to Prof. Dr. Henning Christiansen and Prof. Frank Drewes, who gave me valuable advice on my project report.

I am grateful to my colleagues, Ms. Ruth Garcia, Mr. Shameem Ahamed Puthiya Parambath, and Mr. Behrooz Omidvar Tehrani, for their continuous support for the project, from initial help through ongoing encouragement to this day.

I wish to thank my parents and friends for their undivided support and interest; they inspired me and encouraged me to go my own way, and without them I would have been unable to complete my project. And especially to God, who made all things possible.


Contents

Abstract

Acknowledgements

List of Figures

1 Introduction
    1.1 General Problem Statement
    1.2 Context of the Work
    1.3 Outline of the Thesis

2 Multi-document Summarization
    2.1 Related Work
    2.2 Our Proposed Method For Constructing Summaries

3 Experiments And Improvements Of The Method
    3.1 Experimental Protocol Based On Amazon Mechanical Turk
    3.2 Experimental Setting
    3.3 Experimental Results
    3.4 Proposed Improvements

4 Pattern Summarization
    4.1 Our Proposed Method For Generating Sentences

5 Conclusion and Future Work
    5.1 Conclusion
    5.2 Future Work

Appendices

Bibliography

List of Figures

1.1 The SUNFLOWER Project Architecture
1.2 The Nokia Project Architecture

2.1 Fuzzy Logic Summarization Architecture
2.2 Process of Summarization
2.3 Example of Irrelevant Data
2.4 Example of Centroid Feature
2.5 List of Centroid Words
2.6 Example of Without Keyword Summary
2.7 Example of With Keyword Summary
2.8 Example of MEAD Score
2.9 Example of Sentence Extraction
2.10 Example Text Before Rephrasing
2.11 Example Text After Rephrasing

3.1 Qualification Test Example
3.2 Example of Qualification Test Image
3.3 Independent evaluation question
3.4 Comparative evaluation question
3.5 Independent Evaluation Result
3.6 Comparative Evaluation Result
3.7 Independent Evaluation Result
3.8 Comparative Evaluation Result

4.1 Taxonomy for Application Attributes
4.2 Taxonomy for Demographic Information Attributes
4.3 Step 1 for Sentence Generation
4.4 Step 2 for Sentence Generation
4.5 Step 3 for Sentence Generation
4.6 Step 4 for Sentence Generation
4.7 Step 5 for Sentence Generation

.1 MTurk Qualification Test
.2 Independent Evaluation example 1
.3 Independent Evaluation example 2
.4 Comparative Evaluation example 1
.5 Comparative Evaluation example 2
.6 SUNFLOWER Automatic summary example


Chapter 1

Introduction

1.1 General Problem Statement

Summaries should convey the most important points of the original text in as few words as possible. As the amount of online information grows, systems that can automatically summarize one or more documents become more desirable. Summarization provides flexibility and convenience: headline news for informing, TV guides for decision making, abstracts of papers for saving time, and condensed text for the small display area of a personal digital assistant (PDA) screen are all examples of summaries. On the other hand, evaluating the quality of summaries is a very difficult task, because summarization has to deal with relevance, which is not a clear notion. People identify the information that they think will be of interest to the readers. Summarization therefore depends not only on a short form of the input document but also on the reader’s state of mind. In other words, who the reader is, what he knows before reading the summary, and why he wants to know about the input texts are significant aspects of summarization. The psycho-linguistic and computational-linguistic communities agree that modelling the reader’s state of mind is a complicated task, if not entirely impossible [14].

Many approaches address the problem by building systems that depend on the type of summary required. Summarization is useful for different purposes and in varied settings:

• Abstractive summaries aim to convey the most important information in the input. They may reuse phrases or clauses from the set of related documents, but overall they are expressed in the words of the summary author, which requires a lot of semantic interpretation and sentence synthesis. Some abstractive systems, such as SUMMONS [31], are designed to track updates across different news reports. These systems aim to present similarities, differences, contradictions, and generalizations among sources of information. Replicating these systems is difficult because they rely heavily on the adaptation of internal tools.

• Extractive summaries [1][15] are produced by selecting sentences and combining them as they appear in the document or in the set of documents. Text extraction means identifying the most relevant passages in one or more documents, often using standard statistically based Information Retrieval techniques augmented with natural language processing and heuristics. Extractive summarization systems should produce higher quality summaries and usually consist of two parts. The first part deals with the selection of important content, and the second part deals with the presentation of the selected content. Extractive summarization is discussed further in the next chapter.

• Indicative summaries enable quick scanning of search results; they are two- or three-line summaries that indicate the contents of the source documents. Informative summaries are written to provide a brief description of the original document and convey an idea of what the whole document is about.

• Keyword summaries aim to compose a short text from a set of significant words or phrases mentioned in the given documents. Sentences extracted from the text should be the most important and representative ones. Extraction of the most important and representative phrases is called keyphrase extractive summarization, which constrains the output to phrases that appear in the document [30].

• Query-focused summaries aim to summarize the input document with respect to a given query. Snippets for search engines are a particularly useful application [2]. Query-focused summarization is very similar to question answering: these systems provide a summary of documents based on a query or a question, so the generated summary is shaped by the interest of the user. Update summaries are time-sensitive and express recent updates to documents; they help audiences or followers access new information.

• Single-document summaries provide a more compact text that captures the essence of the original document. Single-document summarization is a difficult task in itself, but multi-document summarization is harder still. Multi-document summaries are compressed summaries generated from a set of related documents. They simplify the search through sources and reduce its time by pointing to the most relevant source documents [44]. Multi-document summarization is discussed further in the next chapter.

• Pattern summaries aim to provide a natural language text that conveys the formal meaning of a given symbolic rule or pattern extracted from data by automatic pattern mining techniques.

1.2 Context of the Work

My work has been conducted within two different projects: SUNFLOWER and the Nokia Challenge. Short overviews of these projects are given below.

1.2.1 The SUNFLOWER

The SUNFLOWER [11] is a system that employs collaborative editing to summarize large corpora of literature articles pertaining to a certain topic. As far as we could find, no comparable system is available. We believe that it is interesting and challenging to combine automatic summarization techniques with human intelligence. To accomplish this, related articles are first bundled based on content and article metadata such as authors and citations. In the second step, each bundle is summarized by extracting key sentences from its constituent articles. The last step assigns summaries to subject experts according to their skills and helps them to collaboratively edit and improve the automatically generated summaries. The SUNFLOWER is developed in collaboration with Bloomsbury Publishing [22], a publisher of scientific material, to build an in-house collaborative editing platform. The project architecture is shown in Figure 1.1 (footnote 1), which was drawn during the project. A screenshot of the Summarization module from the web interface of our implementation is given in the Appendix.

As shown in Figure 1.1, the project is organised in the following steps:

• Bundling:

The SUNFLOWER starts by pre-processing articles using Latent Dirichlet Allocation (LDA) [21] to associate a vector of topic weights with each article. In order to identify different sub-sections in the desired related work output, the SUNFLOWER uses a document similarity measure that combines content similarity with extended co-authorship similarity and citation similarity to bundle articles. Then, an agglomerative hierarchical clustering algorithm as in [20] finds the bundles. Finally, the SUNFLOWER associates with each bundle a collection of keywords describing it, using the LDA vectors of its constituent articles. More details about these notions and methods can be found in [45].

• Summarization:

Our contribution is the following: given a set of bundles and their keywords, the SUNFLOWER uses an adaptation of the open-source summarization toolkit MEAD, which implements extractive summarization techniques. The well-known structure of scientific publications allows us to experiment with summarizing not only whole scientific articles but also parts of articles, such as their abstract, introduction and related work sections, where contributions tend to be formulated. In addition, since each bundle comes with a set of keywords, they can be used to bias the summarization towards the sentences that best represent the keywords. The output of this step is a set of bundles, their summaries and their keywords. We describe our summarization process in more detail in Section 2.2.

1 The SUNFLOWER Project Architecture is taken from [11] by kind permission of the authors

• Collaborative Editing:

An environment is created where skilled workers are associated with bundles for editing. Workers and bundles are matched using their respective skills and sub-categories. An assignment heuristic is then used to optimize some objective function. Examples of objective functions include minimizing the idle time of workers and balancing the number of edits across bundles. The formalization of the collaborative model along with the objective functions gives rise to a family of efficient assignment heuristics. Please refer to [43] for more information about the methods for task assignment and collaborative editing.

Figure 1.1: The SUNFLOWER Project Architecture

1.2.2 The Nokia Challenge

The availability of mobile phone usage details and user demographics enables new opportunities to better understand the interest of user populations in mobile phone applications. We have participated in a pattern mining project over a data set provided by Nokia about phone usage by users for whom demographic information is available. Frequent patterns encapsulate a lot of information. At the same time, some challenges limit the usability of frequent patterns. One challenge is that the number of patterns can run into the millions and that patterns with many items are hard to read. Another challenge is that the analyst may be unable to understand the meaning of patterns with many items. The fundamental goal of the project is to provide the analyst with an interactive exploration framework for frequent patterns and to translate these patterns into a more readable form.

The steps defined in this project are:

• Pattern mining is used to discover hidden dependencies between applications and their usage. Pattern attributes correspond to values of user demographics, such as Young and Female, as well as applications like Desktop and Calendar.

• Given the large space of possible patterns, an interactive framework based on usage-based primitives is proposed that helps to explore the space of discovered patterns by abstracting and refining them on demand. Abstraction generalizes attributes in a pattern, which leads to more readable patterns whose semantics the analyst can find more easily. Refinement finds the attributes that are most characteristic of sets of users, according to a saliency measure, and presents these attributes through visualization. Please refer to [32] for more information about the methods for pattern mining, abstraction, and refinement.

• Within this report, our contribution is to generate natural language statements that describe patterns in a more readable form. Figure 1.2 (footnote 2) shows the architecture of the project, which was drawn during the project.

Figure 1.2: The Nokia Project Architecture

2 The Nokia Project Architecture is taken from [32] by kind permission of the authors

1.3 Outline of the Thesis

This report is organized as follows:

• Chapter 2 presents the various existing techniques for multi-document summarization and focuses on the summarization tool MEAD, which we have used to implement a tool for summarizing a set of scientific documents.

• Chapter 3 describes the experimental protocol and settings, and discusses how analysing the experimental results has guided our attempts to improve them and how new experiments have validated the improvements.

• Chapter 4 describes the method of sentence generation for symbolic patterns.

• Finally, Chapter 5 presents future plans and the conclusion of this report.


Chapter 2

Multi-document Summarization

Automatic summarization is defined as the creation of a shortened version of a document or a set of documents by a computer program [46]. There are many techniques for summarizing a document, and these techniques can be adapted to summarizing a set of documents (multi-document summarization) [4][7]. We give an overview of some of them in Section 2.1. Most existing work deals with the summarization of non-technical textual documents such as newspaper articles. In our work, we have studied how to use and adapt an existing tool, MEAD, to build a prototype for summarizing a bundle of scientific articles. We explain our approach in Section 2.2.

2.1 Related Work

Summarization techniques have been studied and discussed as a research subject since the publication of Luhn’s paper [5]. Luhn first stemmed words to their root forms and deleted stop words. He then compiled a list of content words sorted by decreasing frequency, so that the index of a word in the list provides a measure of its significance. A significance factor was derived that reflects the number of occurrences of significant words within a sentence. All sentences are ranked in order of their significance factor, and the top-ranking sentences are finally selected as the summary [6].
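The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not Luhn's original implementation: the stop-word list, the crude suffix-stripping stemmer, the function names and the parameters `top_words` and `summary_size` are all assumptions made for the example, and the significance factor is simplified to a plain count of significant words per sentence.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_words(sentence):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def luhn_summary(sentences, top_words=5, summary_size=1):
    """Rank sentences by how many high-frequency content words they contain."""
    frequencies = Counter(w for s in sentences for w in content_words(s))
    significant = {w for w, _ in frequencies.most_common(top_words)}
    scores = [sum(w in significant for w in content_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:summary_size])]
```

Returning the chosen sentences in document order rather than score order is a design choice of the sketch; it tends to keep an extract readable.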

Baxendale [33] worked on the position feature, which has been used in many complex machine learning based systems. He analysed 200 paragraphs and found that in 85% of the paragraphs the topic sentence was the first sentence and in 7% it was the last sentence. Thus, a simple and accurate way to select a topic sentence is to choose the first or the last sentence of a paragraph. Edmundson [34] describes a system that produces document extracts. The two features of word frequency and positional importance were incorporated from the previous two works. Additionally, one more feature was used, which checks whether the sentence is a heading or title. Weights were attached to each of these features manually to score each sentence.

Unsupervised methods for sentence extraction are an essential subject in extractive summarization because they do not require any external sources or models, nor linguistic processing and interpretation. Over the last fifty years, machine learning techniques have been successfully applied to summarization. The first method, Naive Bayes [17], is used in query-focused multi-document summarization systems to categorize each sentence as worthy of extraction or not. The Naive Bayes classifier described by Kupiec et al. [36] is based on the system of Edmundson [34] and introduces two new features: sentence length and the presence of uppercase words. The assumption is that the employed features are independent of each other given the class. Each sentence is given a score according to (1), and only the n top-scoring sentences are extracted.

The classification probabilities are learnt statistically from the training data. Let S be the set of sentences that make up the summary, s a sentence from the document collection, and F1, F2, ..., Fk the features used in classification. The formula below gives the probability that sentence s will be chosen to form the summary given that it possesses these features.

\[
P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(s \in S)\,\prod_{i=1}^{k} P(F_i \mid s \in S)}{\prod_{i=1}^{k} P(F_i)} \tag{1}
\]

The results of the Naive Bayes experiment for sentence selection show that a combination of the location of the sentence, word frequency, the presence of uppercase words and sentence length gave the best results for single-document summarization.
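As a worked illustration of the scoring formula, the sketch below estimates the required probabilities by counting over a toy training set with binary features and then scores a sentence. The function names, the data layout and the add-one smoothing are assumptions introduced for the example; they are not taken from [36].

```python
def train(examples):
    """examples: list of (features, in_summary), features being a tuple of 0/1 values."""
    n = len(examples)
    k = len(examples[0][0])
    positives = [f for f, label in examples if label]
    prior = len(positives) / n  # estimate of P(s in S)
    # Add-one smoothing (an assumption) keeps every estimate strictly between 0 and 1.
    p_f_given_s = [(sum(f[i] for f in positives) + 1) / (len(positives) + 2)
                   for i in range(k)]
    p_f = [(sum(f[i] for f, _ in examples) + 1) / (n + 2) for i in range(k)]
    return prior, p_f_given_s, p_f


def score(features, model):
    """Naive Bayes score: prior times the product of P(F_i | s in S), over the product of P(F_i)."""
    prior, p_f_given_s, p_f = model
    numerator, denominator = prior, 1.0
    for value, p_cond, p_marg in zip(features, p_f_given_s, p_f):
        numerator *= p_cond if value else 1 - p_cond
        denominator *= p_marg if value else 1 - p_marg
    return numerator / denominator
```

Because feature independence is only an assumption, the value is best read as a ranking score; on real data it is not guaranteed to stay below one.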

Aone et al. [35] also combined a Naive Bayes classifier with the TF*IDF (term frequency, inverse document frequency) feature. The TF value is the number of times a word appears in all documents divided by the total number of words in all documents. The IDF value is calculated as the logarithm of the number of documents divided by the number of documents in which the word appears [13].

\[
\mathrm{tfidf}(t_j) = \frac{D(t_j)}{|D|} \cdot \log\left(\frac{C}{C(t_j)}\right)
\]

In this formula, C is the number of documents in the collection, C(tj) is the number of documents containing term tj, |D| is the total number of words in document D, and D(tj) denotes how many times tj occurs in document D.
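The formula can be checked on a small example. The sketch below is only an illustration: the function name and the representation of documents as token lists are assumptions, and the natural logarithm is used because the text does not fix a base.

```python
import math


def tfidf(term, document, collection):
    """document: list of tokens; collection: list of token lists that includes document."""
    tf = document.count(term) / len(document)              # D(t_j) / |D|
    containing = sum(1 for d in collection if term in d)   # C(t_j)
    if containing == 0:
        return 0.0  # term never occurs, so there is nothing to weight
    return tf * math.log(len(collection) / containing)     # log(C / C(t_j))
```

A term that occurs in every document gets the weight zero, since log(C / C(tj)) is then log 1.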

Another method is based on Hidden Markov Models (HMM) [16][37] and is used for single-document summarization. The essential reason for using a sequential model is to account for local dependencies between sentences. Unlike the previous method, an HMM does not assume that the probability that sentence s is in the summary is independent of whether sentence s-1 is in the summary. As mentioned above, Naive Bayes methods assume the independence of features, whereas the HMM uses a joint distribution over the feature set.

The last well-known technique is the graph-based method [18], which is used for finding similarities and dissimilarities in pairs of documents. The importance of a sentence is determined from a cosine similarity matrix, in which each entry is the similarity between the corresponding pair of sentences. After stop-word removal and stemming, the sentences in the documents are represented as nodes in an undirected graph, with one node per sentence. Two sentences are connected by an edge if they share some common words, that is, if their TF*IDF cosine similarity is above some threshold. In this way, word frequency plays a direct role in determining the structure of the graph.

LexRank [3] is a system that uses graph-based methods for summarization. Its authors compute an idf-modified cosine similarity for use in the graph-based method. Algorithm 1 summarizes how to compute LexRank scores for a given set of sentences [3]. According to [38], LexRank is not practical for multi-document summarization of scientific papers. In addition, LexRank is a sophisticated and computationally expensive method, and it extracts almost the same sentences as the baseline MEAD Original method. These are the major reasons why we did not adopt LexRank in our system for building summaries of a set of related scientific articles.
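For readers who prefer running code to pseudocode, the computation outlined in Algorithm 1 can be sketched as follows. Several simplifications are assumptions of this sketch: a plain bag-of-words cosine stands in for the idf-modified cosine of [3], the function and parameter names are hypothetical, and the final scores are obtained with a fixed number of power iterations on the row-normalized matrix, the step that [3] calls the power method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Plain bag-of-words cosine; a stand-in (assumption) for the idf-modified cosine."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def lexrank(sentences, threshold=0.1, iterations=100):
    t = len(sentences)
    # Thresholded 0/1 adjacency matrix, as in the first double loop of Algorithm 1.
    matrix = [[1.0 if cosine(sentences[i], sentences[j]) > threshold else 0.0
               for j in range(t)] for i in range(t)]
    # Divide each row by the degree of its node (second double loop).
    for row in matrix:
        degree = sum(row)
        if degree:  # an empty sentence has an all-zero row; leave it untouched
            for j in range(t):
                row[j] /= degree
    # Power iteration towards the stationary distribution of the matrix.
    scores = [1.0 / t] * t
    for _ in range(iterations):
        scores = [sum(scores[i] * matrix[i][j] for i in range(t)) for j in range(t)]
    return scores
```

In this unweighted form, a sentence that is similar to many other sentences ends up with a higher score than one connected to few.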

In the NLP approach, text summarization is implemented based on fuzzy logic [29]. The features of a text, such as sentence length, location in the document, and similarity to keywords, which are described in detail in the next section, serve as the input of the fuzzy system. A fuzzy logic system is designed by selecting fuzzy rules and membership functions, and its performance is affected by this selection.

The fuzzy logic system consists of a fuzzifier, an inference engine, a defuzzifier, and the fuzzy knowledge base. In the fuzzifier, inputs are translated into linguistic values using a membership function applied to the input linguistic variables. After fuzzification, the inference engine refers to the rule base containing fuzzy IF-THEN rules to derive the linguistic values. In the last step, the output linguistic variables from the inference are converted to the final values by the defuzzifier using a membership function for representing the final sentence score.

Algorithm 1 Computing LexRank Scores for Sentences

Input: An array L of t sentences, cosine threshold m
Output: An array S of LexRank scores

    Array CosineMatrix[t][t]
    Array Degree[t]
    Array S[t]
    for i ← 1, t do
        for j ← 1, t do
            CosineMatrix[i][j] = idf-cosine(L[i], L[j])
            if CosineMatrix[i][j] > m then
                CosineMatrix[i][j] = 1
                Degree[i]++
            else
                CosineMatrix[i][j] = 0
            end if
        end for
    end for
    for i ← 1, t do
        for j ← 1, t do
            CosineMatrix[i][j] = CosineMatrix[i][j] / Degree[i]
        end for
    end for
    S = PowerMethod(CosineMatrix, t)    (power method step, as in [3])
    Return S

To implement text summarization based on fuzzy logic, the features are first used as input to the fuzzifier. Then, the input membership function for each feature is divided into different fuzzy sets, such as high (H) and very high (VH) importance values. In the inference engine, the most important part of the procedure is the definition of the fuzzy IF-THEN rules. The important sentences are extracted by these rules according to the selected features. A sample IF-THEN rule is shown below:

IF (SentenceLength is H) and (TermFreq is VH) and (SentencePosition is H) THEN (Sentence is important)

Likewise, the last step in the fuzzy logic system is the defuzzification.

The output membership function is divided into three membership functions, Unimportant, Average, and Important, which convert the fuzzy results from the inference engine into a final score for each sentence. A value from zero to one is thus obtained for each sentence based on the selected sentence features. The obtained value determines the degree of importance of the sentence in the final summary. Figure 2.1 shows the architecture of fuzzy logic summarization, drawn based on [29].

As mentioned above, multi-document summarization differs from single-document summarization in that document selection, compression and redundancy are critical issues in the formation of useful summaries [10]. Based on this, we can discuss how sentences are extracted for multi-document summarization.
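The fuzzification, inference and defuzzification steps can be made concrete with a toy example. Everything specific in the sketch below is an assumption chosen for illustration and is not taken from [29]: the triangular membership functions, the three input sets L, M and H, the three rules, the use of the minimum for "and", and the weighted-average defuzzification.

```python
def triangular(x, left, peak, right):
    """Triangular membership function; returns a degree in [0, 1]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(value):
    """Map a feature value in [0, 1] to degrees of three assumed fuzzy sets."""
    return {
        "L": triangular(value, -0.5, 0.0, 0.5),
        "M": triangular(value, 0.0, 0.5, 1.0),
        "H": triangular(value, 0.5, 1.0, 1.5),
    }


# Representative output value of each output set (assumption).
OUTPUT_CENTERS = {"Unimportant": 0.0, "Average": 0.5, "Important": 1.0}

# Illustrative rule base: (required input set per feature, output set).
RULES = [
    ({"length": "H", "term_freq": "H", "position": "H"}, "Important"),
    ({"length": "M", "term_freq": "M", "position": "M"}, "Average"),
    ({"length": "L", "term_freq": "L", "position": "L"}, "Unimportant"),
]


def sentence_score(features):
    """features: {feature name: value in [0, 1]}; returns a score in [0, 1]."""
    memberships = {name: fuzzify(v) for name, v in features.items()}
    weighted, total = 0.0, 0.0
    for conditions, output in RULES:
        # "and" is taken as the minimum of the membership degrees.
        strength = min(memberships[name][label] for name, label in conditions.items())
        weighted += strength * OUTPUT_CENTERS[output]
        total += strength
    # Weighted-average defuzzification; 0.0 when no rule fires.
    return weighted / total if total else 0.0
```

For instance, a sentence whose three features all equal 0.75 belongs half to M and half to H, so the Average and Important rules fire with equal strength and the final score lies midway between their centres.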

Figure 2.1: Fuzzy Logic Summarization Architecture

The main differences between single- and multi-document summarization are the following:

• Finding a group of documents written about the same topic is much harder than working on a single document.

• The size of the summary relative to the size of the input is much smaller for a multi-document set than for a single document.

For both single- and multi-document summarization, the co-reference problem is a major issue [15]. The essential problem is to find the sentences that are actually important enough to be included in a general-purpose summary. Multi-document summarization must meet many requirements. For instance, the summary should reflect the essential points of the documents but should also minimize redundancy. Summaries should be relevant and readable to the user and should outline related information. Finally, it is important that summaries present the most relevant and diverse information first, so that the reader gets the maximal information content even if they stop reading the summary early.

Thanks to today’s technology, information is increasingly being produced in digital formats. As a consequence, the need for automatic text summarization has risen in recent years. In particular, multi-document summarization has become popular as a way to create and share knowledge in an appropriate form, such as building a related work section for an article. A number of multi-document summarization systems have been developed to help users get an overview of a set of articles. The best-known example is the MEAD summarization tool. Some other free systems are available as well [23][24][25], for example SweSum [26], which is mainly a Swedish text summarizer, and EstSum [27], a summarizer for Estonian newspaper texts. These systems are typically evaluated on short documents such as newspaper articles. The main reason is the lack of a large, publicly available collection of scientific articles with ideal summaries [28].

MEAD can be used for single-document summarization and for multi-document summarization (clusters of related documents). It takes input documents in textual form only. All data in MEAD is stored as XML. For each sentence, it computes a score that combines different scores, depending on the features selected by the user of MEAD. As the output summary, it provides the sentences having the highest scores. MEAD combines many summarization methods, such as the SimWithFirst feature, which computes the cosine overlap with the first sentence in the document (or with the title, if it exists), and the QueryOverlap feature, which computes the cosine overlap with a query sentence or phrase. MEAD also includes two baseline summarizers, lead-based and random. Lead-based summaries are produced by selecting the first sentence of each document, then the second sentence of each, and so on until the desired summary size is reached. A random summary consists of enough randomly selected sentences from the cluster to produce a summary of the desired size. We use graph-based summaries in our system, which are more appropriate for article summaries. MEAD has been primarily used for summarizing documents in English.
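The two baselines are simple enough to sketch directly. The code below is an illustration of the behaviour described above, not MEAD's implementation: the function names are hypothetical, documents are assumed to be lists of sentences, and the summary size is assumed to be counted in sentences.

```python
import random


def lead_summary(documents, size):
    """Take the first sentence of each document, then the second of each, and so on."""
    summary = []
    longest = max((len(doc) for doc in documents), default=0)
    for position in range(longest):
        for doc in documents:
            if position < len(doc):
                if len(summary) == size:
                    return summary
                summary.append(doc[position])
    return summary


def random_summary(documents, size, seed=None):
    """Draw sentences at random, without repetition, from the whole cluster."""
    pool = [sentence for doc in documents for sentence in doc]
    rng = random.Random(seed)
    return rng.sample(pool, min(size, len(pool)))
```

Passing a `seed` makes the random baseline reproducible, which is convenient when it serves as a point of comparison in repeated experiments.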

The settings in MEAD that can be set by the user are the following:

• the minimum length (number of words) of a sentence to be included in a summary.

• the number of sentences in the output summary (defined as a percentage).

• the processing of provided keywords to choose the sentences to put in the summaries.

• the weights used to combine the scores based on the above features.
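How these settings interact can be sketched as follows. This is not MEAD's code or interface: the function names, the feature names and the data layout are hypothetical, and the only behaviour assumed is a weighted linear combination of per-sentence feature values, a minimum-length cutoff, and a summary size given as a percentage of the input sentences.

```python
def combined_scores(sentences, features, weights, min_length=9):
    """features: {name: [value per sentence]}; weights: {name: weight}."""
    scores = []
    for i, sentence in enumerate(sentences):
        if len(sentence.split()) < min_length:
            scores.append(0.0)  # sentences below the length cutoff are never preferred
            continue
        scores.append(sum(weights[name] * values[i] for name, values in features.items()))
    return scores


def select(sentences, scores, percent):
    """Keep the top-scoring share of sentences, returned in document order."""
    count = max(1, round(len(sentences) * percent / 100))
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(best)]
```

Changing a weight shifts the balance between features without touching the features themselves, which is why the weights are a natural setting to expose to the user.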

2.2 Our Proposed Method For Constructing Summaries

This section discusses our current implementation of a multi-document summarization system designed to produce summaries of scientific articles. To examine current multi-document summarization methods on scientific topics, articles have been extracted from arXiv, an open digital library [9]. In total we obtained 754,774 articles classified into 7 large groups: Physics, Mathematics, Computer Science, Statistics, Quantitative Biology, Quantitative Finance and Non-linear Science. For our experiments, we have used only the 19,937 computer science articles, and we performed the experiments using MEAD.

Figure 2.2: Process of Summarization

The process of summarization is done in four parts shown in the Figure 2.2.

• The first part is a preprocessing step that converts and cleans the documents before they are passed to MEAD. Each PDF article is converted into text format, and cleaning rules are then applied to remove pieces of text that are irrelevant for MEAD.

• The second part is MEAD-specific. It computes, for each sentence, the scores corresponding to the selected MEAD features.

• The third part extracts sentences according to their scores and records where each sentence originates. This step is carried out in MEAD.

• The fourth part, rephrasing the summaries, is a postprocessing step of our system that makes the summaries more readable.

Each of these steps is described in the following sections.

2.2.1 Document Conversion And Cleaning

Before cleaning, we convert each PDF into text format using ps2ascii, because MEAD does not support the PDF format.

As a result, the converted text contains fragments such as those shown in Figure 2.3. At this point, we give an overview of the problems that document cleaning must address and their solutions. A document cleaning approach should satisfy several requirements in our experiments.

The requirements defined in these experiments are:

• First, it should detect and remove pieces of text that are not grammatically meaningful. For instance, text should not contain the same letter repeated more than twice in a row, as in the examples in Figure 2.3.

• Second, it should detect and remove all mathematical symbols that appear together as formulas in text, tables or figures.

Figure 2.3: Example of Irrelevant Data

We have used different samples from different texts to build the document cleaning rules, as illustrated in Figure 2.3. Building the rules involves the following steps:

• Convert articles into individual sentences.

• Replace special characters with spaces.

• Remove sentences containing any of the following:

– references, figures, section titles, tables, acknowledgement

– lemmas, algorithms, equations, Greek letters

As a result, the rules are applied to each sentence and the output of this step is a text-only document. In addition, some irrelevant text is not removed by these rules; instead, we use a MEAD feature to discard it: if a sentence consists of fewer than 9 words, MEAD does not extract that sentence.
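The cleaning steps can be illustrated with a minimal Python sketch. It is not the rule set used in our system: the regular expressions, the list of structural words and the function name `clean_sentences` are assumptions chosen only to mirror the steps listed above (split into sentences, replace special characters with spaces, drop sentences that contain repeated letters, Greek letters or structural words).

```python
import re

# Hypothetical patterns: the exact rules of the thesis are not listed in the text.
REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")   # same letter 3+ times in a row
GREEK_LETTER = re.compile(r"[\u0370-\u03FF]")       # Greek and Coptic Unicode block
STRUCTURAL_WORD = re.compile(
    r"\b(references?|figures?|tables?|acknowledge?ments?|"
    r"lemmas?|algorithms?|equations?)\b",
    re.IGNORECASE,
)
SPECIAL_CHARS = re.compile(r"[^\w\s.,;:!?'()-]")    # anything that is not plain text


def clean_sentences(text):
    """Return the sentences of `text` that survive the cleaning rules."""
    kept = []
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Replace special characters with spaces, then collapse whitespace.
        sentence = re.sub(r"\s+", " ", SPECIAL_CHARS.sub(" ", sentence)).strip()
        if not sentence:
            continue
        if (REPEATED_LETTER.search(sentence)
                or GREEK_LETTER.search(sentence)
                or STRUCTURAL_WORD.search(sentence)):
            continue
        kept.append(sentence)
    return kept


sample = ("We propose a new ranking method. See Figure 3 for details. "
          "The value of α is small. xxxx yyyy zzzz. "
          "Results improve on all datasets!")
print(clean_sentences(sample))
```

In this sketch only the first and the last sentence of the sample are kept; short sentences are deliberately not filtered here, since that is left to the MEAD Length feature.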

2.2.2 Computation Of Scores For Sentences

For each sentence, MEAD computes several scores according to the features chosen by the user. These scores are then combined into the final score of the sentence. As mentioned earlier, we use four of MEAD's features. According to [8], the position of a sentence in the article, the number of words in the sentence and how many times a word appears in the article all indicate sentence importance. We now explain how MEAD computes the score for each feature.

Position Feature

The first feature, called Position, is the relative position of a sentence in a document, such that the first sentence gets the highest score. It is applied separately to each document. Algorithm 2 shows the calculation of the Position score. For instance, in Figure 2.8, the first article consists of 5 sentences. The position score of sentence 5 is 0.447214, as calculated in Equation 2.1.

$$\sqrt{\frac{1}{5}} = 0.447214 \qquad (2.1)$$

Algorithm 2 Position Feature
for each document do
    for each sentence do
        if sentence then
            Position ← sqrt(1 / position of sentence)
        else
            Position ← 0
        end if
    end for
end for
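As a small illustration, the Position score can be reproduced in a few lines of Python. This is only a sketch of the formula above, not MEAD's own code; the function name `position_scores` is hypothetical.

```python
import math


def position_scores(num_sentences):
    """Position score sqrt(1 / i) for sentences 1..num_sentences of one document."""
    return [math.sqrt(1.0 / i) for i in range(1, num_sentences + 1)]


scores = position_scores(5)
print(round(scores[0], 6))  # first sentence: 1.0
print(round(scores[4], 6))  # fifth sentence: 0.447214
```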

Centroid Feature

The second feature, called Centroid, measures the centrality of a sentence to the overall topic of a set of documents. A centroid is a group of words that statistically represents a set of documents. As such, a centroid can be used both to classify relevant documents and to identify salient sentences in a set of documents. A centroid value can be calculated for words or for sentences in a set of articles. The centroid value of a word is its TF*IDF value.

Figure 2.4: Example of Centroid Feature

Centroid words are those whose centroid values are above some threshold. In our system, we set the threshold to 3 (the default value proposed by MEAD). If there are not enough words whose TF*IDF values are above the threshold, MEAD takes the first 8 * (number of documents) words as centroid words (8 is the MEAD default).

After the centroid value of each word has been calculated, the centroid value of each sentence is computed as the sum of the centroid values of the centroid words in the sentence. MEAD then finds the sentence with the highest centroid value among all sentences; that sentence is returned as the Centroid sentence for the whole set of documents. The centroid score of a sentence is a normalized score, obtained by dividing its centroid value by the centroid value of the Centroid sentence. Figure 2.4 displays the centroid scores for sentences coming from different texts (identified by the number in column 1). The sentences are numbered by their order in the text they come from (the number in column 2). The different feature scores computed for each sentence in the set of documents are given in columns 3, 4 and 5.

The centroid algorithm is given in Algorithm 3. The following example, whose result is shown in Figure 2.4, clarifies the calculation of the centroid feature.

• In the first step, MEAD computes the TF*IDF value of each word in the set of documents: "1.txt", "2.txt", and "3.txt".

• In the second step, MEAD counts the documents and sets the required number of centroid words to 8 * 3 = 24. It then builds the centroid words of the set of documents by taking the words above the threshold until the desired number of centroid words is reached. The centroid words are given in Figure 2.5. In our example, some returned words score below the threshold of 3 and are included only to complete the required number of words, such as "Sample" (2.90242081166701) and "statistically" (2.73214560374501).

• In the third step, MEAD computes the centroid value of each sentence. We illustrate this with the second sentence of document "3.txt": "This paper compares two different ways of estimating statistical language models." In this sentence, the centroid words models, estimating, statistical and ways contribute to the centroid value of the sentence.

(models) 8.42047422093794
(estimating) 5.34274350296071
(statistical) 3.28711826697865
(ways) 2.4411260596956

8.42047422093794 + 5.34274350296071 + 3.28711826697865 + 2.4411260596956 = 19.491462050573

• In the next step, MEAD finds the Centroid Sentence of the whole set of documents. In our example, the second sentence of document "1.txt" is the Centroid Sentence (the sentence with the highest centroid value among all sentences).

The Centroid Sentence score is: 37.1032329841665

• In the last step, the final score of each sentence is normalized. For our example sentence, the normalized centroid score is:

19.491462050573 / 37.1032329841665 = 0.525330557013477

• As shown in Figure 2.4, the Centroid Sentence of the whole set of documents is the second sentence of "1.txt"; its normalized centroid score is 1.0000. Our example sentence is the second sentence of "3.txt"; its normalized centroid score is 0.525331.
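The arithmetic of this example can be checked with a short Python sketch. The TF*IDF values and the value of the Centroid Sentence are copied from the example; the function name `centroid_value` and the restriction of the dictionary to the four centroid words of the sentence are assumptions made for illustration (in MEAD the dictionary would hold all 24 centroid words).

```python
# TF*IDF values of the centroid words that occur in the example sentence.
centroid_words = {
    "models": 8.42047422093794,
    "estimating": 5.34274350296071,
    "statistical": 3.28711826697865,
    "ways": 2.4411260596956,
}


def centroid_value(words, centroid):
    """Sum of the centroid values of the centroid words in a sentence."""
    return sum(centroid.get(word, 0.0) for word in words)


sentence = ("this paper compares two different ways of estimating "
            "statistical language models").split()
value = centroid_value(sentence, centroid_words)

# Centroid value of the Centroid Sentence (second sentence of "1.txt").
max_value = 37.1032329841665
score = value / max_value

print(round(value, 6))  # 19.491462
print(round(score, 6))  # 0.525331
```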

Figure 2.5: List of Centroid Words

Keywords feature

The next feature is the keyword-based feature, which boosts the scores of sentences containing keywords of interest. The summarized text should preserve the key ideas described in the article bundle. This is achieved using the keywords associated with the bundle. Keywords represent the most important words within the articles of the bundle. The keywords of a bundle are extracted from the topic distributions of its constituent articles and the word distributions of those topics. We first explain how the keywords are obtained from the topic and word distributions produced by Latent Dirichlet Allocation (LDA).

LDA is a generative probabilistic model for documents. The basic idea behind LDA is that documents are composed of random mixtures of latent topics, where each topic is characterized by a distribution over words. LDA is based on the following assumptions: the word distribution is a multinomial distribution, the topic distribution is a multinomial distribution, the topic weight distribution is a Dirichlet distribution, and the word distribution per topic is a Dirichlet distribution.

The mathematical model behind LDA is as follows. Let P(d) be the probability of choosing a document d, P(t|d) the conditional probability of choosing a topic t given the document d, and P(w|t) the probability of choosing the word w given the topic t. The joint probability distribution of the observed variables (d, w) is P(d,w) = P(d)P(w|d). By Bayes' rule,

$$P(d,w) = \sum_{t} P(t)\,P(w \mid t)\,P(d \mid t)$$

In our work, for each bundle, the topic distribution vectors of the member articles are summed, and each component of the resulting vector is multiplied by the word distribution of the corresponding topic. Let us illustrate this with a very simple example.

Consider a bundle

bundle = <article 1, article 2>

topic distribution of article 1 = <topic 1: 0.6, topic 2: 0.3, topic 3: 0.1>
topic distribution of article 2 = <topic 1: 0.3, topic 2: 0.5, topic 3: 0.2>

The resulting topic distribution vector for the bundle = <topic 1: 0.9, topic 2: 0.8, topic 3: 0.3>

The word distribution of topic 1 = <word11: 0.7, word12: 0.3>
The word distribution of topic 2 = <word21: 0.5, word22: 0.5>
The word distribution of topic 3 = <word31: 0.2, word32: 0.8>

topic dist * word dist = <word11: 0.63, word12: 0.27, word21: 0.4, word22: 0.4, word31: 0.06, word32: 0.24>

If we select the top 4 words as keywords, we obtain:

<word11: 0.63, word21: 0.4, word22: 0.4, word12: 0.27>
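The keyword computation of this toy example can be sketched in Python as follows. The function name `bundle_keywords` and the dictionary layout are assumptions made for illustration; in our system the topic and word distributions come from LDA rather than being written by hand.

```python
def bundle_keywords(article_topics, topic_words, top_k):
    """Rank words by (summed topic weight of the bundle) * (word probability in topic)."""
    # Sum the topic distributions of the articles in the bundle.
    bundle = {}
    for distribution in article_topics:
        for topic, weight in distribution.items():
            bundle[topic] = bundle.get(topic, 0.0) + weight

    # Weight every word of every topic by the bundle weight of that topic.
    weighted = {}
    for topic, words in topic_words.items():
        for word, probability in words.items():
            weighted[word] = weighted.get(word, 0.0) + bundle.get(topic, 0.0) * probability

    # Highest weight first; ties keep their original order (sorted is stable).
    ranked = sorted(weighted.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


article_topics = [
    {"topic 1": 0.6, "topic 2": 0.3, "topic 3": 0.1},
    {"topic 1": 0.3, "topic 2": 0.5, "topic 3": 0.2},
]
topic_words = {
    "topic 1": {"word11": 0.7, "word12": 0.3},
    "topic 2": {"word21": 0.5, "word22": 0.5},
    "topic 3": {"word31": 0.2, "word32": 0.8},
}

keywords = bundle_keywords(article_topics, topic_words, top_k=4)
print([(word, round(weight, 2)) for word, weight in keywords])
```

With these distributions the ranking places word11 first, followed by word21 and word22 (tied at 0.4) and then word12.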

As an example, the keywords obtained for a bundle of articles about statistics and machine learning are: "training", "classification", "prediction", "learning". The keyword feature score is calculated as follows: for each sentence, if one of its words matches a keyword, MEAD assigns 1 to that sentence, and 0 otherwise.

Figures 2.6 and 2.7 illustrate the impact of the keyword feature on the resulting summaries. Figure 2.6 shows the summary produced by MEAD for a given bundle without the keyword feature. Figure 2.7 shows the summary when keywords are used. Note that the last sentence of the summary in Figure 2.6, which contains the keyword "classification", moves to a better place (fourth position) in the summary in Figure 2.7.

Figure 2.6: Example of Without Keyword Summary

Length Feature

The last feature, called Length, is the number of words in a sentence. Length is a cut-off feature with a threshold of 9.

$$\text{FinalScore} = \begin{cases} \text{Position} + \text{Centroid} + \text{Keyword(QueryPhrase)} & \text{if } 9 < \text{sentence length} \\ 0 & \text{otherwise} \end{cases}$$

Figure 2.7: Example of With Keyword Summary

For instance, in Figure 2.8, the first sentence of document "cs0008028.txt" has length 6, so its final score is 0. The fifth sentence of document "cs0008024.txt" has length 26, and its final score is calculated as follows:

Position score: 0.447214
Centroid score: 0.172739
Keyword score: 1.0000

0.447214 + 0.172739 + 1.0000 = 1.619953
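The combination of the feature scores with the Length cut-off can be written as a small Python sketch. It follows the FinalScore formula above with equal weights for the three features; the function name `final_score` is hypothetical, and the strict comparison with the threshold is taken from the formula.

```python
LENGTH_THRESHOLD = 9  # cut-off of the Length feature, in words


def final_score(position, centroid, keyword, sentence_length,
                threshold=LENGTH_THRESHOLD):
    """Sum of the feature scores, or 0 for sentences that fail the length cut-off."""
    if sentence_length > threshold:
        return position + centroid + keyword
    return 0.0


# Fifth sentence of "cs0008024.txt": 26 words.
print(round(final_score(0.447214, 0.172739, 1.0, 26), 6))  # 1.619953
# A sentence of 6 words is cut off regardless of its feature scores
# (the feature scores below are made up for the illustration).
print(final_score(1.0, 0.5, 0.0, 6))                        # 0.0
```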

The result is a table in which the different scores are grouped by article. Figure 2.8 is an example of such a table. Note that the keyword feature has a strong impact on the final score: every sentence in the documents has a Position and a Centroid score even when its keyword score is 0, whereas a keyword score of 1 raises the sentence sharply in the ranking.

Figure 2.8: Example of MEAD Score

Cosine similarity is used in the Novelty Track at TREC 2002 [39] to compute sentence similarity in a novelty re-ranker. The author noticed that human judges often pick clusters of sentences, whereas MEAD normally does not consider the spatial relationships between sentences within a document. She added a new characteristic that boosts a sentence's score slightly if the previous sentence had a relatively high score. This calculation continues until every sentence in the set has been seen. Before this change, the default re-ranker in MEAD used the cosine similarity between the sentences already selected for the summary and the new sentence under consideration. The similarity between two sentences is measured with the formula given below. We use the novelty re-ranker after the feature calculation step.

$$\mathrm{CoSim}(s_1, s_2) = \cos(\theta) = \frac{s_1 \cdot s_2}{|s_1|\,|s_2|}$$
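For illustration, the cosine similarity of two sentences can be computed over simple word-count vectors, as in the following Python sketch. Representing a sentence by raw, lower-cased word counts is an assumption of this sketch; MEAD may weight the terms differently (for example with IDF).

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between the word-count vectors of two sentences."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("language models", "language models"))    # identical sentences
print(cosine_similarity("language models", "graph theory"))       # no shared words
print(cosine_similarity("statistical models", "language models")) # one shared word
```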

2.2.3 Sentence Extraction

The percentage of sentences to extract is a parameter of MEAD. We produce summaries containing 20 percent of the sentences of the set of documents in a bundle. For example, if a bundle consists of 100 sentences, 20 sentences are extracted as the summary.

While extracting sentences, MEAD preserves

• the order of documents in the bundle.

• the order of sentences in each document.

Figure 2.9 displays sentences coming from different texts (identified by the number in column 1). The sentences are numbered by their order in the text they come from (the number in column 2), and the final score of each sentence in the set of texts is given in column 3. Even if the highest-scoring sentence comes from article 1, the first extracted sentence comes from article 3 because of the order of the articles.
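The extraction step can be sketched in Python as follows. The sketch assumes that the sentences with the highest final scores are selected first and then emitted in bundle order; the tuple layout, the rounding of the sentence count and the name `extract_summary` are hypothetical, and the scores are made up for the illustration.

```python
def extract_summary(scored_sentences, percent=20):
    """scored_sentences: (document, sentence_number, final_score) tuples in bundle order."""
    # Number of sentences to keep (integer arithmetic, at least one sentence).
    count = max(1, len(scored_sentences) * percent // 100)

    # Remember the order in which the documents appear in the bundle.
    doc_order = {}
    for document, _, _ in scored_sentences:
        doc_order.setdefault(document, len(doc_order))

    # Pick the highest-scoring sentences ...
    top = sorted(scored_sentences, key=lambda s: s[2], reverse=True)[:count]
    # ... and restore document order, then sentence order within a document.
    return sorted(top, key=lambda s: (doc_order[s[0]], s[1]))


bundle = [
    ("3.txt", 1, 0.9), ("3.txt", 2, 0.2), ("3.txt", 3, 0.1), ("3.txt", 4, 0.3), ("3.txt", 5, 0.2),
    ("1.txt", 1, 0.4), ("1.txt", 2, 1.5), ("1.txt", 3, 0.1), ("1.txt", 4, 0.2), ("1.txt", 5, 0.3),
]
summary = extract_summary(bundle)
print(summary)
```

In this example the best sentence belongs to "1.txt", but the sentence from "3.txt" is emitted first because "3.txt" comes first in the bundle.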

Figure 2.9: Example of Sentence Extraction

2.2.4 Summary Rephrasing

Sentences often contain adverbial clauses, which lose their referents when extracted out of context. This is the case in both single-document and multi-document summarization. We have designed a postprocessing step to handle this problem. Another aim of this step is to give the authors' names and the publication title of every document in a bundle. To do this, MEAD keeps track of where sentences come from. We use this information to rephrase sentences and enhance the readability of the summary.

We start from the sentences that have already been extracted. We place the authors' names and the title before the first sentence of each part of the summary. In addition, we make some replacements. If a word such as "we" refers to the authors and the paper has one author, we do not know whether the author is male or female, so we replace the word with "author". If the paper has two or more authors, we replace it with "authors". In the following step, an adverbial clause is replaced with a suitable word that conveys the importance of the sentence. This replacement does not make the sentence ungrammatical. The rephrasing rule checks the case of adverbial clauses: if an adverbial clause starts with an upper-case letter and is followed by a comma, we remove it from the sentence or replace it with a suitable word that shows the importance of the sentence.

As shown in Figures 2.10 and 2.11, we add the title and authors of each article before the sentences extracted from it. The rephrasing method is also used to classify the connections between sentences that appear in a summary. Our experiments show that the proposed approach improves both the quality and the fluency of the summaries. The list of replaced words is given in the Appendix.
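A rough Python sketch of the two rephrasing rules is given below. It is not our implementation: the list of leading connectives is hypothetical (the actual list of replaced words is in the Appendix), the article "the" in front of "author(s)" is an assumption, and the sketch does not adjust verb agreement, so a single-author paper would yield "the author propose".

```python
import re

# Hypothetical examples of leading adverbials; the real list is in the Appendix.
LEADING_CONNECTIVES = ("However", "Moreover", "Furthermore", "Therefore", "In addition")


def rephrase(sentence, num_authors):
    """Drop a leading 'Connective,' and replace 'we' by 'the author(s)'."""
    subject = "author" if num_authors == 1 else "authors"

    # Rule: an adverbial that starts with upper case and is followed by a comma is removed.
    for connective in LEADING_CONNECTIVES:
        prefix = connective + ","
        if sentence.startswith(prefix):
            sentence = sentence[len(prefix):].lstrip()
            sentence = sentence[:1].upper() + sentence[1:]
            break

    # Rule: 'we' refers to the authors of the paper.
    sentence = re.sub(r"\bWe\b", "The " + subject, sentence)
    sentence = re.sub(r"\bwe\b", "the " + subject, sentence)
    return sentence


print(rephrase("However, we propose a new model.", 2))
print(rephrase("In this paper we study graphs.", 3))
```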

Figure 2.10: Example Text Before Rephrasing

Figure 2.11: Example Text After Rephrasing

Algorithm 3 Centroid Feature for Sentence
Input: An array S of n sentences, centroid threshold t
Output: An array C of centroid scores of the sentences

Count = 0
Maximum Score = 0
Compute the tf*idf score for each word
for i ← 1, n do
    for each word w of S[i] do
        tfidf(w) = tf(w) * idf(w)
    end for
end for
Construct the centroid words of the set of documents by taking the words that are above the threshold
for each word w of tfidf(w) do
    if tfidf(w) > t or Count < 8 * (number of documents) then
        tfidf(Centroid)(w) = tfidf(w)
        Count++
    else
        tfidf(Centroid)(w) = 0
    end if
end for
Compute the score for each sentence
for i ← 1, n do
    C[i] = 0
    for each word w of S[i] do
        C[i] = C[i] + tfidf(Centroid)(w)
    end for
end for
Compute the Centroid sentence of the documents
for i ← 1, n do
    if C[i] > Maximum Score then
        Maximum Score = C[i]
    else
        continue
    end if
end for
for i ← 1, n do
    Final Score C[i] ← C[i] / Maximum Score
end for

Chapter 3

Experiments And Improvements Of The Method

Our goal is to analyse the summarization of scientific publications using MEAD summarization methods and to try to improve the summarization results. To do this, we evaluate automatically generated summaries using Mturk. To evaluate the impact of the keyword feature, we conducted two categories of experiments: an independent evaluation, in which we check the quality of the summaries, and a comparative evaluation, in which we compare summaries obtained with the keyword feature to those obtained without it.

3.1 Experimental Protocol Based On Amazon Mechanical Turk

Even though technology changes every day, human beings can still do some tasks much better than computers, for example identifying objects in a photo or video, or researching data details. Mturk is a crowdsourcing marketplace for tasks that require human intelligence. A person who needs work done, the requester, sets up a HIT (human intelligence task), which is a small task. A worker completes this simple task in exchange for a small payment. Each worker can see thousands of independent tasks each day on the Mturk web page [12]. This page shows how much each task pays and how long it will take. Once a worker is done, the requester can review the results and accept or reject them; requesters pay only for accepted work. If special skills are required to complete a task, the requester can require that workers pass a qualification test before they are allowed to work on the HITs. There are different kinds of HITs, such as comparing given documents, translating from one language to another, identifying duplicate entries, verifying item details, and finding specific fields or data in large documents. We chose Mturk for our experiments because evaluating the quality of summaries is much simpler for humans than for machines.

3.2 Experimental Setting

3.2.1 Qualification Test

The biggest challenge in using Mturk is deciding whether a particular worker's answer is correct; in other words, checking whether the worker's background knowledge is sufficient for the given task. We designed a qualification test to measure workers' knowledge before they work on our experiments. As shown in Figure 3.1, our qualification test asks workers to identify a concept from an image.

The reason for choosing an image-based question is that workers cannot find the answer simply by searching for the question text on the web; background knowledge is needed to answer the question. Moreover, each worker is given only one attempt at the qualification test, and the test must be completed within five minutes.

The qualification test for the graph theory topic is given in Figures 3.1 and 3.2: we ask workers to identify a well-known Hamiltonian path in an undirected graph from a picture. We prepared a different qualification test for each topic selected for the experiment. Another qualification test is shown in the Appendix.

Figure 3.1: Qualification Test Example

Figure 3.2: Example of Qualification Test Image

3.2.2 Amazon Mechanical Turk Setting

We use an Mturk survey questionnaire in which workers answer the given questions. For each topic we use 15 workers, and we pay $0.30 per task. We used only the computer science articles. Among these articles we selected three topics: Graph Theory, Statistics and Machine Learning, and Information Retrieval.

To check the quality of the summaries, we use the independent evaluation. We give the workers the summaries and links to the complete articles and ask "Does the given summary make sense with respect to the articles". The question is shown in Figure 3.3. We expect "yes" if the summary reflects the content of the articles well, and "no" if it does not. Moreover, we require some feedback before we accept an answer.

Figure 3.3: Independent evaluation question

In the comparative evaluation, we give the workers two summaries and links to the full articles. One summary is produced by our tool when MEAD is used without the keyword feature; the other is produced with the keyword feature. We ask the workers "Which summary reflects better the content of the given articles". For these questions we provide "summary 1" and "summary 2" buttons instead of "yes" and "no" buttons. Workers select one summary according to their preference and give some feedback. The interface for the comparative evaluation is shown in Figure 3.4.

Figure 3.4: Comparative evaluation question

3.3 Experimental Results

Our experimental results are the averages over the results for the three topics. They are summarized in Figures 3.5 and 3.6.

Independent evaluation results:
Statistics –> 53.3%
Machine Learning –> 60%
Information Retrieval –> 80%
Overall ratio: 64.4%

Comparative evaluation results:
Statistics –> 53.3%
Machine Learning –> 60%
Information Retrieval –> 53.3%
Overall ratio: 64.4%

The independent evaluation shows that the summaries are not good. We obtain similar results in the comparative evaluation: the quality of the summaries with and without keywords is not significantly different.

Workers give some feedback while answering the questions. We use this information to improve the quality of the summaries. Some comments from workers are:

• The summary has a lot more information as compared to the respective articles. The summary has many differences with the articles.

• The summary is not accurate and is not good.

• It was a good decent job. Summary is not good enough.

• as both of them are somehow same so it is not possible to say which one is better so answer of this is no.

Figure 3.5: Independent Evaluation Result

3.4 Proposed Improvements

Taking advantage of the comments, we believe that we can get better results. In particular, we see that the articles are very long and that the summaries do not reflect their content. For the independent evaluation, we therefore summarize only the title and abstract of each article in order to improve the test result. At the end of the experiment, we obtain a better result for the independent evaluation. The question is: "We gave a set of articles and a piece of text. We ask to evaluate if the given text summarizes well the scientific content of the given set of articles and does the summary reflect well the scientific content of articles". We expect a yes or no answer with comments.

Figure 3.6: Comparative Evaluation Result

To improve the comparative evaluation, we use 15 keywords instead of 10, and again only the title and abstract are given as text. We also obtain a better result for the comparative evaluation. The results are shown in Figures 3.7 and 3.8. Further examples of both the independent and the comparative evaluation are given in the Appendix.

Independent evaluation results:
Statistics –> 80%
Machine Learning –> 80%
Information Retrieval –> 86.6%
Overall ratio: 77.7%

Comparative evaluation results:
Statistics –> 66.6%
Machine Learning –> 86.6%
Information Retrieval –> 80%
Overall ratio: 82%

The improved results show that using only the title and abstract gives much better results than the first experiment. As mentioned, the compression ratio, that is, the size of the summary relative to the size of the document set, is an important point for multi-document summarization. According to [40], 20% or 30% of the source provides a reasonable input set for the summary of news texts of 10 to 20 sentences. Concepts in the sentences are not taken completely out of context, and the extracted sentences are still thematically connected. In scientific articles, on the other hand, the compression rates have to be much higher. For instance, abbreviating a 10-page scientific article to a half-page summary requires compression to 5% of the original. At this point, sentence selection becomes problematic because it is context-insensitive: if only one sentence per page is selected, all information about the context of the extracted sentences is lost. Furthermore, abstracts are highly beneficial in several information acquisition tasks. As mentioned in [10][41], abstracts have several advantages: they promote current awareness, save reading time, facilitate selection, and improve indexing efficiency. In addition, increasing the number of keywords also improves the result of the comparative evaluation. As mentioned in the related work section, Luhn derived a significance factor that reflects the number of occurrences of significant words within a sentence.

Figure 3.7: Independent Evaluation Result

Figure 3.8: Comparative Evaluation Result

Chapter 4

Pattern Summarization

The Nokia dataset consists of data collected from the smartphones of 38 participants over the course of more than one year. For each user, all records from different sensors, such as application usage and GPS, are available. The dataset also includes answers to a questionnaire of 17 questions answered by some users in the experiment. Demographic attributes such as gender, age group and profession come from this questionnaire. Application usage records consist of an application id and a timestamp of when the application was used. After removing some system applications, we ended up with 170 applications.

Figure 4.1: Taxonomy for Application Attributes

Figure 4.2: Taxonomy for Demographic Information Attributes

There are different kinds of attributes in the dataset, half of which are dependent and the other half independent. A dependent attribute does not convey any meaning alone and must be attached to an independent attribute. Time attributes are dependent usage attributes. For instance, Morning alone does not have any meaning, whereas Web-Morning means that the user has used the application Web in the morning. The two independent attributes in the Nokia dataset are application usage and demographic information. For each attribute, we created a taxonomy. Figures 4.1 and 4.2 show the taxonomies for application usage and demographic information. For instance, in Figure 4.1, common applications like Calendar are children of desktop applications. In Figure 4.2, working full-time is a child of working, and working itself is a child of social activity.

4.1 Our Proposed Method For Generating Sentences

The inputs are patterns that have been automatically discovered by a pattern mining algorithm. A pattern is a set of attributes that are frequently encountered together in the dataset. These attributes can be categorized into different classes, as mentioned above.

• The first class groups attributes related to user information, such as "Studying full time" or "Is Male".

• The second class groups attributes related to the application used, such as "Messengers", "Web" or "Contacts".

• The third class groups attributes related to the period of time in which the application is used, such as "Morning", "Afternoon" or "Night".

Before describing sentence generation, we look at the following example patterns and the sentences generated for them. The first example pattern consists of user information, the applications used, and the periods of time in which they are used. In contrast, the second example pattern does not contain user information.

Example Pattern 1: Studying full time, Is Male, Carousel_Morning, Contacts_Night, Carousel_Afternoon, Web_Night, Bluetooth_Weekend, Messengers_Weekend, Web_Noon, Contacts_Morning, Messengers_Noon, Messengers_Night, Messengers_Weekday, Contacts_Afternoon, Contacts_Noon, Contacts_Weekend, Contacts_Weekday, Bluetooth_Weekday, Messengers_Afternoon

Example Sentence 1: Males who study full time use Contacts and Bluetooth at any time, Carousel in the morning and afternoon, Web at noon and night and never use Messengers in the morning.

Example Pattern 2: ActiveSearch_Weekday, ActiveSearch_Weekend, Bluetooth_Weekend, Bluetooth_Weekday, Contacts_Morning, Web_Weekday, Web_Noon, Contacts_Night, Web_Morning, Web_Weekend, Web_Afternoon

Example Sentence 2: People use ActiveSearch and Bluetooth at any time, Contacts in the morning and at night, and never use Web at night.

Our sentence generation process can be summarised as follows:

• The first step splits the pattern into user information and application-with-time information.

• The second step combines the time information for each application and replaces it with more meaningful time information, which shortens the sentence and makes it more informative.

• The third step groups applications that share the same times and orders the times for each application.

• The fourth step adds prepositions.

• The last step adds punctuation.

We explain sentence generation for example pattern 1 step by step as follows:

input : pattern

Example pattern: Studying full time, Is Male, Carousel_Morning, Contacts_Night, Carousel_Afternoon, Web_Night, Bluetooth_Weekend, Messengers_Weekend, Web_Noon, Contacts_Morning, Messengers_Noon, Messengers_Night, Messengers_Weekday, Contacts_Afternoon, Contacts_Noon, Contacts_Weekend, Contacts_Weekday, Bluetooth_Weekday, Messengers_Afternoon

output: meaningful sentence

Example sentence: Males who study full time use Contacts and Bluetooth at any time, Carousel in the morning and afternoon, Web at noon and night and never use Messengers in the morning.

Figure 4.3: Step 1 for Sentence Generation

Step 1: Split the user information and the application-with-time information into separate lists. In our example,

User info list = "Studying full time", "Is Male"

Time and Application list = Carousel_Morning, Contacts_Night, Carousel_Afternoon, Web_Night, Bluetooth_Weekend, Messengers_Weekend, Web_Noon, Contacts_Morning, Messengers_Noon, Messengers_Night, Messengers_Weekday, Contacts_Afternoon, Contacts_Noon, Contacts_Weekend, Contacts_Weekday, Bluetooth_Weekday, Messengers_Afternoon

Figure 4.3 shows the first step of sentence generation. According to the rule, we obtain "Studying full time" and "Is Male" as user information; therefore, the sentence starts with "Males who is studying full time".
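Step 1 can be sketched in Python as follows. The sketch assumes that application-with-time attributes are exactly those containing an underscore (as in "Carousel_Morning") and that user attributes never do; the function name `split_pattern` is hypothetical, and only a shortened pattern is used.

```python
def split_pattern(pattern):
    """Split a pattern into user information and (application, time) pairs."""
    user_info, app_times = [], []
    for attribute in pattern:
        # rpartition returns an empty separator when there is no underscore.
        application, separator, time = attribute.rpartition("_")
        if separator:
            app_times.append((application, time))
        else:
            user_info.append(attribute)
    return user_info, app_times


pattern = ["Studying full time", "Is Male",
           "Carousel_Morning", "Contacts_Night", "Web_Noon"]
user_info, app_times = split_pattern(pattern)
print(user_info)  # ['Studying full time', 'Is Male']
print(app_times)  # [('Carousel', 'Morning'), ('Contacts', 'Night'), ('Web', 'Noon')]
```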

Step 2: Combine the time information for each application and replace it with meaningful time information. We create a dictionary entry for each application. In our example,

Application list = Carousel, Messengers, Bluetooth, Web, Contacts

For Carousel, time list = Morning, Afternoon
For Messengers, time list = Afternoon, Noon, Night, Weekday, Weekend
For Bluetooth, time list = Weekday, Weekend
For Web, time list = Noon, Night
For Contacts, time list = Morning, Afternoon, Noon, Night, Weekday, Weekend (6 different application times)

In dictionary = (Carousel: [Morning, Afternoon], Messengers: [Afternoon, Noon, Night, Weekday, Weekend], Bluetooth: [Weekday, Weekend], Web: [Noon, Night], Contacts: [Morning, Afternoon, Noon, Night, Weekday, Weekend])

The application Contacts has all the different time values; therefore, instead of writing all of them, we replace them with "at anytime". The application Messengers has 5 different time values, which means that only one of the 6 possible values is missing, namely "Morning"; so, instead of writing all the time values, we replace them with not Morning (!Morning). For the Bluetooth application, there are two time values, Weekday and Weekend. This means that the user does not use Bluetooth at any particular time, because weekend and weekday together cover all the time in a week. Therefore, we replace these time values with "at anytime". For the Web application, there is no special case that changes the time information.

As shown in Figure 4.4, we put all the time information in its new form into the meaningful time list.

meaningful time list = (Carousel: [Morning, Afternoon], Messengers: [!Morning], Bluetooth: [at anytime], Web: [Noon, Night], Contacts: [at anytime])
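The replacement rules of Step 2 can be sketched in Python as follows. The function name `meaningful_times` is hypothetical, and the behaviour for cases that the text does not describe (for example Weekday and Weekend combined with only some times of day) is an assumption: such lists are left unchanged.

```python
ALL_TIMES = ["Morning", "Noon", "Afternoon", "Night", "Weekday", "Weekend"]


def meaningful_times(times):
    """Compress a list of usage times according to the rules of Step 2."""
    present = set(times)
    # All six values, or weekday + weekend alone, cover the whole week.
    if present == set(ALL_TIMES) or present == {"Weekday", "Weekend"}:
        return ["at anytime"]
    # Exactly one value missing: express it as a negation.
    missing = [t for t in ALL_TIMES if t not in present]
    if len(missing) == 1:
        return ["!" + missing[0]]
    # No special case applies.
    return list(times)


usage = {
    "Carousel": ["Morning", "Afternoon"],
    "Messengers": ["Afternoon", "Noon", "Night", "Weekday", "Weekend"],
    "Bluetooth": ["Weekday", "Weekend"],
    "Web": ["Noon", "Night"],
    "Contacts": ["Morning", "Afternoon", "Noon", "Night", "Weekday", "Weekend"],
}
meaningful = {app: meaningful_times(times) for app, times in usage.items()}
print(meaningful)
```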

Step 3: Combine the applications that share the same time information and order the times for each application.

As Figure 4.5 shows, we combine the applications that share the same time information and list them together. In our example,

new meaningful list = (at anytime: [Contacts, Bluetooth], (Morning, Afternoon): Carousel, (Noon, Night): Web, !Morning: Messengers)
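The grouping in Step 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `group_by_time` is hypothetical, and the ordering rule ("at anytime" first, plain times next, negated times last) is inferred from the example rather than taken from a specification. Within a group, applications keep their input order, which may differ from the order printed in the example.

```python
def group_by_time(meaningful_times):
    """Invert {application: [times]} into {times: [applications]}."""
    groups = {}
    for app, times in meaningful_times.items():
        groups.setdefault(tuple(times), []).append(app)

    def rank(key):
        # Assumed ordering, inferred from the example.
        if key == ("at anytime",):
            return 0
        if any(t.startswith("!") for t in key):
            return 2
        return 1

    # sorted() is stable, so groups of equal rank keep their input order.
    return dict(sorted(groups.items(), key=lambda kv: rank(kv[0])))


meaningful_time_list = {
    "Carousel": ["Morning", "Afternoon"],
    "Messengers": ["!Morning"],
    "Bluetooth": ["at anytime"],
    "Web": ["Noon", "Night"],
    "Contacts": ["at anytime"],
}
for times, apps in group_by_time(meaningful_time_list).items():
    print(times, apps)
```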

Step 4: Add Preposition

As Figure 4.6 shows, we add prepositions.

new meaningful list = (at anytime: [Contacts, Bluetooth], (in the Morning, in the Afternoon): Carousel, (at Noon, at Night): Web, !Morning: Messengers)

The Carousel and Web applications each have two time values that take the same preposition; therefore, we add "and" between them during sentence generation. The "at anytime" entry covers two applications, so we also add "and" between them.

sentence = Males who is studying full time use Contacts and Bluetooth at anytime Carousel in the morning and in the afternoon Web at noon and at night and never use Messengers in the morning
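A minimal sketch of Step 4 is given below. It assumes the grouped list from Step 3 as input; the names `PREPOSITIONS`, `time_phrase`, `build_clauses` and `draft_sentence` are hypothetical. Only the four prepositions shown in the example are included, and any other time value falls back to the bare lowercase word, which is an assumption of this sketch. Placing "and" before the final clause is also inferred from the example.

```python
# Prepositions taken from the example; other time values are not covered.
PREPOSITIONS = {
    "Morning": "in the",
    "Afternoon": "in the",
    "Noon": "at",
    "Night": "at",
}


def time_phrase(time):
    """Turn a time value into a phrase such as 'in the morning'."""
    if time == "at anytime":
        return time
    prep = PREPOSITIONS.get(time)
    word = time.lower()
    return f"{prep} {word}" if prep else word  # fallback is an assumption


def build_clauses(groups):
    """Build one clause per time group, joining items with 'and'."""
    positive, negative = [], []
    for times, apps in groups.items():
        names = " and ".join(apps)
        if times[0].startswith("!"):
            phrases = " and ".join(time_phrase(t[1:]) for t in times)
            negative.append(f"never use {names} {phrases}")
        else:
            phrases = " and ".join(time_phrase(t) for t in times)
            positive.append(f"{names} {phrases}")
    clauses = positive + negative
    if positive:
        clauses[0] = "use " + clauses[0]
    if len(clauses) > 1:
        clauses[-1] = "and " + clauses[-1]
    return clauses


def draft_sentence(subject, groups):
    """Concatenate subject and clauses, without punctuation yet."""
    return " ".join([subject] + build_clauses(groups))


groups = {
    ("at anytime",): ["Contacts", "Bluetooth"],
    ("Morning", "Afternoon"): ["Carousel"],
    ("Noon", "Night"): ["Web"],
    ("!Morning",): ["Messengers"],
}
print(draft_sentence("Males who is studying full time", groups))
```

The subject string is copied unchanged from Step 1, so the sketch reproduces the unpunctuated sentence of the example.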

Step 5: Add Punctuation

As Figure 4.7 shows, we add punctuation after each piece of time information, before the next application is added to the sentence.

sentence = Males who is studying full time use Contacts and Bluetooth at anytime, Carousel in the morning and in the afternoon, Web at noon and at night, and never use Messengers in the morning daily.
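The punctuation step can be sketched in a few lines of Python. The function name `punctuate` and the list-of-clauses input are assumptions for illustration. The trailing word "daily" in the example is not explained in this section, so the sketch treats it as an optional, hypothetical `suffix` argument.

```python
def punctuate(subject, clauses, suffix=""):
    """Join clauses with commas and close the sentence with a period."""
    sentence = f"{subject} {', '.join(clauses)}"
    if suffix:
        # Hypothetical: appended verbatim, e.g. a frequency word.
        sentence += " " + suffix
    return sentence + "."


clauses = [
    "use Contacts and Bluetooth at anytime",
    "Carousel in the morning and in the afternoon",
    "Web at noon and at night",
    "and never use Messengers in the morning",
]
print(punctuate("Males who is studying full time", clauses, suffix="daily"))
```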

The pattern summarization part gives an idea of the abstractive summarization method, which may reuse phrases or clauses from a set of related documents in a meaningful way. We try to confirm that the generated sentences meet grammatical expectations. Unfortunately, we could not find a way to evaluate the generated sentences and support the claim that our generation is appropriate. This is left as future work.


Chapter 5

Conclusion and Future Work

In this chapter, we conclude and present future work.

5.1 Conclusion

Since the 1960s, the majority of summarization systems have relied on sentence extraction. Multi-document summarization was introduced as a problem in the 1990s. Today, multi-document summarization is a landmark in the progression of summarization research and is taking the place of single-document summarization. There is still a long way to go in this field.

Over time, both abstractive and extractive approaches have been attempted. Abstractive summarization relies heavily on the adaptation of internal tools and machinery for language generation. Such summaries are difficult to replicate and to extend to other domains. On the other hand, simple extraction of sentences has produced satisfactory results in multi-document summarization. The recent popularity of effective multi-document summarization systems confirms this claim.

In this report, we have presented our multi-document summarization system, which is designed to produce summaries for bundles of scientific articles. The well-known summarization tool MEAD is integrated into our system. This report emphasizes extractive approaches to summarization using statistical methods. Since a lot of interesting research is being done in this field, we have chosen to include a brief summary of some methods that we found relevant to future research, even if they focus only on small details of the general summarization process.

Our experiments based on Mechanical Turk give promising results. The results show that the abstract and the title reflect the general content of a text, and that a short document gives better results than a long text. Keyword-based summarization has a strong impact on the quality of the summaries.

In the second part of the report, we have explained sentence generation for patterns extracted from data by automatic pattern mining techniques. We exploit categories of attributes to guide our generation process.

5.2 Future Work

The future aims of this study are the following:

• We have to work on how to improve the quality of summaries for full articles.

• We have to study the effect of other summarization techniques, such as sentence planning, that we could integrate into our system to improve summaries. We may determine whether a sentence reflects the authors' own work, the aim of the document, or related work. The rhetorical status of a sentence is determined on the basis of the global context of the paper [40][42]. We may combine this approach with the selected features and weight sentences according to their rhetorical status.

• We have to find a new approach to evaluate the quality of the generated sentences.


Appendices

Nouns or Adverbs or Prepositions                     Replace word
First, Firstly, Foremost, First of all, First off
Second, Secondly
Third, Thirdly
We                                                   Authors
Also, Furthermore, Besides, Likewise
As a result of, Thanks to, For the reason that,
Case history                                         one important point
Case in point                                        one important point
For instance                                         one important point
Kind of thing                                        one important point
After all,
For all that
All the same, Anyhow
But, Despite, howbeit
In spite of, Nonetheless, Notwithstanding
On the other hand
Per contra
Though, Without regard to
Both

Table .1: Summary Rephrased Word List

Figure .1: MTurk Qualification Test

Figure .2: Independent Evaluation example 1

Figure .3: Independent Evaluation example 2


Figure .4: Comparative Evaluation example 1

Figure .5: Comparative Evaluation example 2

Figure .6: SUNFLOWER Automatic summary example


Bibliography

[1] Noemie Elhadad. User-sensitive text summarization. In AAAI, pages 987–988, 2004.

[2] Mehran Sahami and Timothy D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW, pages 377–386, 2006.

[3] Günes Erkan and Dragomir R. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. CoRR, abs/1109.2128, 2011.

[4] David M. Zajic, Bonnie J. Dorr, and Jimmy J. Lin. Single-document and multi-document summarization techniques for email threads using sentence compression. Inf. Process. Manage., 44(4):1600–1610, 2008.

[5] H.P. Luhn. The automatic creation of literature abstracts. IBM Journal, 2:159–165, 1958.

[6] Dipanjan Das and André F. T. Martins. A survey on automatic text summarization, 2007.

[7] Jade Goldstein, Vibhu O. Mittal, Jaime G. Carbonell, and James P. Callan. Creating and evaluating multi-document sentence extract summaries. In CIKM, pages 165–172, 2000.

[8] Breck Baldwin and Thomas S. Morton. Dynamic coreference-based summarization. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP-3), 1998.

[9] http://www.arxiv.org. Arxiv, 2012.

[10] Ani Nenkova, Sameer Maskey, and Yang Liu. Automatic summarization. In ACL (Tutorial Abstracts), page 3, 2011.

[11] Sihem Amer-Yahia, Ruth Garcia, Aybuke Ozturk, and Shameem Ahamed Puthiya Parambath. Crowd sourcing literature review in sunflower. Technical report, 2012.

[12] https://www.mturk.com/mturk. Amazon mechanical turk, 2012.

[13] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.

[14] K. Ganapathiraju, Advisors Dr, Jaime Carbonell, and Dr Yiming Yang. Relevance of cluster size in mmr based summarizer: A report 11-742: Self-paced lab in information retrieval.

[15] Klaus Zechner. A literature survey on information extraction and text summarization, 1997.

[16] Elke Mittendorf and Peter Schäuble. Document and passage retrieval based on hidden markov models. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318–327, 1994.

[17] Hal Daumé, III and Daniel Marcu. Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 305–312, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

[18] Canasai Kruengkrai and Chuleerat Jaruskulchai. Generic text summarization using local and global properties of sentences. In Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, WI ’03, pages 201–, Washington, DC, USA, 2003. IEEE Computer Society.

[19] www.summarization.com/mead. Mead, 2012.

[20] Cyril Labbé and Dominique Labbé. Inter-textual distance and authorship attribution corneille and moliere. Journal of Quantitative Linguistics, 8(3):213–231, 2001.

[21] David M. Blei, Andrew Ng, and Michael Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.

[22] www.bloomsbury.com. Bloomsbury. Bloomsbury, 2012.

[23] Eduard Hovy and Chin-Yew Lin. Automated text summarization and the summarist system. In Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998, TIPSTER ’98, pages 197–214, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics.

[24] Manfred Stede, Heike Bieler, Stefanie Dipper, and Arthit Suriyawongkul. Summar: Combining linguistics and statistics for text summarization. In Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence August 29 – September 1, 2006, Riva del Garda, Italy, pages 827–828, Amsterdam, The Netherlands, 2006. IOS Press.

[25] Martin Hassel. Resource Lean and Portable Automatic Text Summarization. PhD thesis, KTH, Numerical Analysis and Computer Science, NADA, 2007. QC 20100712.


[26] Martin Hassel. Exploitation of named entities in automatic text summarization for swedish. In Proceedings of NODALIDA 03 - 14th Nordic Conference on Computational Linguistics, May 30-31 2003, 2003.

[27] Kaili Müürisep and Pilleriin Mutso. Estsum - estonian newspaper texts summarizer. In Proceedings of The Second Baltic Conference on Human Language Technologies, pages 311–316.

[28] Mohsin Ali, Monotosh Kumar Ghosh, and Abdullah-Al-Mamun. Multi-document text summarization: Simwithfirst based features and sentence co-selection based evaluation. In Proceedings of the 2009 International Conference on Future Computer and Communication, ICFCC ’09, pages 93–96, Washington, DC, USA, 2009. IEEE Computer Society.

[29] Ladda Suanmali, Naomie Salim, and Mohammed Salem Binwahlan. Fuzzy genetic semantic based text summarization. In DASC, pages 1184–1191, 2011.

[30] Gönenç Ercan. Automated text summarization and keyphrase extraction, 2006.

[31] Kathleen McKeown and Dragomir R. Radev. Generating summaries of multiple news articles. In SIGIR, pages 74–82, 1995.

[32] Aurélie Bertaux Eric Gaussier Aybuke Oztürk Marie-Christine Rousset Alexandre Termier Behrooz Omidvar Tehrani, Sihem Amer-Yahia. Interactive exploration of mobile phone usage patterns. Technical report, 2012.

[33] P. B. Baxendale. Machine-made index for technical literature: an experiment. IBM J. Res. Dev., 2(4):354–361, October 1958.

[34] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, April 1969.

[35] Okurowski M. E.-Gorlinsky J. Aone, C., I. Larsen, B. A trainable summarizer with knowledge acquired from robust nlp techniques. In Mani, and M. T. Maybury. Advances in automatic text summarization. pages 4–5, 1999.

[36] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’95, pages 68–73, New York, NY, USA, 1995. ACM.

[37] John Conroy and Dianne P. O’leary. Text summarization via hidden markov models and pivoted qr matrix decomposition. Technical report, In SIGIR, 2001.

[38] Ozge Yeloglu, Evangelos Milios, and Nur Zincir-heywood. Multi-document summarization of scientific corpora.

[39] Donna Harman. Overview of the trec 2002 novelty track. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), NIST Special Publication 500-251, pages 46–55, 2002.

[40] Simone Teufel and Marc Moens. Summarizing scientific articles - experiments with relevance and rhetorical status. Computational Linguistics, 28:2002, 2002.

[41] H. Borko and C. L. Bernier. A trainable document summarizer. In Abstracting Concepts and Methods, Academic, London, 1975. Academic Press.

[42] Maher Jaoua and Abdelmajid Ben Hamadou. Automatic text summarization of scientific articles based on classification of extract’s population. In CICLing, pages 623–634, 2003.

[43] Sihem Amer-Yahia and Ruth Garcia. Heuristics for task assignment in collaborative environments implemented in a simulator. Technical report, 2012.

[44] J. Steinberger and M. Krištan. Lsa-based multi-document summarization, 2007.

[45] Shameem Ahamed Puthiya Parambath. Topic extraction and bundling of related scientific articles. Technical report, 2012.

[46] Cory Janssen. Automatic summarization, @ONLINE, June 2012.
