
Saarland University
Faculty of Natural Sciences and Technology I

Department of Computer Science
Master’s Program in Computer Science

Master’s Thesis

Index-based Snippet Generation

submitted by

Gabriel Manolache

on June 9, 2008

Supervisor

Priv.-Doz. Dr. Holger Bast

Advisor

Priv.-Doz. Dr. Holger Bast

Reviewers

Priv.-Doz. Dr. Holger Bast

Prof. Dr. Gerhard Weikum


Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Gabriel Manolache
Saarbrücken, June 9, 2008

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Gabriel Manolache
Saarbrücken, June 9, 2008


Acknowledgments

I would like to express my most profound gratitude to Dr. Holger Bast for giving me the chance to work in his research group, on this fascinating project. I am indebted to him for his continuous advice, patience and willingness to discuss any questions that I have had. His professional enthusiasm and the expertise he shared with me will certainly be a tremendous resource for my later work.

I am grateful to both Dr. Holger Bast and Prof. Dr. Gerhard Weikum for reviewing my thesis. I would also like to thank Marjan Celikik for our fruitful collaboration.

Last, but not least, I would like to thank my family for their invaluable and continuous support throughout my studies.


Abstract

Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice, on the index side in order to compute the correct result set, and on the snippet generation side to generate the appropriate snippets; and (iii) in the worst case, the whole document needs to be scanned for occurrences of the query words, which is problematic for very long documents.

This thesis presents an alternative index-based approach that localizes snippets by information solely computed from the index, and that overcomes all three problems. We show how to achieve this at essentially no extra cost in query processing time, by a technique we call query rotation. We also show how the index-based approach allows the caching of individual segments instead of complete documents, which enables a significantly larger cache hit ratio as compared to the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Previous work
  1.3 Motivation of our work

2 Related work
  2.1 Snippets in research: the improvements proposed by Turpin et al.
    2.1.1 The CTS Snippet Engine
    2.1.2 Caching and sentence reordering
  2.2 Snippets in practice: Lucene

3 General information retrieval notions
  3.1 Positional inverted index
  3.2 Advanced search
    3.2.1 Query operators
    3.2.2 Non-literal matches

4 Index-based snippet generation
  4.1 Computing all matching positions
    4.1.1 Extended lists
    4.1.2 Query rotation
    4.1.3 Experimental comparison of extended lists and query rotation
  4.2 Computing snippet positions
  4.3 From snippet positions to snippet text
    4.3.1 Block representation
    4.3.2 Compression of the blocks
  4.4 Caching
  4.5 Integration with the CompleteSearch engine

5 Index-based versus Document-based Snippet Generation
  5.1 Non-literal matches
  5.2 Code duplication
  5.3 Large documents
  5.4 Caching

6 Experiments
  6.1 Datasets and Queries
  6.2 Space consumption
  6.3 Snippet generation time
  6.4 Caching

7 Conclusions and Future Work


Chapter 1

Introduction

1.1 Problem Statement

The usual way of interacting with an IR system is to enter a specific information need expressed as a query. As a result, the system will provide a ranked list of retrieved documents. For each of these documents, the user will be able to see the title and a few sentences from the document. These few sentences are called a snippet or excerpt, and their role is to help the user decide which of the retrieved documents are more likely to satisfy his information need. Ideally, it should be possible to make this decision without having to refer to the full document text.

Ranked result document lists with query-dependent document snippets, as shown in Figure 1.1, are now state of the art in text search. Some of the earlier web search engines started out with statically precomputed (hence query-independent) document summaries, but by now all major engines have converged to showing snippets centered around the keywords typed by the user. User studies have shown already quite a while ago that properly selected query-dependent snippets are superior to query-independent summaries concerning the speed, precision, and recall with which users can judge the relevance of a hit without actually having to follow the link to the full document [TS98, WJR03].

1.2 Previous work

A variety of methods for query-dependent snippet generation have been described in the literature and implemented in (open source) search engines.1 They will be discussed in Chapter 2. The methods differ in which of the potentially many snippets are extracted, and in how exactly documents are represented so that snippets can be extracted quickly. On a high level, however, they all follow the same principle two-step approach, which we call document-based:

(D0) For a given query, compute the ids of the top-ranking documents using a suitable precomputed index data structure.

(D1) For each of the top-ranked documents, fetch the (possibly compressed) document text, and extract a selection of segments best matching the given query.

1We can only guess how snippet generation is implemented in commercial products or in the big web search engines.
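For concreteness, the document-based two-step scheme can be sketched as follows (a minimal illustration; the in-memory document store, the window size, and the first-match selection policy are simplifying assumptions, not any engine's actual code):

```python
import re

def document_based_snippets(query_words, ranked_doc_ids, doc_store, window=10):
    """(D0) is assumed done: ranked_doc_ids are the ids of the top-ranked
    documents. (D1): fetch each document's text and extract a segment
    around an occurrence of a query word."""
    snippets = {}
    for doc_id in ranked_doc_ids:
        text = doc_store[doc_id]          # fetch (and possibly decompress)
        words = re.findall(r"\w+", text)
        # linear scan over the whole document for query-word occurrences
        hits = [i for i, w in enumerate(words) if w.lower() in query_words]
        if hits:
            i = hits[0]                   # simplest selection: first match
            lo, hi = max(0, i - window), i + window + 1
            snippets[doc_id] = " ".join(words[lo:hi])
    return snippets

docs = {1: "Snippet generation is typically done at query time by scanning documents."}
print(document_based_snippets({"snippet", "query"}, [1], docs))
```

The linear scan in (D1) is exactly the step that becomes problematic for very long documents, as discussed above.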


Figure 1.1: Three examples of query-dependent snippets. The first example shows a snippet (with two parts) for an ordinary keyword query; this can easily be produced by the document-based approach. The second example shows a snippet for a combined proximity / or query; in particular, the document-based approach needs to take special care here that only those segments from the document are displayed where the query words indeed occur close to each other. The third example shows a snippet for a semantic query with several non-literal matches of the query words, e.g., Tony Blair matching politician. Processing of this query involves a join of information from several documents, in which case the document-based approach is not able to identify matching segments based on the document text and the query alone. The index-based approach described in the thesis can deal with all three cases without additional effort. For ordinary queries it is at least as efficient as the document-based approach.

Note that we call the first step (D0) (instead of D1) because it is not really part of the snippet generation but rather a prerequisite. The same remark goes for (I0), the first step of our index-based approach, described below.

1.3 Motivation of our work

In this thesis, we investigate the following index-based approach to snippet generation, which to our knowledge has not been studied in the literature so far. The goal of this approach is to overcome the problems that affect the document-based approach. We will show that the index-based method is at least as efficient as the document-based approach for ordinary keyword queries, that it is superior when it comes to non-literal matches, advanced query operators, and large documents, and that it is without alternative for complex query languages involving joins. A more detailed comparison of the document-based and index-based approaches will be presented in Chapter 5. The index-based approach goes in four steps:

(I0) For a given query, compute the ids of the top-ranking documents. This is just like (D0).

(I1) In addition to the information from (I0), also compute, for each query word, the matching positions of that word in each of the top-ranking documents.

We will show how this step can be realized with negligible extra time, compared to (I0) and (I3), and for any ordinary positional index, without a need to change the index’s internal data structures. The latter can be a major issue from a systems engineering point of view, depending on the design of the system.

(I2) For each of the top-ranked documents, given the list of matching positions computed in (I1) and given a precomputed segmentation of the document, compute a list of positions to be output.

We will show that this step is a simple matter of a k-way merge / intersection, and that it takes negligible time compared to (I0) and (I3).
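The k-way merge behind (I2) might be sketched like this (the position lists are hypothetical, and the segmentation is assumed to be given as a sorted list of segment start positions):

```python
import heapq
from bisect import bisect_right

def segments_to_output(positions_per_word, segment_starts):
    """Merge the per-word sorted position lists (a k-way merge) and map each
    matching position to its segment via the precomputed segmentation."""
    matched_segments = set()
    for pos in heapq.merge(*positions_per_word):
        matched_segments.add(bisect_right(segment_starts, pos) - 1)
    return sorted(matched_segments)

# word 1 matches at positions 3 and 40; word 2 at positions 7 and 41;
# segments start at positions 0, 10, 20, 30
print(segments_to_output([[3, 40], [7, 41]], [0, 10, 20, 30]))  # [0, 3]
```

Since each position list is already sorted, the merge is linear in the total number of matching positions, which is why this step takes negligible time compared to (I0) and (I3).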

(I3) Given the list of positions computed in (I2), produce the actual text to be displayed to the user.

We will show how to preprocess (in particular: compress) the documents, so that this step can be done efficiently, accessing only those portions of the (possibly very large) document which actually contain the snippets to be output.

The index-based approach presented in this thesis is joint work with Dr. Holger Bast and Marjan Celikik. It led to a submission to the ACM 17th Conference on Information and Knowledge Management. My work was mainly concentrated on steps (I1) and (I2). They are going to be presented in more detail compared to the remaining parts of the method (step (I3) and snippet caching).


Chapter 2

Related work

The literature abounds with work on the general topic of document summarization. However, most of this work is concerned with query-independent summarization, which is not our topic here. For a recent list of references, see [VH06]. Note that all the big search engines have query-dependent snippets by now, in particular: Google, Yahoo Search, MSN Search, and AltaVista (which started with query-independent summaries).

Tombros and Sanderson [TS98] presented the first in-depth study showing that properly selected query-dependent snippets are superior to query-independent summaries with respect to the speed, precision, and recall with which users can judge the relevance of a hit without actually having to follow the link to the full document. In this work, we take the usefulness of query-dependent result snippets for granted.

Usually more segments match the query than can (and should) be displayed to the user, so that a selection has to be made. Questions of a proper such selection / ranking have been studied by various authors. For example, in a recent work, [VH06] have proposed an algorithm for computing segments that are as semantically related to each other as possible. Here, in this thesis, the focus is on efficiency and feasibility aspects. For the scoring and ranking of segments, we adopt the simple yet effective scheme from [TTHW07], described below in Figure 2.1.

Only recently, Turpin et al. [TTHW07] have presented a first in-depth study of the document-based approach with respect to efficiency (in time and space).

2.1 Snippets in research: the improvements proposed by Turpin et al.

The work presented in [TTHW07] follows the document-based approach. The authors propose and analyze a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. The experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and use caching for a faster snippet generation process. A solution for avoiding scanning whole documents is also proposed and analyzed: document reordering and compaction. The method increases the number of document cache hits but has an impact on snippet quality. This scheme doubles the number of documents that can fit in a fixed-size cache.


Input: A document broken into one sentence per line, and a sequence of query terms.
Output: Remove the number of sentences required from the heap to form the summary.
1 For each line of text, L = [w1, w2, ..., wm]:
2   Let h be 1 if L is a heading, 0 otherwise.
3   Let l be 2 if L is the first line of a document, 1 if it is the second line, 0 otherwise.
4   Let c be the number of wi that are query terms, counting repetitions.
5   Let d be the number of distinct query terms that match some wi.
6   Identify the longest contiguous run of query terms in L, say wj...wj+k.
7   Use a weighted combination of c, d, k, h and l to derive a score s.
8   Insert L into a max-heap using s as a key.

Figure 2.1: Simple sentence ranker that operates on raw text with one sentence per line

2.1.1 The CTS Snippet Engine

As we already explained, step (D1) must provide, for each of the top-ranked documents, a snippet: text that attempts to summarize the document and contains (at least some of the) query words. Similarly to previous work on summarization [Luh58], Turpin et al. identify the sentence as the minimal unit for extraction and presentation to the user. In order to construct a snippet, all sentences in a document are ranked against the query and the top two or three returned as a snippet. The scoring of sentences against queries has been explored in several papers [GKMC99, Luh58, SSJ01, TS98, WRJ02], with different features of the sentences deemed important. Based on these studies, Figure 2.1 presents the general algorithm for scoring sentences in relevant documents. The top-ranked sentences form the snippet. The final score of the sentence, assigned in step 7, can be computed in various ways. In order to avoid bias to any particular scoring mechanism, Turpin et al. compare the sentence quality using the individual components of the score, rather than an arbitrary combination of the components.
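The ranker of Figure 2.1 might be transcribed as the following sketch (the score weights are arbitrary assumptions, and heading detection is stubbed out, since the figure leaves both unspecified):

```python
import heapq

def rank_sentences(lines, query_terms, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Score each line as in Figure 2.1 and keep all lines in a max-heap."""
    heap = []
    for idx, line in enumerate(lines):
        words = line.lower().split()
        h = 0                                   # heading detection stubbed out
        l = 2 if idx == 0 else (1 if idx == 1 else 0)
        c = sum(1 for w in words if w in query_terms)       # with repetitions
        d = len(query_terms.intersection(words))            # distinct terms
        k = run = 0                 # longest contiguous run of query terms
        for w in words:
            run = run + 1 if w in query_terms else 0
            k = max(k, run)
        s = sum(wt * x for wt, x in zip(weights, (c, d, k, h, l)))
        heapq.heappush(heap, (-s, line))        # negate: heapq is a min-heap
    return heap

heap = rank_sentences(["the quick brown fox", "quick quick fox jumps"],
                      {"quick", "fox"})
print(heapq.heappop(heap)[1])   # best-scoring sentence
```

Popping the top two or three entries from the heap then yields the snippet, as described above.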

Since Web data is often poorly structured, poorly punctuated, and contains a lot of data that do not form part of valid sentences that would be candidates for parts of snippets, it is assumed that the documents passed to the Snippet Engine have all HTML tags and JavaScript removed, and that each document is reduced to a series of word tokens separated by non-word tokens. A word token is defined as a sequence of alphanumeric characters, while a non-word is a sequence of non-alphanumeric characters such as whitespace and the other punctuation symbols. Both are limited to a maximum of 50 characters. Adjacent, repeating characters are removed from the punctuation. Included in the punctuation set is a special end-of-sentence marker which replaces the usual three sentence terminators ”?!.”. Often these explicit punctuation characters are missing, and so HTML tags such as <br> and <p> are assumed to terminate sentences. In addition, a sentence must contain at least five words and no more than twenty words, with longer or shorter sentences being broken and joined as required to meet these criteria [KPC95]. Unterminated HTML tags, that is, tags with an open brace but no close brace, cause all text from the open brace to the next open brace to be discarded.

Having defined the format of documents that are presented to the Snippet Engine, the next important characteristic of this method is the Compressed Token System (CTS) document storage scheme, and the baseline system used for comparison. For the baseline, an obvious document representation scheme is utilized: each document is simply compressed with a well known adaptive compressor, and then decompressed as required [BP98]. Such a system has been implemented by the authors using zlib [GA07] with default parameters to compress every document. Each document is stored in a single file. While manageable for small test collections or small enterprises with millions of documents, a full Web search engine may require multiple documents to inhabit single files, or a special purpose file system [HL03]. For snippet generation, the required documents are decompressed one at a time, and a linear search for the provided query terms is employed. The search is optimized for matching whole words and the sentence terminating token, rather than general pattern matching.

The CTS Snippet Engine makes a series of optimizations over the baseline presented in the previous paragraph. The first is to employ a semi-static compression method over the entire document collection, which will allow faster decompression with minimal compression loss. Using a semi-static approach involves mapping words and non-words produced by the parser to single integer tokens, with frequent symbols receiving small integers, and then choosing a coding scheme that assigns small numbers a small number of bits. Words and non-words strictly alternate in the compressed file, which always begins with a word. Each symbol is assigned its ordinal number in a list of symbols sorted by frequency. The vbyte coding scheme is used to code the word tokens [IHWB99]. The set of non-words is limited to the 64 most common punctuation sequences in the collection itself, and these are encoded with a flat 6-bit binary code. The remaining 2 bits of each punctuation symbol are used to store capitalization information.
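The vbyte coding mentioned above can be sketched as follows (one standard variable-byte convention, with the high bit marking the final byte of a code; not necessarily the exact byte layout used by CTS):

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups; the high bit of a byte
    marks the last byte of the code."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of vbyte codes back into a list of integers."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                       # high bit set: last byte of a code
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

encoded = b"".join(vbyte_encode(x) for x in [5, 300, 70000])
print(vbyte_decode(encoded))  # [5, 300, 70000]
```

Because frequent symbols receive small ordinal numbers, most tokens end up occupying a single byte, which is what makes the scheme both compact and fast to decode.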

The process of computing the semi-static model is complicated by the fact that the number of words and non-words appearing in large web collections is high. Moffat et al. [AMS97] have examined schemes for pruning models during compression using large alphabets, and conclude that rarely occurring terms need not reside in the model. Rather, rare terms are spelt out in the final compressed file, using a special word token (escape symbol) to signal their occurrence. Using these results, Turpin et al. employ two move-to-front queues: one for words and one for non-words. During the first pass of encoding, whenever the available memory is consumed and a new symbol is discovered, an existing symbol is discarded from the last half of the queue, provided that it has frequency one. If there is no such symbol, then a symbol with frequency two is evicted, and so on. The second pass of encoding replaces each word with its vbyte-encoded number, or the escape symbol and an ASCII representation of the word if it is not in the model. Similarly, each non-word sequence is replaced with its codeword, or the codeword for a single space character if it is not in the model. This lossy compression of non-words is acceptable when the documents are used for snippet generation, but may not be acceptable for a document database.

Besides the fact that it allows faster decompression, the semi-static scheme also has the advantage that it readily allows direct matching of query terms as compressed integers in the compressed file. This way, sentences can be scored without having to decompress a document, and only the sentences returned as part of a snippet need to be decoded.

The CTS system stores all documents contiguously in one file, and an auxiliary table of 64-bit integers indicating the start offset of each document in the file. Further, it must have access to the reverse mapping of term numbers, allowing those words not spelt out in the document to be recovered and returned to the Query Engine as strings. The first of these data structures can be readily partitioned and distributed if the Snippet Engine occupies multiple machines; the second, however, is not so easily partitioned, as any document on a remote machine might require access to the whole integer-to-string mapping. This is the second reason for employing the model pruning step during construction of the semi-static code: it limits the size of the reverse mapping table that should be present on every machine implementing the Snippet Engine.
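The contiguous-file layout with a 64-bit offset table might be sketched as follows (an in-memory toy; a real system would keep the blob on disk and seek to the offsets rather than slicing a byte array):

```python
import struct

class DocumentStore:
    """All documents stored contiguously in one blob; a packed table of 64-bit
    integers gives the start offset of each document (plus an end sentinel)."""
    def __init__(self, docs):
        self.blob = bytearray()
        offsets = []
        for d in docs:
            offsets.append(len(self.blob))
            self.blob += d
        offsets.append(len(self.blob))          # sentinel: end of last document
        self.offsets = struct.pack(f"<{len(offsets)}q", *offsets)

    def get(self, doc_id):
        # read two consecutive 64-bit offsets: start of this doc, start of next
        lo, hi = struct.unpack_from("<2q", self.offsets, 8 * doc_id)
        return bytes(self.blob[lo:hi])

store = DocumentStore([b"first document", b"second document"])
print(store.get(1))  # b'second document'
```

The offset table is what makes the structure easy to partition: any contiguous range of documents can be moved to another machine together with its slice of the table.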


2.1.2 Caching and sentence reordering

In order to speed up the snippet generation process, Turpin et al. make use of a Snippet Engine cache in proportion to the size of the document collection. Using simulation, the authors have compared two caching policies: a static cache, where the cache is loaded with as many documents as it can hold before the system begins answering queries, and then never changes; and a least-recently-used cache, which starts out as for the static cache, but whenever a document is accessed it moves to the front of a queue, and if a document is fetched from disk, the last item in the queue is evicted. It is assumed that a query cache exists for the top Q most frequent queries, and that these queries are never processed by the Snippet Engine.

The results of the experiments reveal that the static cache performs well, but it is outperformed by the LRU cache: if as little as 1% of the documents are cached, then around 75% of the disk seeks can be avoided.
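An LRU document cache of the kind simulated by Turpin et al. can be sketched as follows (the fetch callback and the capacity are illustrative assumptions; disk seeks are counted so that the hit ratio can be measured as in their simulation):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used document cache: on a hit, the document moves to the
    front of the queue; on a miss with a full cache, the least recently used
    document is evicted."""
    def __init__(self, capacity, fetch_from_disk):
        self.capacity = capacity
        self.fetch = fetch_from_disk
        self.cache = OrderedDict()
        self.disk_seeks = 0

    def get(self, doc_id):
        if doc_id in self.cache:
            self.cache.move_to_end(doc_id)      # hit: mark most recently used
            return self.cache[doc_id]
        self.disk_seeks += 1                    # miss: fetch, then maybe evict
        doc = self.fetch(doc_id)
        self.cache[doc_id] = doc
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return doc

cache = LRUCache(2, lambda i: f"doc-{i}")
for i in [1, 2, 1, 3, 1]:
    cache.get(i)
print(cache.disk_seeks)  # 3: documents 1, 2 and 3 each fetched once
```

Replaying a query log through such a cache and counting disk_seeks is essentially how the 75%-of-seeks-avoided figure above would be measured.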

Besides compression, another approach for reducing the size of the documents in the cache is explored: instead of caching the whole documents, only sentences that are likely to be used in snippets are stored. If during snippet generation on a cached document the sentence scores do not reach a certain threshold, then the whole document is retrieved from disk. This means that it is very important which sentences from a document are stored in the cache and which are left on the disk.

A sentence reordering can be performed for each document such that sentences that are very likely to appear in snippets are at the front of the document. This way, they are processed first at query time, and once enough sentences are found to generate a snippet, the rest of the document can be ignored. Further, to improve caching, only the head of each document can be stored in the cache, with the tail residing on disk. One problem of this approach is that now the search engine should be able to provide copies of the documents. The snippet generation engine cannot do this anymore, since it does not have access to the exact text of the document as it was indexed. Four sentence reordering schemes are considered:

• Natural order The first few sentences of a well-authored document usually best describe the document content [Luh58]. Thus simply processing a document in order should yield quality snippets. Unfortunately, web documents are often not well authored, with little editorial or professional writing skill brought to bear on the creation of a work of literary merit. Also, taking into account that query-biased snippets are being produced, there is no guarantee that query terms will appear in sentences toward the front of a document.

• Significant terms (ST) Luhn introduced the concept of a significant sentence as containing a cluster of significant terms [Luh58], a concept found to work well by Tombros and Sanderson [TS98]. Let fd,t be the frequency of term t in document d; then term t is determined to be significant if

    fd,t ≥  7 − 0.1 × (25 − sd),  if sd < 25
            7,                    if 25 ≤ sd ≤ 40
            7 + 0.1 × (sd − 40),  otherwise

where sd is the number of sentences in document d. A bracketed section is defined as a group of terms where the leftmost and rightmost terms are significant terms, and no significant terms in the bracketed section are divided by more than four non-significant terms. The score of a bracketed section is the square of the number of significant words falling in the section, divided by the total number of words in the entire sentence. The a priori score for a sentence is computed as the maximum of all scores for the bracketed sections of the sentence. The sentences are then sorted by this score.

• Query log based (QLt) Many Web queries repeat, and a small number of queries make up a large volume of total searches [BJP05]. In order to take advantage of this bias, sentences that contain many past query terms are promoted to the front of a document, while sentences that contain few query terms are demoted. In this scheme, the sentences are sorted by the number of sentence terms that occur in the query log. To ensure that long sentences do not dominate over shorter qualitative sentences, the score assigned to each sentence is divided by the number of terms in that sentence, giving each sentence a score between 0 and 1.

• Query log based (QLu) This scheme is as for QLt, but repeated terms in the sentence are only counted once.

The purpose of using the ST, QLt or QLu schemes is to terminate snippet generation earlier than if Natural Order is used, but still produce sentences with the same number of unique query terms (d in Figure 2.1), total number of query terms (c), the same positional score (h + l) and the same maximum span (k).
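The piecewise significance threshold from the ST scheme above can be written out directly:

```python
def is_significant(f_dt, s_d):
    """Luhn's significance test: term frequency f_dt in document d against a
    threshold depending on s_d, the number of sentences in the document."""
    if s_d < 25:
        threshold = 7 - 0.1 * (25 - s_d)
    elif s_d <= 40:
        threshold = 7
    else:
        threshold = 7 + 0.1 * (s_d - 40)
    return f_dt >= threshold

print(is_significant(5, 10))   # threshold is 5.5, so False
print(is_significant(6, 10))   # 6 >= 5.5, so True
```

Intuitively, the threshold is relaxed for short documents (where even moderately frequent terms stand out) and tightened for long ones.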

Experiments revealed that sorting sentences using the Significant Terms (ST) method leads to the smallest change in the sentence scoring components. The greatest change over all methods is in the sentence position (h + l) component of the score, which is to be expected, as there is no guarantee that leading and heading sentences are processed at all after sentences are re-ordered. The second most affected component is the number of distinct query terms in a returned sentence, but if only the first 50% of the document is processed with the ST method, there is a drop of only 8% in the number of distinct query terms found in snippets.

The authors argue that there is little overall effect on scores when processing only half the document using the ST method, depending on how the various components are weighted to compute the overall snippet score.

In conclusion, the Turpin et al. method follows the document-based approach, providing a series of mechanisms for improving the speed of access to the document text: caching, sentence reordering and a semi-static compression method for faster decompression.

2.2 Snippets in practice: Lucene

All open-source engines we know of that provide query-dependent snippet generation follow the document-based approach. In particular, this holds true for Lucene [Cut]. Lucene is one of the most popular free/open source information retrieval libraries, providing support for indexing and search. It is supported by the Apache Software Foundation and is released under the Apache Software License.

The early versions of Lucene use a straightforward implementation of the document-based approach: at query time, each of the top-ranked documents is re-parsed and segments containing the query words are extracted. The highlighting of the query words in these segments is done by simply comparing the tokens of the document with the tokens in the query. The tokens of the document can be built from the index or by analyzing the document. If there is a match between one token in the document and one token in the query, then the original text can be reconstructed, with the query words highlighted, by using stored information about the original offsets in the document for each of the tokens.

The Lucene highlighter can handle stemmed words, since both the query terms and the document content are processed by the same parser. This means that words such as "algorithmic", "algorithm" and "algorithms" are stemmed to the same root form in both query and document content. The tokens produced by the parser include the byte offsets of the original full word, not just the stemmed form, so the highlighter knows the full extent of what to highlight in the text.
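The offset-based highlighting just described can be illustrated with a small sketch. This is not Lucene's actual API; the toy stemmer and the token representation (a list of character offsets) are our own simplifications for illustration.

```python
# Toy sketch of offset-based highlighting: each token carries the offsets of
# the original (unstemmed) word, so the full word can be wrapped in markup.

def stem(word):
    # deliberately crude stemmer for the example only
    for suffix in ("ic", "s"):
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    return word

def highlight(text, tokens, query_terms):
    """tokens: list of (start, end) character offsets into text."""
    stems = {stem(t.lower()) for t in query_terms}
    out, last = [], 0
    for start, end in tokens:
        word = text[start:end]
        if stem(word.lower()) in stems:
            out.append(text[last:start] + "<b>" + word + "</b>")
            last = end
    out.append(text[last:])
    return "".join(out)

text = "algorithmic tricks for algorithms"
tokens = [(0, 11), (12, 18), (19, 22), (23, 33)]
print(highlight(text, tokens, ["algorithm"]))
# <b>algorithmic</b> tricks for <b>algorithms</b>
```

Because the stored offsets refer to the original word, the full surface forms "algorithmic" and "algorithms" are highlighted even though matching happens on the stem.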

The Lucene highlighter handles fuzzy queries by expanding them to all similar terms. An example of such a fuzzy query is the misspelled word belies: it might be expanded to the disjunctive query belief believe. Such an approach has the obvious disadvantage that the number of query words increases with the size of the set of similar terms. This represents an important scalability problem.

To address the efficiency problems of the previously presented approach, recent versions of Lucene provide support for storing the sequence of term ids output by the parser (much in the vein of [TTHW07]) and even precompute and store for each document a small index for fast location of the query words.

However, all of these enhancements remain in the realm of the document-based approach, and as such do not address the important problems of non-literal matches, code duplication, and coarse caching granularity pointed out in Chapter 5.


Chapter 3

General information retrieval notions

This chapter describes the basic IR concepts and notions required by our snippet generation method. The first part introduces the data structure used for query processing and the operations performed on this data structure. Here is where we also describe the first (common) step for both index-based and document-based snippet generation methods. The second part provides details about the advanced search options that our method supports in an efficient manner.

3.1 Positional inverted index

The inverted index, sometimes known as inverted file or postings file, is one of the major concepts in information retrieval. It is the data structure of choice for most search applications: it is very efficient for short queries, easy to implement, and can be compressed well.

The retrieval is performed on a group of documents known as a (document) collection or corpus (a body of texts). Based on the document collection, a lexicon or vocabulary is built: a list of the terms that appear in the corpus. An inverted index contains, for each term in the lexicon, an inverted list that stores a list of pointers to all occurrences of that term in the main text, where each pointer is, in effect, the number of a document in which the term appears [IHWB99].

Here is an example of such an inverted list:

doc ids    D100   D129   D1401   D2722   D3000

The meaning is that the term to which this inverted list belongs appears in the documents with the ids D100, D129, D1401, D2722 and D3000.

Our method requires a positional inverted index. This simply means that each inverted list must store, for each document, the positions where the corresponding term appears. For such an index, an inverted list conceptually looks as follows:
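The construction of such a positional index can be sketched in a few lines. This is a toy model for illustration only: postings are reduced to (doc id, position) pairs, and the word ids and scores of the conceptual layout are omitted.

```python
from collections import defaultdict

# Toy positional inverted index: maps each term to a list of
# (doc id, position) postings, sorted by doc id.

def build_index(docs):
    """docs: dict mapping doc id -> list of terms of that document."""
    index = defaultdict(list)
    for doc_id in sorted(docs):          # postings end up sorted by doc id
        for pos, term in enumerate(docs[doc_id]):
            index[term].append((doc_id, pos))
    return index

docs = {
    100: ["roman", "emperor"],
    129: ["rome", "was", "a", "city"],
}
index = build_index(docs)
print(index["roman"])  # [(100, 0)]
print(index["city"])   # [(129, 3)]
```

Iterating over the doc ids in sorted order guarantees the invariant, used throughout this chapter, that every inverted list is sorted by doc id.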


doc ids    D7    D23   D47   D47   D63
positions  6     15    26    38    87
word ids   W12   W12   W12   W12   W12
scores     2     10    16    8     7

For example, the first inverted list entry (also called a posting) has the meaning that the word/term with id W12 occurs in the document with id D7 at position 6, with a score of 2. It is important to note that the doc ids are sorted in increasing order. A doc id is repeated if the term appears more than once in the corresponding document. For each occurrence of the term in a document, a score that reflects the importance of that particular occurrence is assigned. Individual scores are aggregated to per-document scores, according to which the documents are eventually ranked. Based on the aggregated scores, only the most relevant documents are presented to the user.

In the inverted list presented above, all postings have the same value for the word ids entry, namely W12. This means that the inverted list belongs to a single word, for example roman. If we want to search for all the words that start with rom, then our query word will be rom*. The asterisk is used to denote the fact that any sequence of characters might follow after rom. Assuming that our collection contains only roman and rome as words that contain rom as a prefix, the result of our query should return the doc ids, positions and scores for these two words. The result posting list might look as follows:

doc ids    D7    D23   D31   D47   D47   D63   D87   D99
positions  6     15    12    26    38    87    42    11
word ids   W12   W12   W67   W12   W12   W12   W67   W67
scores     2     10    10    16    8     7     10    12

It can be noticed that this inverted list, belonging to the query word rom*, contains the postings corresponding to the query word roman (the ones for which the value of word ids is W12). It also contains the postings for the query word rome (having W67 as word ids value). This kind of prefix query is the reason for using a word ids entry for each of the postings.

Given a query, the basic operation will be to either intersect the lists for each of the query words (to obtain the list of ids of all documents that contain all query words; such queries are called conjunctive) or to compute their union (to obtain the list of ids of all documents that contain at least one of the query words; such queries are called disjunctive). Let us consider an example for the conjunctive query roman emperor julius, where the query words have the following inverted lists:

• roman

doc ids    D7    D23   D47   D47   D63
positions  6     15    26    38    87
word ids   W12   W12   W12   W12   W12
scores     2     10    16    8     7

• emperor


doc ids    D3    D7    D47   D63
positions  90    31    27    81
word ids   W34   W34   W34   W34
scores     9     10    16    1

• julius

doc ids    D7    D7    D39   D47   D63
positions  2     39    39    28    88
word ids   W27   W27   W27   W27   W27
scores     6     33    19    8     16

In order to compute the inverted list for the entire query, the lists of the individual query words must be intersected according to the doc ids. For our example, the resulting inverted list is the following:

doc ids     D7   D7   D7   D7   D47   D47   D47   D47   D63   D63   D63
positions   2    6    31   39   26    27    28    38    81    87    88
word ids    W9   W9   W9   W9   W9    W9    W9    W9    W9    W9    W9
scores      51   51   51   51   48    48    48    48    24    24    24
query word  3    1    2    3    1     2     3     1     2     1     3

It can be observed that the postings from the result inverted list have one more entry, named query word. The problem is that after intersecting the posting lists of the query words, we need to know, for each of these query words, all its positions in the common documents (returned by the intersection). The query word entry tells from which query word the posting stems. Without it, it would not be possible to say which position refers to which query word. More details about this topic will be presented in Chapter 4. At this point it is important to notice that this straightforward approach has two major drawbacks: the result inverted list has been changed by adding one more entry for each posting, and it is also very long: we have only three common document ids, but eleven postings.

In practice, the resulting inverted lists might contain millions of different doc ids, so the ranking of the postings is essential in order to consider only the top K most relevant documents. The ranking is done based on the scores from the result list. They are obtained through score aggregation during the intersection procedure. For our example, the aggregation function is the sum of the scores corresponding to the same doc id. If we consider the document with the id D7, we notice that the scores associated with the occurrences of the three query words are 2, 10, 6 and 33. In the result list, the score associated with D7 is the sum of all these scores: 51. The scores for the rest of the doc ids are computed using the same aggregation function.
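The intersection-with-aggregation step just described can be sketched as follows. This is a simplified illustration, not the engine's implementation: postings are reduced to (doc id, score) pairs, and a hash-based intersection stands in for the merge over sorted lists.

```python
from collections import defaultdict

# Sketch of conjunctive intersection with score aggregation: scores of the
# same doc id are summed per query word, and only doc ids common to all
# query words survive, each with the sum of all its occurrence scores.

def conjunctive_scores(posting_lists):
    per_word = []
    for postings in posting_lists:
        scores = defaultdict(int)
        for doc_id, score in postings:
            scores[doc_id] += score
        per_word.append(scores)
    common = set(per_word[0])
    for scores in per_word[1:]:
        common &= set(scores)
    return {d: sum(scores[d] for scores in per_word) for d in common}

# (doc id, score) postings from the roman emperor julius example
roman   = [(7, 2), (23, 10), (47, 16), (47, 8), (63, 7)]
emperor = [(3, 9), (7, 10), (47, 16), (63, 1)]
julius  = [(7, 6), (7, 33), (39, 19), (47, 8), (63, 16)]
print(conjunctive_scores([roman, emperor, julius]))
# {7: 51, 47: 48, 63: 24} (key order may vary)
```

The result reproduces the aggregated scores of the worked example: D7 gets 2 + 10 + 6 + 33 = 51, D47 gets 48, D63 gets 24.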

Let us assume that we are only interested in the two most relevant documents obtained as results for the roman emperor julius query. In this case, only the postings corresponding to the two highest-scoring documents are kept in the result list:


doc ids     D7   D7   D7   D7   D47   D47   D47   D47
positions   2    6    31   39   26    27    28    38
word ids    W9   W9   W9   W9   W9    W9    W9    W9
scores      51   51   51   51   48    48    48    48
query word  3    1    2    3    1     2     3     1

The example we have presented in this section illustrates the main steps of how a query can be processed by a search engine. Computing the ids of the top-ranking documents represents the first step of both the index-based method and the document-based approach. We call this step (I0) (index-based) or (D0) (document-based), depending on the snippet generation method. We use the (D0) / (I0) notation for this step (instead of (D1) / (I1)) because it is not really part of the snippet generation, but rather a prerequisite.

3.2 Advanced search

Besides the usual conjunctive or disjunctive queries, our method supports, without additional effort, queries with operators and advanced queries (synonym search, error-tolerant search, semantic search, etc.). These two categories of queries are problematic when handled by the document-based method in its basic form. The following two subsections present descriptions and examples for these two categories of queries.

3.2.1 Query operators

A query operator consists of one or more characters that act not as a query word, but as an instruction on how a query is to be processed. An operator can work at word level, where it applies to a single query term, or at query level, where its presence affects the processing of the entire query.

Our method is completely unaware of the query words and operators that are part of the query. It uses only the positions provided by the positional inverted index to identify the words for which snippets must be generated. This means that snippet generation will work with any query operator. We have experimented with the phrase operator, the proximity operator and the disjunction operator. The phrase operator requires that two query words are adjacent, the proximity operator requires that two query words are at most 5 words apart, and the disjunction operator allows a search for any one of a group of two or more query words. Let us consider the query roman.emperor, which contains the phrase operator, and the following text:

"The Roman Emperor was the ruler of the Roman State during the imperial period (starting at about 27 BC). The Romans had no single term for the office: Latin titles such as imperator (from which English emperor ultimately derives), augustus, caesar and princeps were all associated with it."

As results for this query, only the occurrences of roman followed immediately by emperor will be considered (they are emphasized with bold characters in the text above). The non-adjacent occurrences of roman and emperor (marked with italic characters) are not considered, since our query contains the phrase operator between the two words. Such operator queries do not require additional effort because everything is done during the intersection procedure. If we consider again the inverted lists of the two query words:


• roman

doc ids    D7    D23   D47   D47   D63
positions  6     15    26    38    87
word ids   W12   W12   W12   W12   W12
scores     2     10    16    8     7

• emperor

doc ids    D3    D7    D47   D63
positions  90    31    27    81
word ids   W34   W34   W34   W34
scores     9     10    16    1

The intersection procedure will simply check whether there are any consecutive positions for the common document ids in the two lists. In our example, only for document id D47 do we have the consecutive positions 26 and 27, so the result inverted list for our query is:

doc ids    D47
scores     32

The other two types of operators can be handled in a similar fashion. Without a positional inverted index, processing such operator queries would take more time: first, the documents that simply contain the query words would have to be identified, and then all these documents would have to be scanned in order to compute the distances between the query words.
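The phrase check performed during the intersection can be sketched as follows. This is a simplified illustration with postings reduced to (doc id, position) pairs; the real procedure works on the full postings with word ids and scores.

```python
# Sketch of the phrase-operator check: for doc ids common to both lists,
# keep only positions from the second list that occur exactly one to the
# right of a position from the first list.

def phrase_intersect(left, right):
    """left, right: lists of (doc id, position) postings, sorted by doc id."""
    left_by_doc = {}
    for doc_id, pos in left:
        left_by_doc.setdefault(doc_id, set()).add(pos)
    return [(doc_id, pos) for doc_id, pos in right
            if pos - 1 in left_by_doc.get(doc_id, set())]

# (doc id, position) postings from the roman.emperor example
roman   = [(7, 6), (23, 15), (47, 26), (47, 38), (63, 87)]
emperor = [(3, 90), (7, 31), (47, 27), (63, 81)]
print(phrase_intersect(roman, emperor))  # [(47, 27)]
```

As in the worked example, only document D47 survives, where roman at position 26 is immediately followed by emperor at position 27. Replacing the `pos - 1` test with a window test (for example, `any(abs(pos - p) <= 5 for p in ...)`) would give the proximity operator instead.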

3.2.2 Non-literal matches

Index-based snippet generation is also able to deal with the vocabulary mismatch problem, encountered when a relevant document does not literally contain (one of) the query words. A nice list of commented examples from the TREC benchmarks is given in [Buc04].

Let us consider the semantic query concept:city, which means that we are looking for all the names of cities in the collection and not for the occurrences of the word city. We have marked this fact by prefixing the query with the concept keyword. Given this scenario, it would be desirable that the following excerpt is taken into consideration:

"Rome achieved great glory under Octavian/Augustus. He restored peace after 100 years of civil war; maintained an honest government and a sound currency system;"

The reason for considering this snippet is that it contains the word Rome, which can be thought of as a query word (it is the name of a city). A common way to provide such behavior and overcome the vocabulary mismatch problem is to index documents under terms that are related to words in the document, but do not literally occur themselves. Let us take as an example a collection that contains only Rome and Alexandria as names of cities, with the following inverted lists:

• Alexandria


doc ids    D64   D88
positions  33    66
word ids   W42   W42
scores     17    19

• Rome

doc ids    D31   D87   D99
positions  12    42    11
word ids   W67   W67   W67
scores     3     10    12

Then, the inverted list for the term concept:city is composed of all the postings belonging to the previous two terms:

• concept:city

doc ids    D31   D64   D87   D88   D99
positions  12    33    42    66    11
word ids   W67   W42   W67   W42   W67
scores     3     17    10    19    12

There is a variety of advanced search features which call for non-literal matches: related-words search (as in our example), prefix search (for a query containing alg, find algorithm), error-correcting search (for a query containing algorithm, find the misspelling algoritm), semantic search (for a query musician, find the entity reference John Lennon), etc.
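The composition of the concept list shown above amounts to a k-way merge of the member terms' (already sorted) inverted lists by doc id, which can be sketched as:

```python
import heapq

# Sketch of assembling a concept term's inverted list by merging the sorted
# inverted lists of its member terms, preserving doc id order.

def merge_concept_lists(lists):
    """lists: posting lists of member terms, each sorted by doc id."""
    return list(heapq.merge(*lists, key=lambda posting: posting[0]))

# postings as (doc id, position, word id, score), from the example above
alexandria = [(64, 33, "W42", 17), (88, 66, "W42", 19)]
rome       = [(31, 12, "W67", 3), (87, 42, "W67", 10), (99, 11, "W67", 12)]
print(merge_concept_lists([alexandria, rome]))
# [(31, 12, 'W67', 3), (64, 33, 'W42', 17), (87, 42, 'W67', 10),
#  (88, 66, 'W42', 19), (99, 11, 'W67', 12)]
```

Because the word ids are kept in each posting, the merged concept:city list can still tell which member term (Rome or Alexandria) produced a given match.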


Chapter 4

Index-based snippet generation

On a high level, our snippet generation method follows a four-step approach. We have already discussed the first step (I0) in Chapter 3. This chapter presents in detail the remaining three steps of our method. In the first part we describe and analyze two methods for computing the positions of the query words in the top-ranked documents (I1), using a positional inverted index. We continue by showing how to use the positions of the query words in order to determine the positions of the snippets (I2). The third part is a section dedicated to the last step of our method: using the snippet positions to get the actual snippet text, which is finally presented to the user (I3). We finalize this chapter by presenting some details about the snippet caching mechanism and the integration with the CompleteSearch engine.

4.1 Computing all matching positions

Throughout this work, we assume that we are using the positional index presented in Chapter 3, with posting lists that conceptually look as follows:

doc ids    D401   D1701   D1701   D1701   D1807
word ids   W173   W173    W173    W173    W173
positions  5      3       12      54      9
scores     0.3    0.7     0.4     0.3     0.2

For example, the third posting from the right says that the word with id W173 occurs in the document with id D1701 at position 12, and was assigned a score of 0.4. In an actual implementation these lists would be stored in compressed format, but we need not consider this level of detail in the considerations that follow.

A standard inverted index will have one such list precomputed for each word (that is, all word ids are the same in each such list). Pruning techniques are often used to avoid a full scan of the index lists involved, especially in the case of disjunctive queries; for example, see [AM06], [BMS+06]. The results presented in Subsection 4.1.3 pertain to such a pruning technique for disjunctive queries, following the ideas of [BMS+06].

The second step of our method, named (I1), uses such a positional inverted index: it takes as input the doc ids of the top-ranked documents and the posting lists of the query words (that make up the user's query). In order to compute the doc ids of the top-ranked documents, the (I0) step must load from disk the posting lists of the query words. This means that (I0) can provide these posting lists to (I1) without additional effort. If we consider again the example from Chapter 3, the conjunctive query roman emperor julius, then the input of step (I1) would be the doc ids of the top-ranked documents and the inverted lists of the query words:

• top ranked documents

doc ids  D7   D47
scores   51   48

• roman

doc ids    D7    D23   D47   D47   D63
positions  6     15    26    38    87
word ids   W12   W12   W12   W12   W12
scores     2     10    16    8     7

• emperor

doc ids    D3    D7    D47   D63
positions  90    31    27    81
word ids   W34   W34   W34   W34
scores     9     10    16    1

• julius

doc ids    D7    D7    D39   D47   D63
positions  2     39    39    28    88
word ids   W27   W27   W27   W27   W27
scores     6     33    19    8     16

The (I1) step provides as output the positions of each of the query words in each of the top-ranked documents. This means that the output of step (I1) for this example consists of the positions of each of the query words in the documents with the ids D7 and D47:

• D7

– roman: 6
– emperor: 31
– julius: 2, 39

• D47

– roman: 26, 38
– emperor: 27
– julius: 28

The positional inverted index we have described so far obviously contains all the information required to compute this output. The difficulty is how to incorporate such information in the result. The following two subsections present two approaches for computing the output of step (I1), and they are followed by a subsection dedicated to an experimental comparison of these approaches.


4.1.1 Extended lists

A first and obvious approach for solving this problem is to use extended lists. An extended list is obtained in a straightforward way, by enhancing the postings of an inverted list with an additional entry, telling from which query word each posting stems. Conceptually:

doc ids     D1701   D1701   D1701   D1701   D1701
word ids    W173    W173    W173    W173    W173
positions   5       9       12      54      92
scores      0.3     0.7     0.4     0.3     0.2
query word  1       1       1       2       2

For example, the third posting from the right now 'knows' that it stems from the first query word. (When reading a list from disk, the query word entry would be set to some special value.) It is not hard to see that with this enhancement, the information required for step (I1) can be computed for all of the operations described above. Coming back to the example we have used so far, the posting lists of the three query words could be enhanced in the following way:

• roman

doc ids     D7    D23   D47   D47   D63
positions   6     15    26    38    87
word ids    W12   W12   W12   W12   W12
scores      2     10    16    8     7
query word  1     1     1     1     1

• emperor

doc ids     D3    D7    D47   D63
positions   90    31    27    81
word ids    W34   W34   W34   W34
scores      9     10    16    1
query word  2     2     2     2

• julius

doc ids     D7    D7    D39   D47   D63
positions   2     39    39    28    88
word ids    W27   W27   W27   W27   W27
scores      6     33    19    8     16
query word  3     3     3     3     3

All the postings belonging to the inverted list of the query word roman have an extra entry (denoted as query word in our example) which always has the value 1. This simply marks the fact that these postings contain information referring to the first query word. In a similar way, the postings of the second query word (emperor) and the third query word (julius) contain an additional entry with values 2 and 3, respectively. The result extended list is computed by intersecting the extended lists belonging to the three query words, and looks as follows:


doc ids     D7   D7   D7   D7   D47   D47   D47   D47   D63   D63   D63
positions   2    6    31   39   26    27    28    38    81    87    88
word ids    W9   W9   W9   W9   W9    W9    W9    W9    W9    W9    W9
scores      51   51   51   51   48    48    48    48    24    24    24
query word  3    1    2    3    1     2     3     1     2     1     3

Given such a list, it is straightforward to extract the positions of each of the query words in each of the top-ranked documents. It can be done linearly, by examining each posting once. There are two major disadvantages with this approach, however. The first is that we modified the central data structure, the sanctuary of every search engine. Unless the search engine was written with such modifications in mind, this is usually not tolerable.

The second major problem is efficiency. Consider an intersection of two query words. The extended result list would now contain postings from both input lists, that is, it at least doubles in size compared to the corresponding simple result list. This effect aggravates as the number of query words grows. For our previous example, the simple result list has only two postings, while the extended result list has eight postings. The experimental results presented in Subsection 4.1.3 show that this indeed affects the processing time. The problem is that per-query-word positions are computed for all matching documents, before the ranking is done. Note that this also happens when pruning techniques are involved, because there, too, large numbers of postings (namely, candidates for one of the top-ranked doc ids) are involved in the intermediate calculations.

4.1.2 Query rotation

We propose to implement (I1) by what we call query rotation. We first explain query rotation using an example: the three-word phrase query roman.emperor.julius. This means that the three query words should appear in the text one after another, in the specified order. We assume that the posting lists for the query words are the ones we have used so far. They are provided by step (I0), along with the set D of ids of the top-ranked documents:

doc ids  D7   D47
scores   51   48

First we compute the posting list L′1 as the intersection of D and the list of all postings for roman:

doc ids    D7    D23   D47   D47   D63
positions  6     15    26    38    87
word ids   W12   W12   W12   W12   W12
scores     2     10    16    8     7

For each doc id in the intersection we store all the positions from the second list, obtaining L′1:

doc ids    D7    D47   D47
positions  6     26    38
word ids   W12   W12   W12
scores     2     16    8


Clearly, L′1 is now a superset of the desired positions for roman, and typically a strict one, because not all occurrences of roman will be followed by emperor julius. Now we compute the posting list L′2 as the intersection of L′1 with the list of postings for emperor:

doc ids    D3    D7    D47   D63
positions  90    31    27    81
word ids   W34   W34   W34   W34
scores     9     10    16    1

When computing this intersection, we take into account that roman and emperor are separated by the phrase operator ".". We write into the result (L′2) only those positions from the second list which indeed occur one to the right of a position from L′1, obtaining:

doc ids    D47
positions  27
word ids   W34
scores     16

Similarly, we compute the intersection of L′2 with the posting list of julius:

doc ids    D7    D7    D39   D47   D63
positions  2     39    39    28    88
word ids   W27   W27   W27   W27   W27
scores     6     33    19    8     16

Again obeying the phrase operator (now the one between emperor and julius), we obtain the result of the intersection, L′3:

doc ids    D47
positions  28
word ids   W27
scores     8

It is not hard to see that L′3 now contains exactly the desired set of positions for julius. Below, we will say that the query can be evaluated from left to right, a property that holds true for all queries considered in this work.

Given L′3, we now compute L2 as the intersection of L′3 with, again, the posting list for emperor, but now writing only those postings from the second list to the result which occur one to the left of a position from L′3. Formally, we are considering the inverse of the phrase operator here. All operators considered in this work trivially have such an inverse. For our example, L2 has the following form:

doc ids    D47
positions  27
word ids   W34
scores     16


Input: Two posting lists, p and q.
Output: A posting list r which is the result of the intersection of p and q.

Initialize lowerBound with the value -1.
For each document id d from p:
    Binary search for d in posting list q, starting at position lowerBound + 1.
    If d appears in posting list q:
        Copy all postings which have d as document id from q to r.
        Store in lowerBound the position of the last posting from q which has d as document id.

Figure 4.1: Binary search intersection algorithm for two posting lists
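The procedure of Figure 4.1 can be rendered in Python as follows (a sketch; posting payloads are simplified to a single value per posting):

```python
from bisect import bisect_left

# Sketch of the binary-search intersection of Figure 4.1.
# p and q are posting lists of (doc id, payload) pairs, sorted by doc id;
# all postings of a matching doc id are copied from q into the result.

def intersect(p, q):
    r = []
    lower_bound = -1
    q_doc_ids = [doc_id for doc_id, _ in q]
    for doc_id, _ in p:
        # binary search for doc_id in q, starting after lower_bound
        i = bisect_left(q_doc_ids, doc_id, lower_bound + 1)
        if i < len(q) and q_doc_ids[i] == doc_id:
            while i < len(q) and q_doc_ids[i] == doc_id:
                r.append(q[i])        # copy all postings with this doc id
                i += 1
            lower_bound = i - 1       # later searches skip everything before here
    return r

top_docs = [(7, 51), (47, 48)]
roman = [(7, 6), (23, 15), (47, 26), (47, 38), (63, 87)]
print(intersect(top_docs, roman))  # [(7, 6), (47, 26), (47, 38)]
```

The example computes L′1 of the running example: both D47 postings of roman are copied, and the search for D47 starts after the stored position of D7, exploiting the ascending doc id order.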

L2 is now exactly the set of desired positions for emperor, and from L2 we can analogously compute L1 as the set of matching positions for roman in documents from D:

doc ids    D47
positions  26
word ids   W12
scores     16

Since D is the list of top-ranked document ids and thus very short, each of L′1, L′2, L′3, L2, and L1 will be very short too, and all of these intersections can be computed extremely fast. Our intersection routine takes advantage of the fact that the document ids are sorted in ascending order in all posting lists. As already mentioned, whenever two posting lists must be intersected during query rotation, at least one of the two is very short. Our intersection algorithm simply binary searches each document id from the shorter posting list in the other input list, as shown in Figure 4.1.

Once a document id from the first list is found in the second, its position in the second list is stored. The search for the next document id from the first list starts at this stored position in the second list. This optimization is possible since all the document ids are sorted in ascending order. It would not be efficient to start the search from the beginning of the second list, knowing that part of the document ids are smaller than the one we are currently searching for; all these document ids can be skipped. It is also important to note that if a document id found in the second list has more than one corresponding posting, then all these postings are included in the result list.

Our experiments in Subsection 4.1.3 will show that the time required for (I1) using query rotation is negligible compared to both the time for (I0) (the query processing that has to be done anyway) and the time needed for (I3) (getting the snippet text from the documents).
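The complete rotation for a phrase query can be sketched end to end. This is an illustration of the idea, not the engine's implementation: postings are reduced to (doc id, position) pairs, and the hash-based phrase check stands in for the binary-search intersection of Figure 4.1.

```python
# End-to-end sketch of query rotation for a phrase query. The forward pass
# computes L'_1, ..., L'_l from left to right; the backward pass applies the
# inverse phrase operator to recover L_1, ..., L_l.

def restrict_to_docs(postings, doc_ids):
    return [p for p in postings if p[0] in doc_ids]

def phrase_step(prev, postings, offset):
    """Keep postings whose position is `offset` away from a position in prev."""
    prev_set = set(prev)
    return [(d, pos) for d, pos in postings if (d, pos - offset) in prev_set]

def query_rotation(D, lists):
    # forward pass: L'_1 = D restricted, then L'_i from L'_{i-1}
    fwd = [restrict_to_docs(lists[0], D)]
    for postings in lists[1:]:
        fwd.append(phrase_step(fwd[-1], postings, +1))
    # backward pass: L_l = L'_l, then L_i from L_{i+1} via the inverse operator
    result = [None] * len(lists)
    result[-1] = fwd[-1]
    for i in range(len(lists) - 2, -1, -1):
        result[i] = phrase_step(result[i + 1], lists[i], -1)
    return result

D = {7, 47}
roman   = [(7, 6), (23, 15), (47, 26), (47, 38), (63, 87)]
emperor = [(3, 90), (7, 31), (47, 27), (63, 81)]
julius  = [(7, 2), (7, 39), (39, 39), (47, 28), (63, 88)]
print(query_rotation(D, [roman, emperor, julius]))
# [[(47, 26)], [(47, 27)], [(47, 28)]]
```

The output reproduces L1, L2 and L3 of the worked example: the single occurrence of the phrase roman emperor julius in document D47 at positions 26, 27 and 28.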

More generally and formally, we have the following result regarding query rotation:

Lemma: For a given query Q1 ◦1 · · · ◦l−1 Ql, assume that we have the following information:

• the set D of ids of the top-k matching documents, for some integer k;

• for each Qi, a list of postings covering all doc ids from D (in the simplest case this will be all postings matching Qi, but with techniques like those from [AM06], [BMS+06], it could be just a subset).


Assume that all query operators are invertible and associative. Then step (I1), that is, computing for each Qi the set of matching postings (and hence positions) in each of the documents from D, can be run in time O(k · ∑_{i=1}^{l} log |Qi|), where |Qi| is the length of the list of postings provided for Qi.

Proof: We start by computing L′1 = D ◦ Q1, which we define as the list of all those positions matching the second operand, Q1 in this case, which obey the operator, ◦ in this case. The ◦ operator just asks for matching document ids, and so L′1 contains all postings pertaining to occurrences of Q1 in documents with ids in D.

As in the example above, we go on by iteratively computing L′i = L′i−1 ◦i−1 Qi for i = 2, . . . , l. By associativity, L′l now contains exactly the matching positions for Ql in documents from D. As in the example above, we now compute Li = Li+1 ◦i⁻¹ Qi (with Ll = L′l), iterating in reverse order over i = l − 1, . . . , 1. Then L1, . . . , Ll contain the desired sets of positions.

Since two lists of sizes x and y can be intersected in time O(x · log y) (just binary search the elements of the first list in the second, as described in Figure 4.1), and |D| = k, the total time for all 2l − 1 intersections is bounded by O(k · ∑_{i=1}^{l} log |Qi|).

As shown in Figure 4.1, our construction requires a routine for intersecting two posting lists which writes selected postings (in particular, the positions) from the second list to the result list. Such a routine is likely to be provided by most search engines (our own prototype system, which was developed long before research on this thesis started, has it), and if not, should not be hard to provide. In any case, the proposed query rotation works with existing standard data structures.

4.1.3 Experimental comparison of extended lists and query rotation

This section is dedicated to an experimental comparison of the extended lists and query rotation approaches. For this experiment we have used the TREC Terabyte collection with 25,204,103 documents from the TREC GOV2 corpus (http://www-nlpir.nist.gov/projects/terabyte), and 10,000 queries from the TREC 2006 efficiency track, for example: first landing moon.

Table 4.1 below shows that extended lists would incur a large penalty: the query processing time more than doubles. This can be seen by looking at the ratio between the sum of the rows Ordinary Lists and Query Rotation and the row Extended Lists. The values in the Ordinary Lists row are the times taken to process a query, that is, the time for step (I0). We have separated this from the time taken by the query rotation itself, because (I0) has to be performed anyway. This 2:1 ratio adds to the problem, mentioned above, that the internal posting list data structure would have to be modified, which can be a major endeavor in a large system, depending on its design.

The query rotation approach, on the other hand, adds less than 1% to the query processing time, and does not require any modification of the internal posting list data structure.

4.2 Computing snippet positions

In this section we discuss step (I2) of our snippet-generation method: given the positions of each query word in each of the top-ranked documents, compute the positions of the snippets (not yet the actual text; that is done in step (I3)).



Scheme            all        1-word    2-word     ≥ 3-word

Ordinary Lists    144.0ms    17.9ms    46.7ms     174.9ms

Extended Lists    342.6ms    29.7ms    114.5ms    415.7ms

Query Rotation    + 0.4ms    + 0.1ms   + 0.2ms    + 0.5ms

Table 4.1: Times for query processing with ordinary lists and with extended lists, and the additional time taken by the query rotation, on TREC Terabyte. The first column gives the average over 10,000 queries. The other columns provide the averages for fixed numbers of query words.

Step (I2) performs the same operations for each of the top-ranked documents. It takes as input l sorted lists L1, . . . , Ll of positions (pertaining to the l query words), and a sorted list S = s1, s2, . . . of positions (marking the segment beginnings). It provides as output a list of tuples (segi, Posi), sorted by the segi, where segi is the index in S of the ith segment containing a position from one of the lists, and Posi is the list of those positions from L1, . . . , Ll which fall into that segment, together with the information from which of the lists each position came.

Example: Consider a 2-word query and a fixed document with 5 segments. Let L1 = {3, 8, 87} (positions of the first query word), L2 = {13, 79} (positions of the second query word), and S = {1, 17, 43, 67, 98} (first positions of each segment). Then the correct output of (I2) would be the two tuples (1, (3, 1), (8, 1), (13, 2)) — segment 1 with three occurrences of the query words — and (4, (79, 2), (87, 1)) — segment 4 with two occurrences of the query words.

Each tuple corresponds to a potential segment to be output, and it remains to score the segments and compute those with the m largest scores, for some user-defined m. In our experiments, we have taken m = 3 and scored segments by a weighted sum of the number of query words in the segment, the number of distinct query words in the segment, the longest contiguous run of query words in the segment, and a few other quantities, just as described in Figure 2.1. Note that (I2) would work for any other score computed as a function of segi and Posi, too.

The algorithm for this step is simply an l-way merge/intersection, as presented in Figure 4.2. It is applied to each of the top-ranked documents. The positions from the l position lists are added to wlist, which is kept sorted in ascending order. wlist is initialized by adding to it the first position from each of the l lists. On each iteration, the smallest element of wlist is extracted and a new position is added from one of the l lists. This way, wlist contains only l elements at any moment, making sorting this list an efficient operation. By using the list of segment bounds S, it can be determined to which segment the positions in wlist belong. For each segment, S stores the lower bound, so we can determine the lower and upper bound of each segment by traversing S.
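The merge just described can be sketched as follows (a simplified C++ sketch: we use a min-heap in place of the repeatedly re-sorted wlist, locate segments by binary search in S, and omit the scoring; all names are ours). Each heap operation costs O(log l), and the segment lookup adds O(log |S|) per position:

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// One output tuple of step (I2): a segment index (1-based, as in the
// example above) plus the (position, query-word index) pairs in it.
struct SegmentHits {
    int segment;
    std::vector<std::pair<int, int> > hits;  // (position, word index)
};

// l-way merge of the sorted position lists L[0..l-1], assigning every
// position to its segment. S holds the first position of each segment,
// sorted ascending.
std::vector<SegmentHits> mergeIntoSegments(
        const std::vector<std::vector<int> >& L, const std::vector<int>& S) {
    // Min-heap of (position, list index, offset within that list).
    typedef std::tuple<int, int, int> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (int i = 0; i < (int)L.size(); ++i)
        if (!L[i].empty()) heap.push(Entry(L[i][0], i, 0));

    std::vector<SegmentHits> out;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        int pos = std::get<0>(e), li = std::get<1>(e), off = std::get<2>(e);
        if (off + 1 < (int)L[li].size())           // refill from the same list
            heap.push(Entry(L[li][off + 1], li, off + 1));
        // Segment containing pos: index of the last segment start <= pos.
        int seg = int(std::upper_bound(S.begin(), S.end(), pos) - S.begin());
        if (out.empty() || out.back().segment != seg)
            out.push_back(SegmentHits{seg, {}});   // open a new segment tuple
        out.back().hits.push_back(std::make_pair(pos, li + 1));  // 1-based word
    }
    return out;
}
```

Running this on the example above (L1 = {3, 8, 87}, L2 = {13, 79}, S = {1, 17, 43, 67, 98}) yields exactly the two tuples for segments 1 and 4.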

Lemma: For each hit displayed to the user, step (I2) can be executed in time O((|L1| + · · · + |Ll| + |S|) · log l).

Proof: (I2) can be implemented by a straightforward variant of an l-way merge, followed by a simple sort. We here assume that a segment can be scored in linear time.

Our experiments will show that the computation of (I2) takes negligible time compared to all other steps. This is understandable, since neither the Li nor S nor l are ever going to be very large, and we only manipulate numbers (positions) in this step. The bulk of the work in the index-based approach is turning these positions into text, as shown next.

Input: l lists of positions corresponding to the l query words and S, the list of segment bounds
Output: a list of scored tuples (one for each segment containing at least one position from the l lists)
Initialize the output list output with the empty list
Initialize the working list of positions wlist with the empty list
Initialize the current segment csegm with the first segment of the document
For each of the l lists of positions
    Add the first element of the list to wlist (if there is one)
    Advance in the list one position
Sort wlist ascending
While wlist is not empty
    Let pos be the first position from wlist, added from input list lst
    Remove pos from wlist
    Add the current position from list lst to wlist (if there is one)
    Advance in the lst list one position
    Sort wlist ascending
    If pos belongs to the current segment csegm
        If csegm is not marked to be output
            Mark csegm to be output
        Add pos to the tuple corresponding to the current segment
    Else
        Add the tuple corresponding to the current segment to the output list
        Determine the segment that contains pos using S and mark it as the current segment csegm
        Add pos to the tuple corresponding to the current segment
        Mark csegm to be output
Return the output list output

Figure 4.2: The l-way merge / intersection algorithm for computing the positions of the snippets

4.3 From snippet positions to snippet text

The last step of our snippet generation method, (I3), takes as input a list of positions, namely the positions of the top-ranked segments computed in (I2). It provides as output the actual text at these positions, which is then displayed to the user. The question is how to represent documents so that this can be done as efficiently as possible.

We stress (again) that this final step, where the actual text gets generated, is completely oblivious of the query words that actually gave rise to the position lists. This has the invaluable advantage that any kind of advanced search feature that defines what constitutes a match has to be implemented only once, on the query processing side.

4.3.1 Block representation

In the document-based approach, the whole (compressed) document needs to be read from disk and scanned in order to determine which segments match the query. For large documents, this is an efficiency problem. As a heuristic, [TTHW07] suggest rearranging documents such that segments that are likely to score high are at the beginning, and then considering only the head of each document.

For our approach, we could in principle read exactly the amount of text needed for the snippets and no more. There would be a price for this, however: it potentially incurs one disk seek per segment, and we would need to store for every segment its offset in the document. To balance these conflicting goals, we take the following approach.

Each document is divided into blocks, where each block covers a fixed number p of positions (the last block may cover fewer than p positions). Each of these blocks is stored compressed on disk. Together with each document we store the offsets of its blocks.

Then, given the output of step (I2), we can easily identify the blocks containing these positions. Each of these blocks is then read from disk and decompressed. Within the blocks, each indexed word is marked by a special character. The text corresponding to the given positions is then easily obtained by a simple scan over the relevant blocks.
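Identifying the blocks to read is a one-line computation (a minimal C++ sketch, assuming blocks of a fixed number p of word positions and a hypothetical helper name; the byte offsets stored with the document would then be used to seek to each returned block):

```cpp
#include <cassert>
#include <vector>

// Each block covers a fixed number p of word positions, so the block
// holding a position is simply pos / p. Given the sorted positions of
// the selected snippets, return the distinct blocks that must be read
// and decompressed -- every other block of the document stays on disk.
std::vector<std::size_t> coveringBlocks(const std::vector<int>& positions,
                                        int p) {
    std::vector<std::size_t> blocks;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        std::size_t b = positions[i] / p;
        if (blocks.empty() || blocks.back() != b)  // deduplicate neighbors
            blocks.push_back(b);
    }
    return blocks;
}
```

With p = 1000, for instance, snippet positions 50, 1500, 1700, and 3010 require reading only blocks 0, 1, and 3.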

4.3.2 Compression of the blocks

To minimize the total size of our document database as well as the time to read blocks from disk, we store them compressed. Since our blocks are small, there is little value in encoding each token separately, as proposed in [TTHW07]. We instead compress blocks as a whole, using zlib (with the default compression level 6). As shown in Section 6.2, this gives us slightly better compression than in [TTHW07].

4.4 Caching

As we will see in Chapter 6, the times for steps (I2) and (I3) are dominated by the time for disk seeks: for seeking the list of the first positions of the segments, and for seeking the blocks to be read from disk.

A similar observation was made in [TTHW07], who then suggested two caching strategies, one static and one dynamic, with the goal of avoiding as many of the disk seeks as possible. Both of these strategies cached whole documents (because, as already discussed earlier, finding matching segments requires a scan over the whole document in the document-based approach).

Our approach trivially allows the caching of individual segments instead of whole documents. This is because (I2) outputs the indices of the segments (which in turn were obtained from the inverted index) to be output before we ever go to the actual documents. When a segment with index j from a document with id i is requested for the first time, it is fetched from disk and stored in memory under the key (i, j) (using a simple hash map). Whenever we request the segment with id pair (i, j) again, we fetch it directly from memory. When a pre-specified amount of memory, the capacity of the cache, is exceeded, we evict the least recently used (LRU) segment(s) from the cache until there is enough space again.
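Such a segment cache can be sketched as follows (a minimal C++ sketch, not our actual implementation: the capacity is counted in segments rather than bytes for simplicity, a std::map stands in for the hash map, and all names are hypothetical). A doubly linked list keeps the segments in recency order, and the map points into it:

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>
#include <utility>

// LRU cache for segments, keyed by (document id, segment index).
class SegmentCache {
    typedef std::pair<int, int> Key;                  // (doc id, segment index)
    typedef std::list<std::pair<Key, std::string> > Lru;
    std::size_t capacity_;                            // max #segments held
    Lru lru_;                                         // front = most recent
    std::map<Key, Lru::iterator> index_;
public:
    explicit SegmentCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached segment text, or nullptr on a miss.
    const std::string* get(int doc, int seg) {
        std::map<Key, Lru::iterator>::iterator it = index_.find(Key(doc, seg));
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recent
        return &lru_.front().second;
    }

    void put(int doc, int seg, const std::string& text) {
        if (get(doc, seg) != nullptr) {               // already cached: refresh
            lru_.front().second = text;
            return;
        }
        lru_.push_front(std::make_pair(Key(doc, seg), text));
        index_[Key(doc, seg)] = lru_.begin();
        if (lru_.size() > capacity_) {                // evict the LRU segment
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }
};
```

Both get and put run in O(log n) here (O(1) amortized with a hash map), and the splice call is what makes the recency update cheap.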

In this way, the index-based approach can potentially make much better use of the available memory than whole-document caching. Note that storing only selected segments would be of no use for the document-based approach, because for a new pair of query and document we do not know which segments to retrieve before having scanned the whole document.

Turpin et al. address this problem by moving segments with likely high scores to the beginning of the documents. This, however, risks an occasional loss of relevant segments, which may not be tolerable in some applications. Note that we can easily implement this heuristic with our fine-grained cache, whereas the document-based approach would require additional code for a (static) extraction of promising segments. In fact, no performance measurements were provided for this heuristic in [TTHW07], which suggests that they did not implement it.

4.5 Integration with the CompleteSearch engine

We have integrated our snippet generation algorithm with CompleteSearch, a search engine built on top of a novel data structure and query algorithm for context-sensitive prefix search and completion [BW07]. CompleteSearch supports query operators and advanced queries such as error-tolerant search and semantic search, described in Section 3.2. The core data structure is a positional inverted index, conceptually identical to the one we presented in Section 3.1 and used to explain our work. In the remainder of this section we describe the main steps of the integration process in the order dictated by (I1), (I2), and (I3).

The first step of the integration was to use the information provided through CompleteSearch's positional index as input for the first step of our method, (I1). No modifications of the inverted index were required for realizing this. We have used the output for processing a query, provided by the engine, as input for step (I1), making this as efficient as possible by reusing information from the cache/history. More precisely, for a query of the general form Q1 ◦1 · · · ◦l−1 Ql, CompleteSearch will provide as result a posting list D for the top-ranked documents. In order to compute this list, the individual posting lists for the query words, Q1 to Ql, must be loaded from disk. Once these lists are loaded, CompleteSearch does not store the list for every individual query word in the cache, but it does store the results for each prefix, i.e., for Q1, for Q1 ◦1 Q2, for Q1 ◦1 Q2 ◦2 Q3, etc. The lists corresponding to the prefixes can then be retrieved from the cache as needed (assuming nothing else has been overwritten).

We have computed the positions for each of the query words in the top-ranked documents going from right to left. The positions for Ql are found in the posting list D itself. The positions for Ql−1 are obtained by computing the intersection D ◦l−1⁻¹ (Q1 ◦1 · · · ◦l−2 Ql−1). Let us call the resulting posting list of this intersection Pl−1. At this step, we already have the result of Q1 ◦1 · · · ◦l−2 Ql−1 in the history. The essential observation to make here is that no additional data is loaded from disk, and therefore no disk seeks (which are expensive operations) are involved.

In order to compute the positions for Ql−2 we need to compute Pl−1 ◦l−2⁻¹ (Q1 ◦1 · · · ◦l−3 Ql−2). Similarly to the previous computation, the result of Q1 ◦1 · · · ◦l−3 Ql−2 is already in the cache. The process goes on in the same way until we have computed the positions of all query words. This way, the computation of step (I1) is very fast, since we only intersect short posting lists that are retrieved from the history. We have already presented a fast intersection algorithm for posting lists in Subsection 4.1.2.

The second important step of the integration process was to provide the necessary input for step (I2) of our method. (I2) uses as input the output of step (I1) and the document database file, which stores all the documents of the collection. The document database file of CompleteSearch had a different structure and content before the integration of our snippet generation method. The first important change here was to modify the structure of the file in order to include, for each document, the corresponding segment bounds required as input for step (I2). This information is stored in the header of each document. The way documents are stored has also been changed: as already described in Section 4.3, the document database now contains compressed blocks of text, such that each of these blocks can be retrieved and decompressed individually.

The last step of the integration process, required in order to accommodate step (I3), was to combine the output from step (I2) with fast access to the text in the documents, necessary for producing the snippets. This was a challenge in terms of designing the structure of the document database, as already discussed, but also in designing and integrating a snippet caching module.


Chapter 5

Index-based versus Document-based Snippet Generation

This chapter presents in detail the problems of the document-based approach and the way in which the index-based approach is able to overcome these problems. We argue that index-based snippet generation is superior to document-based snippet generation in several practically relevant respects.

5.1 Non-literal matches

As discussed in Chapter 3, one of the central problems in text retrieval is the vocabulary mismatch problem, that is, when a relevant document does not literally contain (one of) the query words. A common way to overcome the vocabulary mismatch problem is to index documents under terms that are related to words in the document, but do not literally occur in it themselves.

A good example for this is the semantic query concept:city, which means that we are looking for all the names of cities in the collection and not for the occurrences of the word city. We have marked this fact by prefixing the query with the concept keyword. Suppose we have a collection that contains only Rome and Alexandria as names of cities, with the following inverted lists:

• Alexandria

doc ids      D64   D88
positions     33    66
word ids     W42   W42
scores        17    19

• Rome




doc ids      D31   D87   D99
positions     12    42    11
word ids     W67   W67   W67
scores         3    10    12

Then the inverted list for the term concept:city is composed of all the postings belonging to the previous two terms:

• concept:city

doc ids      D31   D64   D87   D88   D99
positions     12    33    42    66    11
word ids     W67   W42   W67   W42   W67
scores         3    17    10    19    12
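Building such a combined list amounts to a merge by document id. The following is a hedged C++ sketch (with a simplified posting struct; not the actual index-construction code): the postings keep their original word ids, so later stages can still tell which literal word matched:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Posting {
    int docId;
    int pos;
    int wordId;
    int score;
};

// The inverted list of an artificial term like concept:city is the
// merge, by document id, of the lists of all terms mapped to it. Both
// inputs must already be sorted by docId.
std::vector<Posting> mergeConceptLists(const std::vector<Posting>& a,
                                       const std::vector<Posting>& b) {
    std::vector<Posting> out(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin(),
               [](const Posting& x, const Posting& y) {
                   return x.docId < y.docId;
               });
    return out;
}
```

On the Alexandria and Rome lists above, this yields exactly the concept:city list shown, with doc ids D31, D64, D87, D88, D99 in order.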

In this manner, there is a variety of advanced search features which call for non-literal matches: related-words search (as in our example), prefix search (for a query containing alg, find algorithm), error-correcting search (for a query containing algorithm, find the misspelling algoritm), semantic search (for a query musician, find the entity reference John Lennon), etc.

The task of identifying such relations and enhancing the index correspondingly is outside the scope of this work. The problem we consider here is: given such enhancements, how to generate snippets containing the corresponding non-literal matches.

The document-based approach needs additional effort to achieve this. An obvious extension would be to add the additional terms not only to the index but also to the documents. This extension has several disadvantages: First, it complicates and slows down the procedure for extracting matching segments from a document. Second, it blows up the space required for each document (which also affects snippet generation time, because the document has to be read from disk). Third, inserting additional terms into documents after the parsing, as required in many cases, is hard. Fourth, this is an instance of code duplication, discussed as a separate issue below.

The index-based approach does not have any of these problems. After the top-ranked documents and corresponding positions have been computed in steps (I0) and (I1), steps (I2) and (I3) are completely oblivious of what the query words were, and only deal with positions.

5.2 Code duplication

A major systems-engineering problem of the document-based approach is that each and every search feature which defines what constitutes a hit has to be implemented twice: once for computing the ids of the best matching documents from the index, and then again for generating appropriate snippets.

Take as an example the phrase query roman.emperor and the following text:

"The Roman Emperor was the ruler of the Roman State during the imperial period (starting at about 27 BC). The Romans had no single term for the office: Latin titles such as imperator (from which English emperor ultimately derives), augustus, caesar and princeps were all associated with it."

Then, as results for this query, only the occurrences of roman followed immediately by emperor will be considered (they are emphasized in bold in the text above). The non-adjacent occurrences of roman and emperor (marked in italics) are not considered, since our query contains the phrase operator between the two words. Consider again the inverted lists of the two query words:

• roman

doc ids      D7    D23   D47   D47   D63
positions     6     15    26    38    87
word ids     W12   W12   W12   W12   W12
scores        2     10    16     8     7

• emperor

doc ids      D3    D7    D47   D63
positions     90    31    27    81
word ids     W34   W34   W34   W34
scores         9    10    16     1

The intersection procedure will simply check whether there are any consecutive positions for the common document ids in the two lists. In our example, only for document id D47 do we have the consecutive positions 26 and 27, so the resulting inverted list for our query is:

doc ids      D47
scores        32
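For two posting lists sorted by document id and position, this check can be sketched as a lockstep scan (a simplified C++ sketch of a phrase intersection, not the engine's actual routine; postings are reduced to document id and position):

```cpp
#include <cassert>
#include <vector>

struct Posting {
    int docId;
    int pos;
};

// Phrase intersection for roman.emperor-style queries: walk both lists
// (sorted by docId, then position) in lockstep and keep a posting of
// the second word only if the first word occurs at the immediately
// preceding position in the same document.
std::vector<Posting> phraseIntersect(const std::vector<Posting>& a,
                                     const std::vector<Posting>& b) {
    std::vector<Posting> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].docId != b[j].docId) {
            (a[i].docId < b[j].docId) ? ++i : ++j;   // skip to common document
        } else if (a[i].pos + 1 == b[j].pos) {
            out.push_back(b[j]);                     // adjacent pair found
            ++i;
            ++j;
        } else {
            (a[i].pos + 1 < b[j].pos) ? ++i : ++j;   // advance the laggard
        }
    }
    return out;
}
```

On the two lists above, the scan reports exactly one posting: document D47 at position 27 (the occurrence of emperor directly after roman at position 26).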

The phrase operator needs to be considered in step (D0), just as presented in the previous example. This way, only those documents are considered as hits where the two parts roman and emperor occur one after another. Then it needs to be considered again in step (D1), so that we only extract (or at least prefer) snippets where the two parts of the query occur in proximity of each other. This means that the document must be scanned and all occurrences of the query words considered:

"The Roman Emperor was the ruler of the Roman State during the imperial period (starting at about 27 BC). The Romans had no single term for the office: Latin titles such as imperator (from which English emperor ultimately derives), augustus, caesar and princeps were all associated with it."

While scanning the document, step (D1) must check, for each occurrence of the query word roman, whether the next word is emperor. Only in this case will the corresponding part of the text be considered for a snippet. Since (D1) does not have access to the positional information used in (D0) — that is exactly the difference between the document-based and the index-based approach — very similar functionality has to be implemented twice. This is time-consuming, error-prone, and hard to maintain.



The same holds true for any query operator: simple ones like phrase (require that two query words are adjacent), proximity (a query word must be in the proximity of another), and disjunction (find any one of a group of two or more query words, as in general|commander), as well as more complex ones like the join operator described in [BW07] (find all authors who have published in both SIGIR and SIGMOD).

The index-based approach does not have any of these problems. Every query operator has to be implemented only once, on the query processing side, such that document ids and positions are computed which satisfy the requirements of the respective operator. After that, (I2) and (I3) compute only with those valid positions.

5.3 Large documents

A less severe but still relevant problem of the document-based approach is posed by very large documents. Since step (D1) has no access to the positions of the query words, in the worst case the whole document has to be scanned to find all required occurrences of the query words. A possible remedy would be to build a small index data structure for every document, in order to be able to find occurrences of query words more efficiently. In fact, the Lucene search engine, in its most recent version, does exactly this; see Chapter 2. But that incurs additional space and additional implementation work. In fact, it would again mean re-implementing functionality which, in principle, has already been realized for the index computation in step (D0).

The index-based approach does not have that problem, provided that we represent our documents in a way that allows efficient retrieval of those parts covering a given list of positions. As described in Chapter 4, our simple block-wise compression and storage scheme suffices.

5.4 Caching

As shown in [TTHW07] and re-validated in Chapter 6, the bulk of the snippet generation time is spent on disk seeks. Serving as many segments as possible from memory instead of from disk is therefore key to high-performance snippet generation.

The document-based approach dictates a very coarse caching granularity: for each document, either its whole text is cached or nothing at all. This is because finding the matching segments first requires a scan over the whole document (unless approximate solutions are tolerated; see Chapter 6).

The index-based approach trivially permits much more fine-grained caching: step (I2) produces a request for the text of a particular segment without the need to retrieve parts of the document first. It can therefore be decided individually for each segment whether to cache it or not.


Chapter 6

Experiments

We ran experiments to evaluate our implementation of index-based snippet generation, and in particular compared it against the document-based approach.

All our experiments were run on a machine with two dual-core AMD Opteron processors (2.8 GHz and 1 MB cache each) and 16 GB of main memory, running Linux 2.6.20 in 32-bit mode. All our code is single-threaded, that is, each run used only one core at a time. All our code is in C++ and was compiled with g++ version 4.1.2, with option -O3. Before each run of an experiment, we repeatedly read a 16 GB file from disk, until read times suggested that it was entirely cached in main memory (and thus all previous contents flushed from there).

6.1 Datasets and Queries

We selected three test collections and queries, each with a different characteristic:

DBLP Plus: 31,211 computer science articles listed in DBLP, with full text and metadata, and 20,000 queries extracted from a real query log, for example matrix..factor. The index was built with support for error correction and prefix search. All queries were run as proximity queries, as in the example of Figure 1.1, where .. means that the query words must occur within five words of each other.

This collection was chosen to test our approach on advanced query operators, in this case proximity search and boolean or, and on some forms of query expansion, in this case prefix search and error correction.

Semantic Wikipedia: 2,843,056 articles from the English Wikipedia with an integrated ontology, as described in [BCSW07], and 500 semantic queries from [BCSW07], for example finland city, where city does not only match the literal word but also all instances of that class. To compute these matches, information from several documents needs to be joined; for details, we refer to [BCSW07].

This collection was chosen to test our approach on queries that require a join of information from two or more documents for each hit. The document-based approach would not be able to produce query-dependent snippets in this case, because the hit document does not contain all the required information to assess which words or phrases actually matched the query.




TREC Terabyte: 25,204,103 documents from the TREC GOV2 corpus and 100,000 queries from the TREC 2006 efficiency track, for example: first landing moon. More information about the TREC GOV2 corpus can be found at http://www-nlpir.nist.gov/projects/terabyte.

This collection was chosen to test the efficiency of our approach on a very large collection and for a large number of queries. Note that Turpin et al. experimented with the somewhat smaller wt100g collection (102 GB of raw data) combined with the somewhat larger Excite query log (535,276 queries).

6.2 Space consumption

We built the document database as described in Section 4.3, with each block covering 1000 positions, which amounts to an average of around 4 KB of compressed text. This gave the best snippet generation times overall.

Collection   Raw Data   Index     Docs DB   Turpin

DBLP Plus    8.7 GB     0.47 GB   0.35 GB   0.42 GB

Wikipedia    12.7 GB    4.52 GB   3.41 GB   3.53 GB

Terabyte     426 GB     38.2 GB   36.3 GB   45.8 GB

Table 6.1: Raw size of the data (html, pdf, doc, text), size of the index (used for query processing), and size of the document databases (for snippet generation) for our implementation of the index-based approach (fourth column) and for Turpin et al.'s implementation of the document-based approach (fifth column), on all three test collections.

Table 6.1 shows that the size of our document database is roughly equal to the size of the index used for query processing (which is of a different kind for each of the collections, see Section 6.1). Our document database is slightly smaller than the one described by Turpin et al. [TTHW07]. This is because in our index-based approach every block is compressed as a whole (using zlib), which compresses the original text¹ to about 40%, whereas the id-wise compression of Turpin et al. achieves a compression ratio of only about 50%.

A part of our document database is auxiliary data needed to locate the document data within the single big file, as well as information on the segment bounds and the block offsets of each document. This auxiliary data takes about 5% of the total size of the document database for each of our three collections.

6.3 Snippet generation time

We ran our index-based snippet generation implementation on all three test collections, for the queries described in Section 6.1. On the large Terabyte collection, we compared it to the document-based implementation of Turpin et al.

¹We measure compression ratio with respect to the plain text, with all markup and tags removed. Turpin et al. measured their compression ratio with respect to the size of the original html, pdf, etc. files.



Collection   I0      I1      I2       I3

DBLP Plus    45ms    0.2ms   6.0ms    13.7ms

Wikipedia    113ms   0.1ms   51.6ms   8.7ms

Terabyte     144ms   0.9ms   37.5ms   12.2ms

Table 6.2: Breakdown of the snippet generation times of our implementation of the index-based approach by steps I1, I2, and I3. Step I0 is the ordinary query processing, which provides the ids of the top-ranked documents. All entries are averages per query, that is, the times in the last three columns relate to the generation of 10 snippets at a time.

Collection   I1 - I3   Turpin

Terabyte     54.6ms    49.5ms

Table 6.3: Snippet generation times of our implementation of the index-based approach compared to Turpin et al.'s implementation of the document-based approach.

The main observation from Table 6.2 is that the query rotation of step (I1), which provides the basis for all other steps in the index-based approach, takes negligible time compared to all other steps. The bulk of the time spent for (I2) and (I3) is disk seek time; that is, the cost of our snippet generation is essentially the cost of the required disk seeks. Turpin et al. came to the same conclusion for their document-based approach. The time for (I2), which is mainly spent seeking the list of segment boundaries on disk, is somewhat higher for Wikipedia than for the other datasets, because of the small number of queries: that way, hardly any information was cached.

In fact, the table shows that, unlike query processing times, snippet generation times do not depend on the size and nature of the collection, but rather on the number of queries. We also measured a slight dependency on the average document size (which is largest for DBLP Plus: 32 kB, and smallest for Terabyte: 10 kB), which influences the length of the list of segment bounds read in step (I2).

Table 6.3 shows that, for ordinary keyword queries, our implementation of the index-based approach and Turpin et al.'s implementation of the document-based approach are about equally fast. Note that there is no way to be much faster for basic keyword search, but that the document-based approach simply cannot be used as is for more complex queries like those for DBLP Plus and Wikipedia.

Worst-case snippet generation times are much worse for the document-based approach, however: 1,453 of the 420,316 queries on Terabyte required more than 1 second (due to very large documents being involved), whereas for our index-based approach snippet generation never exceeded 100 milliseconds. Turpin et al. suggested reading only the most significant segments from each document, at the price of a loss of exactness, but they did not evaluate the efficiency of this heuristic.


6.4 Caching

As explained in Section 4.4, the index-based approach allows caching on the level of individual segments, instead of just on the level of whole documents as for the document-based approach. To evaluate the potential of this finer granularity, we evaluated the following two caching strategies on the Terabyte collection. For each strategy we used the first half of the queries to fill the cache, and then measured the hit ratio for the second half of the queries. The figures in Table 6.2 were obtained using the whole-document strategy. The cache eviction strategy was LRU throughout, as explained in Section 4.4.

Top segments: Only those segments which were actually displayed to the user (those with the top 3 scores in our case) were cached.

Whole document: All segments from each of the top-ranked documents were cached. This is equivalent to the caching strategy described in Turpin et al.
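The two granularities can be sketched as follows. This is an illustrative model, not the thesis implementation: capacity is counted in number of items rather than bytes, the `LRUCache` class and the `get_or_load` interface are hypothetical names, and reading segments from disk is stubbed out by a loader callback.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache with capacity counted in items (the experiments count bytes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def get_or_load(self, key, load):
        self.lookups += 1
        if key in self.data:
            self.data.move_to_end(key)        # mark as most recently used
            self.hits += 1
            return self.data[key]
        value = load()                        # cache miss: fetch from disk
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict least recently used
        return value

    def hit_ratio(self):
        return self.hits / self.lookups if self.lookups else 0.0

# Whole-document caching keys the cache by document id only; top-segments
# caching keys it by (document id, segment id), storing just the displayed
# segments, so each entry is much smaller.
doc_cache = LRUCache(capacity=1000)
seg_cache = LRUCache(capacity=1000)

def serve_hit(doc_id, shown_segment_ids):
    doc_cache.get_or_load(doc_id, lambda: f"all segments of {doc_id}")
    for seg_id in shown_segment_ids:
        seg_cache.get_or_load((doc_id, seg_id), lambda: f"{doc_id}:{seg_id}")
```

Because a top-segments entry covers only the 3 displayed segments instead of a whole document, the same byte budget holds many more entries, which is what drives the hit-ratio difference in Table 6.4.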

Caching scheme    256 MB         512 MB         1024 MB
Top segments      74% (70 MB)    74% (70 MB)    74% (70 MB)
Whole document    42% (256 MB)   60% (512 MB)   71% (1024 MB)

Table 6.4: Cache hit ratios for two caching strategies and three different cache capacities, for Terabyte with the first half of the queries used for filling the cache. The numbers in parentheses give the total size of the cached items after all queries have been processed.

The fine-grained top-segments caching indeed gives a significantly better cache hit ratio than the coarse whole-document caching, and the difference is especially pronounced (almost a factor of two) for small cache sizes. In fact, our number of queries was too small to make use of larger cache sizes for the top-segments caching: only 70 MB are sufficient to store all snippets returned for all top-10 hits for all our 420,316 queries. For larger query logs (which we could not get hold of), we would expect the cache hit ratio for the top-segments approach to go up further.

In fact, we can also provide a good theoretical justification for the behavior we observe in Table 6.4. We verified that documents returned as one of the top-10 hits, as well as segments returned as one of the top-3 segments of these documents, both follow a generalized Zipf distribution. That is, for a fixed hit / snippet, the probability of seeing the i-th most frequently seen document / segment is proportional to i^(-α), for some parameter α > 0. For the distribution of documents, we measured a parameter α_D = 0.75 (this is in accordance with results reported in [BCF+99]). For the distribution of segments, we measured a parameter of α_S = 0.80.

Now the asymptotic cache hit ratio of a cache that can hold m items, with n items appearing according to a Zipf distribution with parameter α < 1, can be approximated by c · (m/n)^(1-α), for some constant c > 0 [BCF+99]. For the last column in Table 6.4, the ratio m/n is about 1/5 for both the top-segments and the whole-document cache. (Note that m and n increase by the same factor when going from caching whole documents to caching individual segments; the factor is just the average number of segments in a document.) This suggests a difference of a factor 5^0.05 ≈ +9% in favor of the top-segments approach, which agrees well with what we observe in Table 6.4. Note that, as we already mentioned above, the hit ratio for the top-segments cache has certainly not reached its asymptotic limit yet. Also note that the advantage increases as m/n decreases, that is, as a smaller fraction of items fit into the cache.
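The arithmetic behind this estimate can be checked directly; the function name `zipf_hit_ratio` is ours, and the parameter values (α_D = 0.75, α_S = 0.80, m/n ≈ 1/5) are the measured figures quoted above.

```python
# Asymptotic LRU hit ratio for n items requested according to a Zipf
# distribution with parameter alpha < 1, using a cache of m items [BCF+99]:
#     hit_ratio ~ c * (m / n) ** (1 - alpha)
def zipf_hit_ratio(m_over_n, alpha, c=1.0):
    return c * m_over_n ** (1 - alpha)

alpha_docs = 0.75   # measured Zipf parameter for top-10 documents
alpha_segs = 0.80   # measured Zipf parameter for top-3 segments
m_over_n = 1 / 5    # both caches hold about a fifth of the distinct items

# Assuming the same constant c for both caches, it cancels in the ratio:
#     (1/5)**(1 - 0.80) / (1/5)**(1 - 0.75) = 5**0.05
advantage = zipf_hit_ratio(m_over_n, alpha_segs) / zipf_hit_ratio(m_over_n, alpha_docs)
print(f"factor {advantage:.3f}")
```

This prints a factor of about 1.08, i.e. the roughly +9% advantage cited above, and the cancellation of c makes clear why the comparison does not depend on the unknown constant.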


Chapter 7

Conclusions and Future Work

In this thesis we have approached the problem of query-biased snippet generation, in particular the efficiency aspects associated with it. We have developed and studied a novel index-based approach that makes use of the information from a standard positional inverted index. We showed that this approach is as efficient as the standard document-based approach, but is superior in three practically relevant aspects:

• (i) non-literal matches of query words with document text are not a problem;

• (ii) advanced query operators need not be re-implemented on the snippet generation side;

• (iii) snippet generation times do not degenerate for large documents, because only those parts of the documents containing the snippets to be displayed are read.

We have efficiently integrated our index-based snippet generation method with the CompleteSearch engine and ran various experiments using three test collections: DBLP Plus, Semantic Wikipedia, and TREC Terabyte. We pointed out that index-based snippet generation allows much more fine-grained caching than the standard document-based approach. We found that this finer-grained caching gives a significantly larger cache hit ratio, and we gave a simple theoretical explanation. It would be interesting to test the limits of the fine-grained caching with a much larger query log.

Throughout this work we took it for granted that query-dependent snippets are desirable and help retrieval effectiveness also for advanced queries (synonym search, error-tolerant search, semantic search, etc.) and advanced query operators (phrase search, proximity search, etc.). It would be interesting and important to evaluate this belief in a user study.


Bibliography

[AM06] Vo Ngoc Anh and Alistair Moffat. Pruned query evaluation using pre-computed impacts. In 29th Conference on Research and Development in Information Retrieval (SIGIR'06), pages 372–379, 2006.

[AMS97] A. Moffat, J. Zobel, and N. Sharman. Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering, pages 302–313, 1997.

[BCF+99] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM'99, pages 126–134, 1999.

[BCSW07] Holger Bast, Alexandru Chitea, Fabian M. Suchanek, and Ingmar Weber. ESTER: Efficient search on text, entities, and relations. In 30th Conference on Research and Development in Information Retrieval (SIGIR'07), pages 671–678, 2007.

[BJP05] B.J. Jansen, A. Spink, and J. Pedersen. A temporal comparison of AltaVista web searching. J. Am. Soc. Inf. Sci. Tech. (JASIST), pages 559–570, 2005.

[BMS+06] Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, and Gerhard Weikum. IO-Top-k: Index-access optimized top-k query processing. In 32nd Conference on Very Large Data Bases (VLDB'06), pages 475–486, 2006.

[BP98] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW7, pages 107–117, 1998.

[Buc04] Chris Buckley. Why current IR engines fail. In 27th Conference on Research and Development in Information Retrieval (SIGIR'04), pages 584–585, 2004.

[BW07] Holger Bast and Ingmar Weber. The CompleteSearch engine: Interactive, efficient, and towards IR&DB integration. In 3rd Biennial Conference on Innovative Data Systems Research (CIDR'07), pages 88–95, 2007.

[Cut] D. Cutting et al. Lucene. http://lucene.apache.org.

[GA07] J.-L. Gailly and M. Adler. Zlib compression library. www.zlib.net, accessed January 2007.

[GKMC99] Jade Goldstein, Mark Kantrowitz, Vibhu O. Mittal, and Jaime G. Carbonell. Summarizing text documents: Sentence selection and evaluation metrics. In 22nd Conference on Research and Development in Information Retrieval (SIGIR'99), pages 121–128, 1999.


[HL03] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In SOSP'03, pages 29–43, 2003.

[IHWB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Morgan Kaufmann, 1999.

[KPC95] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In SIGIR'95, pages 68–73, 1995.

[Luh58] H.P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, pages 159–165, 1958.

[SSJ01] T. Sakai and K. Spärck Jones. Generic summaries for indexing in information retrieval. In SIGIR'01, pages 190–198, 2001.

[TS98] Anastasios Tombros and Mark Sanderson. Advantages of query biased summaries in information retrieval. In 21st Conference on Research and Development in Information Retrieval (SIGIR'98), pages 2–10, 1998.

[TTHW07] Andrew Turpin, Yohannes Tsegay, David Hawking, and Hugh E. Williams. Fast generation of result snippets in web search. In 30th Conference on Research and Development in Information Retrieval (SIGIR'07), pages 127–134, 2007.

[VH06] Ramakrishna Varadarajan and Vagelis Hristidis. A system for query-specific document summarization. In 15th Conference on Information and Knowledge Management (CIKM'06), pages 622–631, 2006.

[WJR03] Ryen White, Joemon M. Jose, and Ian Ruthven. A task-oriented study on the influencing effects of query-biased summarisation in web searching. Information Processing & Management, 39(5):707–733, 2003.

[WRJ02] Ryen White, Ian Ruthven, and Joemon M. Jose. Finding relevant documents using top ranking sentences: an evaluation of two alternative schemes. In 25th Conference on Research and Development in Information Retrieval (SIGIR'02), pages 57–64, 2002.