Modern Information Retrieval Chapter 8 Indexing and Searching

Upload: deepak

Post on 23-Jan-2016



TRANSCRIPT

Page 1: Modern Information Retrieval

Modern Information Retrieval

Chapter 8 Indexing and Searching

Page 2: Modern Information Retrieval

It is worthwhile to build and maintain an index when the text collection is large and semi-static

semi-static: not often updated

consider search cost, space overhead, construction cost, and maintenance cost

Page 3: Modern Information Retrieval

Inverted file

a word-oriented index

vocabulary: the set of all different words in the text

occurrences: lists of the text positions where the words appear

the positions can refer to words or characters
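The vocabulary and occurrence lists described above can be sketched in a few lines. This is only an illustration: the function name `build_inverted_index`, the sample sentence, the lower-casing, and the `\w+` tokenizer are assumptions, not part of the slides. Positions here are 0-based character offsets.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Return {word: [character offsets where the word starts]}."""
    occurrences = defaultdict(list)
    # Lower-casing and the \w+ tokenizer are simplifying assumptions.
    for match in re.finditer(r"\w+", text.lower()):
        occurrences[match.group()].append(match.start())
    return dict(occurrences)


sample = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(sample)
vocabulary = sorted(index)  # the set of all different words, in order
print(vocabulary)
print(index["text"], index["words"])  # [10, 18] [32, 39]
```

Storing word numbers instead of character offsets only requires replacing `match.start()` with a running word counter.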

Page 4: Modern Information Retrieval
Page 5: Modern Information Retrieval

the space required for the vocabulary is rather small, while the occurrences demand much more space, between 30% and 40% of the text size

block addressing reduces the space overhead to 5%

Page 6: Modern Information Retrieval

if the exact occurrence positions are required, an online search over the qualifying blocks has to be performed
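A minimal sketch of block addressing, under stated assumptions: blocks are formed from a fixed number of words (`words_per_block`, a hypothetical parameter), the index stores only block numbers, and exact positions are recovered by scanning the qualifying blocks. All names are illustrative.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_block_index(text, words_per_block):
    """Return ({word: [block numbers]}, [(start, end) character span per block])."""
    tokens = list(WORD.finditer(text))
    block_spans = []
    index = defaultdict(list)
    for block_no, first in enumerate(range(0, len(tokens), words_per_block)):
        group = tokens[first:first + words_per_block]
        block_spans.append((group[0].start(), group[-1].end()))
        for token in group:
            postings = index[token.group().lower()]
            # Record each block at most once per word.
            if not postings or postings[-1] != block_no:
                postings.append(block_no)
    return dict(index), block_spans


def exact_positions(text, index, block_spans, word):
    """Scan only the qualifying blocks to recover exact character offsets."""
    word = word.lower()
    positions = []
    for block_no in index.get(word, []):
        start, end = block_spans[block_no]
        for token in WORD.finditer(text, start, end):
            if token.group().lower() == word:
                positions.append(token.start())
    return positions


sample = "This is a text. A text has many words. Words are made from letters."
block_index, spans = build_block_index(sample, words_per_block=4)
print(block_index["text"])                                  # [0, 1]
print(exact_positions(sample, block_index, spans, "text"))  # [10, 18]
```

The index is smaller because a word that occurs several times in one block contributes a single entry, at the price of the extra scan.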

Page 7: Modern Information Retrieval

searching the inverted file

vocabulary search: the words present in the query are searched separately in the vocabulary

retrieval of occurrences: the lists of the occurrences of all the words found are retrieved

Page 8: Modern Information Retrieval

manipulation of occurrences: the lists are traversed in synchronization to find places where all the words appear in sequence for a phrase query or appear close enough for a proximity query

how to efficiently manipulate the occurrences when block addressing is used?
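The synchronized traversal for phrase and proximity queries can be sketched as a merge over two sorted position lists. The sketch assumes word-level positions (not block addressing), and the function names and sample sentence are illustrative only.

```python
import re
from collections import defaultdict


def word_position_index(text):
    """Return {word: [word numbers]}, counting words from 0."""
    index = defaultdict(list)
    for number, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(number)
    return dict(index)


def phrase_positions(first, second):
    """Positions p in `first` such that p + 1 occurs in `second` (both sorted)."""
    hits, i, j = [], 0, 0
    while i < len(first) and j < len(second):
        target = first[i] + 1
        if second[j] == target:
            hits.append(first[i])
            i += 1
            j += 1
        elif second[j] < target:
            j += 1
        else:
            i += 1
    return hits


def proximity_pairs(a, b, max_gap):
    """Pairs (p, q), p from `a` and q from `b`, with |p - q| <= max_gap."""
    pairs, start = [], 0
    for p in a:
        # Skip entries of b that are too far to the left of p.
        while start < len(b) and b[start] < p - max_gap:
            start += 1
        j = start
        while j < len(b) and b[j] <= p + max_gap:
            pairs.append((p, b[j]))
            j += 1
    return pairs


sample = "This is a text. A text has many words. Words are made from letters."
positions = word_position_index(sample)
print(phrase_positions(positions["a"], positions["text"]))         # [2, 4]
print(proximity_pairs(positions["text"], positions["words"], 3))   # [(5, 8)]
```

A phrase of more than two words can be handled by chaining `phrase_positions` over consecutive word pairs.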

Page 9: Modern Information Retrieval

constructing the inverted file

Page 10: Modern Information Retrieval

once constructed, it is written to disk in two files

the lists of occurrences are stored contiguously in the first file

in the second file, the vocabulary is stored in lexicographical order with a pointer for each word to its list in the first file
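A rough sketch of the two-file layout follows. For brevity it builds the two "files" as in-memory objects rather than writing to disk, encodes each position as a 4-byte little-endian integer, and stores a count next to each pointer; all three choices, and the function names, are assumptions made for illustration.

```python
import struct
from bisect import bisect_left


def serialize(index):
    """Return (occurrences, vocabulary) for a {word: [positions]} mapping.

    occurrences: bytes holding all lists contiguously (the "first file").
    vocabulary: sorted list of (word, byte offset, count) (the "second file").
    """
    occurrences = bytearray()
    vocabulary = []
    for word in sorted(index):
        positions = index[word]
        vocabulary.append((word, len(occurrences), len(positions)))
        occurrences += struct.pack(f"<{len(positions)}I", *positions)
    return bytes(occurrences), vocabulary


def lookup(word, occurrences, vocabulary):
    """Binary-search the vocabulary, then follow the pointer into the lists."""
    words = [entry[0] for entry in vocabulary]
    i = bisect_left(words, word)
    if i == len(words) or words[i] != word:
        return []
    _, offset, count = vocabulary[i]
    return list(struct.unpack_from(f"<{count}I", occurrences, offset))


occurrences, vocabulary = serialize({"text": [10, 18], "words": [32, 39], "a": [8, 16]})
print(vocabulary)
print(lookup("text", occurrences, vocabulary))  # [10, 18]
```

Keeping the vocabulary in lexicographical order is what makes the binary search in `lookup` possible.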

Page 11: Modern Information Retrieval

Suffix trees and suffix arrays

can be used to index any text character

allow more complex queries to be answered efficiently

index points are selected from the text; they point to the beginning of the text positions that will be retrievable

each position is considered as a text suffix

each suffix is uniquely identified by its position

Page 12: Modern Information Retrieval
Page 13: Modern Information Retrieval

a suffix tree is a trie data structure built over all the suffixes of the text

the pointers to the suffixes are stored at the leaf nodes

the trie is compacted into a Patricia tree, where unary paths are compressed

an indication of the next character position to consider is stored at the nodes which root a compressed path
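To make the trie idea concrete, here is a deliberately simplified sketch: an uncompacted suffix trie built with nested dictionaries. It is not a Patricia tree (unary paths are not compressed, so it uses far more space), it assumes the terminator `$` does not occur in the text, and the names are illustrative.

```python
def build_suffix_trie(text, index_points=None):
    """Insert the suffix starting at each index point into a character trie."""
    if index_points is None:
        index_points = range(len(text))
    root = {}
    for start in index_points:
        node = root
        # "$" is an assumed terminator that must not occur in the text.
        for ch in text[start:] + "$":
            node = node.setdefault(ch, {})
        node["pos"] = start  # leaf: pointer to the suffix
    return root


def find(root, pattern):
    """Return the sorted start positions of all suffixes that begin with pattern."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abracadabra")
print(find(trie, "abra"))  # [0, 7]
```

Passing only word beginnings as `index_points` restricts retrieval to those positions, as the slides describe.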

Page 14: Modern Information Retrieval

space overhead: 120% to 240% over the text size

Page 15: Modern Information Retrieval

suffix arrays provide the same functionality with much lower space requirements

a suffix array is an array containing all the pointers to the suffixes in lexicographical order

space requirements close to 40% overhead
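As an illustration, a suffix array can be produced by sorting the index points by the suffix each one starts. This naive construction is only a sketch for small inputs (it compares whole suffixes); the function name and the examples are assumptions.

```python
def build_suffix_array(text, index_points=None):
    """Return the index points sorted by the suffix that starts at each one."""
    if index_points is None:
        index_points = range(len(text))  # index every character
    return sorted(index_points, key=lambda start: text[start:])


print(build_suffix_array("banana"))                       # [5, 3, 1, 0, 4, 2]
# Indexing word beginnings only:
print(build_suffix_array("this is a text", [0, 5, 8, 10]))  # [8, 5, 10, 0]
```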

Page 16: Modern Information Retrieval

suffix arrays allow binary searches, done by comparing the contents of each pointer (the suffix it points to)

a supra-index over the suffix array is used to reduce the number of disk accesses

compare with an inverted file
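The binary search can be sketched as two bisections that delimit the range of suffixes starting with the pattern. This is an in-memory illustration with assumed names; it does not model the supra-index or disk accesses.

```python
def search(text, suffix_array, pattern):
    """Return the sorted text positions where pattern occurs."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the suffix referenced by entry k.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first entry whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first entry whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "banana"
suffix_array = sorted(range(len(text)), key=lambda start: text[start:])
print(search(text, suffix_array, "ana"))  # [1, 3]
```

Each comparison reads the text at the position a pointer refers to, which is why reducing the number of probes matters when the text is on disk.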

Page 17: Modern Information Retrieval

processing phrase queries by searching the first words of the phrases

processing proximity queries by searching all the words in the queries

post-processing needed

Page 18: Modern Information Retrieval

Signature files

use a hash function to map words to bit masks of B bits

the text is divided into blocks of b words each

a bit mask of size B is assigned to each block by bitwise ORing the signatures of all the words in the block

Page 19: Modern Information Retrieval

if a word is present in a block, all the bits set in its signature are also set in the bit mask of the block

when a bit is set in the mask of the query word but not in the mask of the block, the word is not present in the block
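A small sketch of how block masks are built and tested. The hash scheme (SHA-256 digest bytes choosing bit positions), the values `NUM_BITS = 64` and `BITS_PER_WORD = 3`, and all function names are assumptions chosen for illustration; the slides do not specify a particular hash function.

```python
import hashlib

NUM_BITS = 64       # B, the mask size (assumed value)
BITS_PER_WORD = 3   # at most this many bits set per word (assumed value)


def word_signature(word):
    """Hash a word to a bit mask of NUM_BITS bits."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        # Two digest bytes select one bit position.
        position = int.from_bytes(digest[2 * k:2 * k + 2], "big") % NUM_BITS
        mask |= 1 << position
    return mask


def block_signatures(words, words_per_block):
    """OR together the signatures of the words in each block of b words."""
    masks = []
    for start in range(0, len(words), words_per_block):
        mask = 0
        for word in words[start:start + words_per_block]:
            mask |= word_signature(word)
        masks.append(mask)
    return masks


def may_contain(block_mask, word):
    """False means the word is definitely absent; True may be a false drop."""
    w = word_signature(word)
    return (w & block_mask) == w


words = "this is a text a text has many words words are made from letters".split()
masks = block_signatures(words, words_per_block=4)
print(may_contain(masks[0], "text"))  # True: "text" is in the first block
```

A `True` answer for a word that is not in the block is possible, so candidate blocks still have to be checked against the text.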

Page 20: Modern Information Retrieval
Page 21: Modern Information Retrieval

false drop: all the corresponding bits are set while the word is not in the block

signature file design principle: make the probability of a false drop low while keeping the signature file as short as possible

searching a single word: hash it to a bit mask W, check for each block mask $B_i$ whether $W \,\&\, B_i = W$, and verify that the word is actually there

Page 22: Modern Information Retrieval

phrase search: bitwise OR the signatures of all the words in the query

the probability of false drops is reduced

care has to be exercised at block boundaries, by overlapping words in consecutive blocks
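The phrase search and the boundary issue can be illustrated as follows. This is a sketch under assumptions: the hash scheme, the mask size, the `overlap` parameter, and all names are hypothetical. With an overlap of j words, any phrase of up to j + 1 words falls entirely inside at least one block.

```python
import hashlib

NUM_BITS = 64  # B, the mask size (assumed value)


def word_signature(word):
    """Hash a word to a bit mask with at most three bits set (assumed scheme)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for k in range(3):
        mask |= 1 << (int.from_bytes(digest[2 * k:2 * k + 2], "big") % NUM_BITS)
    return mask


def signature_of(words):
    """Bitwise OR of the signatures of all the given words."""
    mask = 0
    for word in words:
        mask |= word_signature(word)
    return mask


def make_blocks(words, words_per_block, overlap=0):
    """Split into blocks; consecutive blocks share `overlap` words."""
    step = words_per_block - overlap
    stop = max(len(words) - overlap, 1)
    return [words[s:s + words_per_block] for s in range(0, stop, step)]


def candidate_blocks(blocks, phrase):
    """Blocks whose mask contains every bit of the ORed query signature."""
    query = signature_of(phrase)
    return [i for i, block in enumerate(blocks)
            if (query & signature_of(block)) == query]


def verified_blocks(blocks, phrase):
    """Drop false drops by checking the phrase against the block text."""
    target = [w.lower() for w in phrase]
    n = len(target)
    hits = []
    for i in candidate_blocks(blocks, phrase):
        block = [w.lower() for w in blocks[i]]
        if any(block[k:k + n] == target for k in range(len(block) - n + 1)):
            hits.append(i)
    return hits


words = "this is a text a text has many words words are made from letters".split()
plain = make_blocks(words, words_per_block=4)
overlapped = make_blocks(words, words_per_block=4, overlap=1)
print(verified_blocks(plain, ["many", "words"]))       # []: the phrase straddles two blocks
print(verified_blocks(overlapped, ["many", "words"]))  # [2]
```

Without the overlap the phrase is split across two blocks and is missed; with one shared word it is found in a single block.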