Post on 10-Dec-2015


Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 8: Indexing and Searching Alexander Gelbukh

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 8: Indexing and Searching

Alexander Gelbukh

www.Gelbukh.com

Previous Chapter: Conclusions

Text transformation: meaning instead of strings
o Lexical analysis
o Stopwords
o Stemming
o POS, WSD, syntax, semantics
o Ontologies to collate similar stems
Text compression
o Searchable (compress the query, then search)
o Random access
o Word-based statistical methods (Huffman)
Index compression

Previous Chapter: Research topics

All computational linguistics
o Improved POS tagging
o Improved WSD
Uses of thesaurus
o for user navigation
o for collating similar terms
Better compression methods
o Searchable compression
o Random access

Types of searching

Sequential
o Small texts
o Volatile, or space limited
Indexed
o Semi-static
o Space overhead
First, we discuss indexed searching, then sequential

Inverted files

Vocabulary: about sqrt(n) in size (Heaps’ law); 1 GB of text → 5 MB
Occurrences: n * 40% (stopwords)
o positions (word, char), files, sections...
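
The two parts of an inverted file (vocabulary and occurrence lists) can be sketched as follows. This is a minimal in-memory illustration, not the layout of any particular system; the function name `build_inverted_index` and the regular-expression tokenizer are assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Return {word: [(word_position, char_offset), ...]} for a single text."""
    occurrences = defaultdict(list)
    # Assumed tokenizer: lowercase, maximal runs of word characters.
    for word_position, match in enumerate(re.finditer(r"\w+", text.lower())):
        occurrences[match.group()].append((word_position, match.start()))
    return dict(occurrences)

index = build_inverted_index("to be or not to be")
vocabulary = sorted(index)   # the vocabulary is the set of distinct words
print(vocabulary)
print(index["be"])           # occurrence list: word and character positions
```

Storing file or section identifiers instead of positions only changes what is appended to each list.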

Compression: Block addressing

Block addressing: 5% overhead
o 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
o Equal size (faster search) or logical sections (retrieval units)

Searching in inverted files

Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation with occurrences: ~sqrt(n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging
o One list is shorter (Zipf law)
Only inverted files allow both space and time to be sublinear
o Suffix trees and signature files don’t
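
Why a lexicographically sorted vocabulary supports prefix search, while hashing does not, can be seen in a short sketch. It assumes the vocabulary fits in memory as a sorted Python list; `lookup` and `prefix_range` are hypothetical helper names.

```python
from bisect import bisect_left

def lookup(vocabulary, word):
    """Exact lookup in a sorted vocabulary by binary search: O(log V) comparisons."""
    i = bisect_left(vocabulary, word)
    return i < len(vocabulary) and vocabulary[i] == word

def prefix_range(vocabulary, prefix):
    """Return all words starting with prefix; they are contiguous in sorted order."""
    lo = bisect_left(vocabulary, prefix)      # first word >= prefix
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1                               # linear in the number of matches
    return vocabulary[lo:hi]

vocabulary = sorted(["index", "indexing", "inverted", "search", "searching", "suffix"])
print(lookup(vocabulary, "search"))
print(prefix_range(vocabulary, "index"))
```

A hash table would answer `lookup` in constant time, but words sharing a prefix land in unrelated buckets, so `prefix_range` has no counterpart there.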

Building inverted file: 1

Infinite memory?
Use trie to store vocabulary
o append positions
O(n)
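
A single-pass construction with a trie might look like the sketch below. It assumes everything fits in memory; the class `TrieNode` and the functions `build_trie_index` and `occurrences` are illustrative names, and positions are counted in words.

```python
import re

class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.positions = []   # word positions; non-empty only where a word ends

def build_trie_index(text):
    """Read the text once, walking the trie for each word and appending its position."""
    root = TrieNode()
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        node = root
        for ch in match.group():
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.positions.append(position)
    return root

def occurrences(root, word):
    """Follow the word through the trie and return its list of positions."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.positions

root = build_trie_index("search searching sea search")
print(occurrences(root, "search"))
```

Each character of the text is handled a constant number of times, which is where the O(n) bound comes from.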

Building inverted file: 2

Finite memory?
Fill the memory
Write partial index; n/M pieces
Merge partial indices (hierarchically): n log(n/M)
Insertion: index, merge. n + n' log(n'/M)
Deleting: eliminate every occurrence. n
Very fast creation and maintenance
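
The fill, write, and merge steps can be sketched as below. Two simplifications are assumed: partial indices are kept as in-memory sorted lists rather than written to disk, and `heapq.merge` performs one multiway merge instead of the hierarchical pairwise merging named on the slide. The names `partial_indices`, `merge_runs`, and `memory_limit` are hypothetical.

```python
import heapq
import re

def partial_indices(words, memory_limit):
    """Yield sorted (word, position) runs, each built from at most memory_limit words."""
    run = []
    for position, word in enumerate(words):
        run.append((word, position))
        if len(run) == memory_limit:   # "memory" is full: emit a partial index
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)

def merge_runs(runs):
    """Merge sorted runs into one index mapping each word to its sorted positions."""
    index = {}
    for word, position in heapq.merge(*runs):
        index.setdefault(word, []).append(position)
    return index

words = re.findall(r"\w+", "to be or not to be that is the question".lower())
index = merge_runs(partial_indices(words, memory_limit=4))
print(index["to"], index["be"])
```

Inserting new text follows the same pattern: index the new text on its own, then merge the result with the existing index.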

Suffix trees

Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space
For text retrieval, inverted files are better

Suffix array

All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
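
A suffix array and its binary search can be sketched as follows. The construction here is the naive one (sorting suffix start positions by comparing whole suffixes), chosen for brevity rather than efficiency, and no supra-index is built; `build_suffix_array` and `find_occurrences` are illustrative names.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for the range of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:   # first suffix whose first m characters are >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:   # first suffix whose first m characters are > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "abra"))
```

Each of the O(log n) probes compares up to m characters, so this sketch costs O(m log n) character comparisons per query.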

Searching. Construction

Searching
o Patterns, prefixes, phrases. Not only words
o Suffix tree: O(m), but: space (m = query size)
o Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
o Large text: n^2 log(M)/M, more than for inverted files
o Skip details
Addition: n n' log(M)/M
Deletion: n

Signature files

Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of that word's pattern are set in the block's pattern
Sequential search for blocks
False drops!
o Design of the hash function
o Have to traverse the block
Good to search ANDs or proximity queries
o bit patterns are ORed
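
The mapping of words to bit patterns and the handling of false drops can be sketched as below. The signature width, the number of bits per word, and the use of SHA-256 as the hash are all assumptions made for the illustration; the slides do not prescribe a hash function.

```python
import hashlib
import re

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word (fewer on collisions)

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for byte in digest[:BITS_PER_WORD]:
        signature |= 1 << (byte % SIGNATURE_BITS)
    return signature

def block_signature(block):
    """OR together the signatures of all words in the block."""
    signature = 0
    for word in re.findall(r"\w+", block.lower()):
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, word):
    """Return indices of blocks that really contain word."""
    query = word_signature(word)
    hits = []
    for i, signature in enumerate(signatures):    # sequential scan of signatures
        if signature & query == query:            # candidate; may be a false drop
            if word in re.findall(r"\w+", blocks[i].lower()):   # traverse the block
                hits.append(i)
    return hits

blocks = ["inverted files are fast", "suffix trees need space",
          "signature files use hashing"]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "files"))
```

For an AND query, the word signatures of the query are ORed first and the combined pattern is tested against each block signature in one pass.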

Boolean operations

Merging file (occurrences) lists
o AND: to find repetitions
According to query syntax tree
Complexity linear in intermediate results
o Can be slow if they are huge
There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
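
Both ways of computing an AND over two sorted occurrence lists can be sketched as below: a linear merge, and the optimization of searching each element of the short list in the long one. The function names are illustrative, and the lists are assumed to hold sorted document numbers without duplicates.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """AND of two sorted lists by a linear merge: O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # a repetition: present in both lists
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_search(short_list, long_list):
    """AND by binary-searching each element of the short list in the long one."""
    result = []
    lo = 0
    for doc in short_list:
        lo = bisect_left(long_list, doc, lo)   # never look left of the last hit
        if lo == len(long_list):
            break
        if long_list[lo] == doc:
            result.append(doc)
    return result

rare = [3, 50, 97]
common = list(range(0, 100, 2))
print(intersect_merge(rare, common))
print(intersect_search(rare, common))
```

When one list is much shorter, `intersect_search` costs roughly the short length times the logarithm of the long length, which is far below the cost of the linear merge.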

Sequential search

Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst case, O(n) on average
Knuth-Morris-Pratt: linear worst case, but the same average
Boyer-Moore: n log(m) / m. Not all chars are examined!
o If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once
Shift-Or: uses logical operations on all 32 bits in parallel
BDM: automaton. Complexity same as Boyer-Moore
Combination of BDM with bit parallelism
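
The bit-parallel idea behind Shift-Or can be sketched as below. Python integers have arbitrary length, so the 32-bit word limit mentioned above does not apply to this sketch; in a language with fixed-width words the pattern would have to fit in one machine word. The function name `shift_or_search` is an assumption.

```python
def shift_or_search(text, pattern):
    """Return start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[:i + 1] matches the text ending here.
    state = all_ones
    match_bit = 1 << (m - 1)
    hits = []
    for pos, ch in enumerate(text):
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & match_bit == 0:
            hits.append(pos - m + 1)
    return hits

print(shift_or_search("abracadabra", "abra"))
```

One shift and one OR per text character update all m prefix states at once, which is what makes the method fast for short patterns.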

Approximate string matching

Match with k errors
Levenshtein distance
Dynamic programming: O(mn), O(kn)
Automaton: non-deterministic
o Convert to deterministic: O(n), but huge structure
o Bit-parallel: O(n), the fastest known
Filtering: sublinear!
o k errors cannot alter all k + 1 segments of the pattern
o multipattern exact search; detect suspicious places
o uses approximate algorithm only when needed
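
The O(mn) dynamic programming approach can be sketched as below: one column of the edit-distance table is kept and updated per text character, with the top cell fixed at 0 so that a match may start anywhere. This is the plain version, not the O(kn) refinement or the bit-parallel one; `approximate_search` is an illustrative name, and matches are reported by their end position.

```python
def approximate_search(text, pattern, k):
    """Return end positions (exclusive) in text where pattern matches with <= k errors."""
    m = len(pattern)
    # column[i] = minimum edit distance between pattern[:i] and any substring
    # of the text that ends at the current text position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        previous_diagonal = column[0]   # column[i - 1] at the previous text position
        column[0] = 0                   # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            current = min(previous_diagonal + cost,   # match or substitution
                          column[i] + 1,              # extra character in the text
                          column[i - 1] + 1)          # pattern character missing in the text
            previous_diagonal = column[i]
            column[i] = current
        if column[m] <= k:
            ends.append(pos + 1)
    return ends

print(approximate_search("hello world", "wurld", 1))
```

Filtering methods would call a routine like this only on the text areas where one of the pattern segments was found exactly.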

Regular expressions

Regular expressions
o Automaton: O(m 2^m) + O(n), bad for long patterns
o Bit-parallel (simulates non-deterministic)
Using indices to search for words with errors
o Inverted files: search in vocabulary, then each word
o Suffix trees and suffix arrays: the same algorithms!

Structural queries

Ad-hoc index for structure
Indexing tags as words
o Inverted files are good since they store occurrences in order

Search over compression

Improves both space AND time (fewer disk operations)
Compress query and search
o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
o Search each word in the vocabulary to get its code
o More sophisticated algorithms
Compressed inverted files: less disk, less time
Text and index compression can be combined

...compression

Suffix trees can be compressed almost to size of suffix arrays
Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression
Signature files are sparse, so can be compressed
o ratios up to 70%

Research topics

Perhaps, new details in integration of compression and search
“Linguistic” indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms

Conclusions

Inverted files seem to be the best option
Other structures are good for specific cases
o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching
Compression can be integrated with search

Thank you!

Till compensation lecture?