String algorithms and data structures (or, tips and tricks for index design)
Page 1
String algorithms and data structures
(or, tips and tricks for index design)
Paolo Ferragina, Università di Pisa, Italy
[email protected]
Page 2
An overview
String algorithms and data structures (or, tips and tricks for index design)
Paolo Ferragina
Page 3
Why are string data interesting?
They are ubiquitous:
– Digital libraries and product catalogues
– Electronic white and yellow pages
– Specialized information sources (e.g. genomic or patent dbs)
– Web page repositories
– Private information dbs
– ...
String collections are growing at a staggering rate:
– ...more than 10 Tb of textual data on the web
– ...more than 15 Gb of base pairs in the genomic dbs
Page 4
Some figures
[Chart: Internet hosts (in millions), Jan 95 – Jan 00]
[Chart: Textual data on the Web (in Gb), Mar 95 – Feb 99]
“Surface” Web: about 2,550 Tb; 2.5 billion documents (7.3 million new ones per day)
“Deep” Web: about 7,500 Tb; 4,200 Tb of interesting textual data
Mailing lists: about 675 Tb every year; 30 million msgs per day, within 150,000 mailing lists
Page 5
XML data storage (W3C project since ‘96)
An XML document is a simple piece of text containing some mark-up that is self-describing, follows some ground rules, and is easily readable by humans and computers.
– Tags come in pairs and are possibly nested
– Tag names and their nesting are defined by users
– Data may be irregular, heterogeneous and/or incomplete
– It is text based and platform independent

<?xml version="1.0" ?>
<report_list>
  <weather-report>
    <date> 25/12/2001 </date>
    <time> 09:00 </time>
    <area> Pisa, Italy </area>
    <measurements>
      <skies> sunny </skies>
      <temp scale="C"> 2 </temp>
    </measurements>
  </weather-report>
  …
</report_list>
Page 6
Queries might exploit the tag structure to refine, rank and specialize the retrieval of the answers. For example:
– Proximity may exploit tag nesting:
  <author> John Red </author> <author> Jan Green </author>
– Word disambiguation may exploit tag names:
  <author> Brown … </author> <university> Brown … </university>
  <color> Brown … </color> <horse> Brown … </horse>
Great opportunity for IR…
[Diagram (the new scenario): XML storage, relational data, XSL, HTML for publishing, search]
– XML structure is usually represented as a set of paths (strings?!?)
– XML queries are turned into string queries: /book/author/firstname/paolo
Page 7
The need for an “index”
Brute-force scanning is not a viable approach when we need:
– Fast single searches
– Multiple simple searches for complex queries
In computer science, an index is a persistent data structure that allows one to focus the search for a query string (or a set of them) on a provably small portion of the data collection.
The American Heritage Dictionary defines index as follows:
Anything that serves to guide, point out or otherwise facilitate reference, as:
(a) An alphabetized listing of names, places, and subjects included in a printed work that gives for each item the page on which it may be found;
(b) A series of notches cut into the edges of a book for easy access to chapters or other divisions;
(c) Any table, file or catalogue.
Page 8
What else?
The index is a basic block of any IR system.
An IR system also encompasses:
– IR models
– Ranking algorithms
– Query languages and operations
– User-feedback models and interfaces
– Security and access control management
– ...
We will concentrate only on “index design”!!
Page 9
Goals of the Course
Learn about:
– Model and framework for evaluating string data structures and algorithms on massive data sets
  » External-memory model
  » Evaluate the complexity of construction and query operations
– Practical and theoretical foundations of index design
  » The I/O-subsystem and other memory levels
  » Types of queries and indexed data
  » Space vs. time trade-offs
  » String transactions and index caching
– Engineering and experiments on interesting indexes
  » Inverted lists vs. Suffix arrays, Suffix trees and String B-trees
  » How to choreograph compression and indexing: the new frontier!
Dichotomy between:
• Word-based indexes
• Full-text indexes
MORAL: No clear winner among these data structures!!
Page 10
Model and Framework
String algorithms and data structures (or, tips and tricks for index design)
Paolo Ferragina
Page 11
Why do we care about disks? In the last decade:
– Disk performance: +20% per year
– Memory performance: +40% per year
– Processor performance: +55% per year
The disk is a mechanical device; memory and processor are electronic devices.
Current performance:
Bandwidth
– Disk SCSI: 10–80 Mb/s
– Disk ATA/EIDE: 3–33 Mb/s
– (3–10 Mb/s in practice)
– Rambus memory: 2 Gb/s
Access time
– Disk: 7 millisecs
– Memory: 20–90 nanosecs
– Processor: few GHz
There is a significant GAP between memory and disk performance.
Page 12
The I/O-model [Aggarwal-Vitter ‘88]
[Figure: CPU P and internal memory M, connected to the disk D; data moves in blocks via I/O]
Model parameters:
– K = # strings in the collection
– N = total # of characters in the strings
– B = # chars per disk page
– M = # chars fitting in internal memory
Model refinement: to take care of disk seek and bandwidth, we sometimes distinguish between:
• Bulk I/Os: fetching cM contiguous data
• Random I/Os: any other type of I/O
Algorithmic complexity is therefore evaluated as:
• Number of random and bulk I/Os
• Internal running time (CPU time)
• Number of disk pages occupied by the index or during algorithm execution
Page 13
Two families of indexes
Types of data:
– Linguistic or tokenizable text
– Raw sequences of characters or bytes: DNA sequences, audio-video files, executables
Types of query:
– Word-based queries: exact word, word prefix or suffix, phrase
– Character-based queries: arbitrary substring, complex matches
Two indexing approaches:
• Word-based indexes, where a concept of “word” must be devised!
  » Inverted files, Signature files or Bitmaps.
• Full-text indexes, with no constraint on text and queries!
  » Suffix Array, Suffix tree, Hybrid indexes, or String B-tree.
Page 14
Word-based indexes
String algorithms and data structures (or, tips and tricks for index design)
Paolo Ferragina
Page 15
Inverted files (or lists)
Doc #1: Now is the time for all good men to come to the aid of their country
Doc #2: It was a dark and stormy night in the country manor. The time was past midnight
Query answering is a two-phase process: midnight AND time
Vocabulary and postings:
  Term      #docs  tot freq   Postings (doc, freq)
  a           1      1        (2,1)
  aid         1      1        (1,1)
  all         1      1        (1,1)
  and         1      1        (2,1)
  come        1      1        (1,1)
  country     2      2        (1,1) (2,1)
  dark        1      1        (2,1)
  for         1      1        (1,1)
  good        1      1        (1,1)
  in          1      1        (2,1)
  is          1      1        (1,1)
  it          1      1        (2,1)
  manor       1      1        (2,1)
  men         1      1        (1,1)
  midnight    1      1        (2,1)
  night       1      1        (2,1)
  now         1      1        (1,1)
  of          1      1        (1,1)
  past        1      1        (2,1)
  stormy      1      1        (2,1)
  the         2      4        (1,2) (2,2)
  their       1      1        (1,1)
  time        2      2        (1,1) (2,1)
  to          1      2        (1,2)
  was         1      2        (2,2)
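To make the two-phase process concrete, here is a minimal Python sketch of an inverted file over the slide's two documents; the dict layout and the name `search_and` are illustrative choices, not part of the original material.

```python
# Minimal inverted file: vocabulary lookup (phase 1), then
# intersection of the postings lists (phase 2).
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

index = {}                         # term -> sorted list of doc numbers
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, []).append(doc_id)

def search_and(t1, t2):
    p1, p2 = index.get(t1, []), index.get(t2, [])   # phase 1: vocabulary
    return sorted(set(p1) & set(p2))                # phase 2: postings

print(search_and("midnight", "time"))   # -> [2]
```

Only Doc #2 contains both words, matching the midnight AND time example above.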
Page 16
Some thoughts on the Vocabulary
A concept of “word” must be devised:
– It depends on the underlying application
– Some squeezing: normal form, stop words, stemming, ...
Its size is usually small:
– Heaps’ Law says V = O(N^β), where N is the collection size
– β is practically between 0.4 and 0.6
Implementation:
– Array: simple and space succinct, but slow queries
– Hash table: fast exact searches
– Trie: fast prefix searches, but it is more complicated
– Full-text index ?!? Fast complex searches.
Compression? Yes, a speedup factor of two on scanning!!
– Helps caching and prefetching
– Reduces the amount of processed data
Page 17
Some thoughts on the Postings
Granularity or accuracy in word location:
– Coarse-grained: keep document numbers (space less than 20%, but slow queries: post-filtering)
– Moderate-grained: keep the numbers of the text blocks
– Fine-grained: keep word or sentence numbers (space around 60%, but fast queries and precision)
An orthogonal approach to space saving: Gap coding!!
– Sort the postings for increasing document, block or term number
– Store the differences between adjacent posting values (gaps)
– Use variable-length encodings for gaps: γ-code, Golomb, ...
Continuation bit: given bin(x) = 101001000001, pad it to a multiple of 7 bits (0010100 1000001) and tag the top bit of each byte to mark whether the value continues in the next byte.
It is byte-aligned, tagged, and self-synchronizing, with very fast decoding and small space overhead (~10%).
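The gap-coding recipe above can be sketched as follows: postings are turned into gaps, and each gap is emitted as 7-bit groups whose top (continuation) bit says whether another byte of the same value follows. This is one common tagging convention; the slide's exact bit layout may differ.

```python
# Gap coding with continuation-bit bytes (a varint): each gap is split
# into 7-bit groups, high-order group first; the top bit of a byte is
# 1 iff another byte of the same value follows.

def encode_gaps(postings):
    out, prev = bytearray(), 0
    for p in postings:
        gap, chunks = p - prev, []
        prev = p
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        for i, c in enumerate(reversed(chunks)):   # high-order first
            out.append(c | (0x80 if i < len(chunks) - 1 else 0x00))
    return bytes(out)

def decode_gaps(data):
    postings, prev, value = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:            # stop byte: the gap is complete
            prev += value
            postings.append(prev)
            value = 0
    return postings

plist = [3, 6, 12, 300]
coded = encode_gaps(plist)
print(len(coded), decode_gaps(coded))   # -> 5 [3, 6, 12, 300]
```

The gaps 3, 3 and 6 each fit in one byte, while the gap 288 takes two, which is exactly the space saving the slide aims at.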
Page 18
A generalization: Glimpse [Wu-Manber, 94]
The vocabulary turns complex text searches into exact block searches.
Text collection divided into blocks of fixed size b (fine-grained vs. coarse-grained b):
– A block may span two or more documents
– Postings = block numbers
Two types of space savings:
– Multiple occurrences in a block are represented only once
– The number of blocks may be set to be small
Hence the postings list is small, about 5% of the collection size; under IR laws, space and query time are o(n) for a proper b.
Query answering is a three-phase process:
– Query is matched against the vocabulary: word matches
– Postings lists of searched words are combined: candidate blocks
– Candidate blocks are examined (full scan or succinct index?) to filter out the false matches
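A minimal sketch of Glimpse-style block addressing, assuming a toy block size b = 8 words; the final filtering phase matters most for complex or approximate queries, but it is shown here to exhibit the three-phase structure.

```python
# Block addressing: postings store block numbers, and candidate blocks
# are re-examined to filter out false matches. b is a tuning parameter.
text = ("now is the time for all good men to come to the aid of their "
        "country it was a dark and stormy night in the country manor "
        "the time was past midnight")
words = text.split()
b = 8                                   # words per block (illustrative)
blocks = [words[i:i + b] for i in range(0, len(words), b)]

index = {}       # postings = block numbers, one entry per block
for num, blk in enumerate(blocks):
    for w in set(blk):                  # multiple occurrences: stored once
        index.setdefault(w, []).append(num)

def search_and(t1, t2):
    # Phases 1-2: vocabulary lookup and combination of candidate blocks.
    candidates = set(index.get(t1, [])) & set(index.get(t2, []))
    # Phase 3: examine the candidate blocks themselves.
    return sorted(n for n in candidates
                  if t1 in blocks[n] and t2 in blocks[n])

print(search_and("midnight", "time"))   # -> [3]
```

With 32 words and b = 8 there are only 4 blocks, so every term's postings list shrinks to at most a handful of block numbers.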
Page 19
Other issues and research topics... Index construction:
– Create doc-term pairs <d,t>, sorted by increasing d;
– Mergesort on the second component t;
– Build postings lists from adjacent pairs with equal t;
– In-place block permuting for page-contiguous postings lists.
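The construction recipe can be sketched directly: generate the <d,t> pairs by increasing d, sort (stably) on t, and build each postings list from the run of adjacent pairs with equal t. The names below are illustrative.

```python
# Inverted-index construction via sorting of <d,t> pairs.
docs = {1: "now is the time", 2: "the time was past midnight"}

pairs = []
for d in sorted(docs):                      # pairs created by increasing d
    for t in set(docs[d].split()):
        pairs.append((d, t))

pairs.sort(key=lambda dt: dt[1])            # mergesort on the component t
                                            # (stable: keeps d increasing)
postings = {}
for d, t in pairs:                          # adjacent pairs with equal t
    postings.setdefault(t, []).append(d)

print(postings["time"], postings["midnight"])   # -> [1, 2] [2]
```

Stability of the sort is what makes each postings list come out already ordered by document number, ready for gap coding.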
Document numbering:
– Locality in the postings lists improves their gap-coding
– Passive exploitation: integer coding algorithms
– Active exploitation: reordering of doc numbers [Blelloch et al., 02]
XML “native” indexing:
– Tags and attributes indexed as terms of a proper vocabulary
– Tag nesting coded as a set of nested grid intervals
– Structural queries turned into boolean and geometric queries!
Our project: XCDE Library, compression + indexing for XML!!
Page 20
DBMS and XML (1 of 2)
Main idea:
– Represent the document tree via tuples or a set of objects;
– Select-from-where clauses to navigate into the tree;
– The query engine uses standard joins and scans;
– Some additional indexes for special accesses.
Advantages:
– Standard DB engines can be used without migration;
– OO easily holds a tree structure;
– Query language is well known: SQL or OQL;
– Query optimiser well tuned.
Page 21
DBMS and XML (2 of 2)
General disadvantages:
– Query navigation is costly, simulated via many joins;
– The query optimiser loses knowledge of the XML nature of the document;
– Fields in tables or OO should be small;
– Extra indexes are needed for managing effective path queries.
Disadvantages in the relational case (Oracle 8i/9i):
– Imposes a rigid and regular structure via tables;
– The number of tables is high and much space is wasted;
– Translation methods exist, but they are error-prone and a DTD is needed.
Disadvantages in the OO case (Lore at Stanford University):
– Objects are space expensive, and many OO features go unused;
– Management of large objects is costly, hence search is slow.
Page 22
XML native storage
The literature offers various proposals:
– Xset, Bus: build a DOM tree in main memory at query time;
– XYZ-find: B-tree for storing pairs <path,word>;
– Fabric: Patricia tree for indexing all possible paths;
– Natix: DOM tree partitioned into disk pages (see e.g. Xyleme);
– TReSy: String B-tree, but large space occupancy;
– Some commercial products: Tamino, … (no details!)
Three interesting issues…
1. Space occupancy is usually not evaluated (surely it is ≥ 3)!
2. Data structures and algorithms forget known results!
3. No software in the form of a library for public use!
Page 23
XCDE Library: Requirements
XML documents may be:
– strongly textual (e.g. linguistic texts);
– only well-formed and may occur without a DTD;
– arbitrarily nested and complicated in their tag structure;
– retrievable in their original form (for XSL, browsers,…).
The library should offer:
1. Minimal space occupancy (Doc + Index ~ original doc size);
Space-critical applications: e.g. e-books, tablets, PDAs!
2. State-of-the-art algorithms and data structures;
3. XML native storage for full control of the performance;
4. Flexibility for extensions and software development.
Page 24
XCDE Library: Design Choices
Single document indexing:– Simple software architecture;
– Customizable indexing on each file (they are heterogeneous);
– Ease of management, update and distribution;
– Light internal index or Blocking via XML tagging to speed up query;
Full-control over the document content:– Approximate or Regexp match on text or attribute names and values;
– Partial path queries, e.g. //root_tag//tag1//tag2, with distance;
Well-formed snippet extraction:
– for rendering via XSL, Braille, Voice, OEB e-books, …
Page 25
XCDE Library: The structure
[Architecture diagram: Console and XML Query Optimizer on top; a Query engine (Snippet extractor, Text query solver, Tag-Attribute query solver) over an API; a Data engine (Text engine, Tag engine, Context engine) over the Disk]
Page 26
Full-text indexes
String algorithms and data structures (or, tips and tricks for index design)
Paolo Ferragina
Page 27
The prologue
Their need is pervasive:
– Raw data: DNA sequences, audio-video files, ...
– Linguistic texts: data mining, statistics, ...
– Vocabulary for Inverted Lists
– Xpath queries on XML documents
– Intrusion detection, anti-viruses, ...
Four classes of indexes:
– Suffix array or Suffix tree
– Two-level indexes: Suffix array + in-memory Supra-index
– B-tree based data structures: Prefix B-tree
– String B-tree: B-tree + Patricia trie
Our lecture consists of a tour through these tools !!
Page 28
Basic notation and facts
Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n]
[Figure: P aligned at position i of T, matching a prefix of T[i,n]]
Occurrences of P in T = all suffixes of T having P as a prefix
T = This is a visual example (e.g. “is” occurs at positions 3, 6, 12)
SUF(T) = sorted set of suffixes of T
SUF(Δ) = sorted set of suffixes of all texts in the collection Δ
Page 29
Two key properties [Manber-Myers, 90]
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position is the lexicographic one of P.
P = si, T = mississippi#
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Storing the suffixes explicitly would take Θ(N²) space; the Suffix Array stores one suffix pointer per suffix:
SA = 12 11 8 5 2 1 10 9 7 4 6 3
• SA: array of ints, 4N bytes
• Text T: N bytes
→ 5N bytes of space occupancy
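The slide's suffix array can be reproduced by explicit sorting, which is fine for a sketch (specialized construction algorithms are used at scale). Positions are 1-based here to match the slide.

```python
# Build the suffix array of the example text by sorting its suffixes.
T = "mississippi#"

# SA[j] = 1-based starting position of the j-th suffix in
# lexicographic order ('#' sorts before all letters in ASCII).
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

print(SA)   # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```

Materializing each suffix as a Python string costs Θ(N²) space, mirroring the slide's point about why only the pointers are stored.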
Page 30
Searching in Suffix Array [Manber-Myers, 90]
Indirect binary search on SA: O(p log2 N) time
T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3
P = si: when P is larger than the middle suffix, the search continues in the right half
2 accesses per binary step
Page 31
Searching in Suffix Array [Manber-Myers, 90]
Indirect binary search on SA: O(p log2 N) time
T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3
P = si: when P is smaller than the middle suffix, the search continues in the left half
Page 32
Listing the occurrences [Manber-Myers, 90]
T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, P = si
Brute-force comparison against each candidate suffix: O(p × occ) time
– issippi: P is not a prefix
– sippi: P is a prefix
– sissippi: P is a prefix
→ occ = 2 (positions 4 and 7)
Suffix Array search:
• O(p (log2 N + occ)) time
• O(log2 N + occ) in practice
External memory:
• Simple disk paging for SA
• O((p/B) (log2 N + occ)) I/Os, still far from the goal of O(logB N + occ/B)
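The search can be sketched with two binary searches that delimit the contiguous SA range of suffixes prefixed by P (Prop 1 above), after which the occurrences are read off. Positions are 0-based here, unlike the slide; materializing the prefix keys is only for brevity, since a real implementation compares against the text directly.

```python
# Manber-Myers-style search: binary searches over the (sorted) suffix
# array, comparing P against at most the first |P| chars of a suffix.
from bisect import bisect_left, bisect_right

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])   # 0-based suffix array

def occurrences(P):
    # Truncating sorted suffixes to |P| chars keeps them sorted.
    keys = [T[i:i + len(P)] for i in SA]
    lo, hi = bisect_left(keys, P), bisect_right(keys, P)
    return sorted(SA[lo:hi])

print(occurrences("si"))    # -> [3, 6]
```

The 0-based positions 3 and 6 correspond to the slide's 1-based occurrences 4 and 7.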
Page 33
Output-sensitive retrieval
T = mississippi#, P = si
SA = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0 1 1 4 0 0 1 0 2 1 3
Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Compare against P once to find the first occurrence, then scan Lcp until Lcp[i] < p: here occ = 2
Suffix Array search:
• O((p/B) log2 N + (occ/B)) I/Os
• 9N bytes of space
+ : incremental search
base B : tricky!!
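The Lcp array can be computed by direct comparison of adjacent suffixes (quadratic in the worst case; linear-time constructions such as Kasai et al.'s exist). Once one binary search finds the first occurrence, the others follow by scanning while Lcp[i] ≥ p, with no further comparisons against P.

```python
# Compute Lcp[i] = longest common prefix of the suffixes starting at
# SA[i] and SA[i+1] (0-based positions here).
T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])

def lcp(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

Lcp = [lcp(T[SA[i]:], T[SA[i + 1]:]) for i in range(len(SA) - 1)]
print(Lcp)   # -> [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

For P = si (p = 2), the first match sits just before the entry Lcp = 2, so exactly one more suffix shares a prefix of length ≥ 2: occ = 2, without re-touching P.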
Page 34
Incremental search (case 1)
Incremental search using the LCP array: no rescanning of pattern chars
[Figure: suffix array range [i, j] with midpoint q; Min Lcp[i,q-1] is obtained via range minima and is known inductively; entries below i are < P, entries above j are > P]
The cost: O(1) memory accesses
Page 35
Incremental search (case 2)
Incremental search using the LCP array: no rescanning of pattern chars
[Figure: as before, a suffix array range [i, j] with midpoint q; Min Lcp[i,q-1] comes from range minima and is known inductively]
The cost: O(1) memory accesses
Page 36
Incremental search (case 3)
Incremental search using the LCP array: no rescanning of pattern chars
[Figure: range [i, j] with midpoint q and Min Lcp[i,q-1]; L further chars of P are compared until either suffix char > pattern char or suffix char < pattern char]
The cost: O(L) char comparisons
Suffix Array search:
• O(log2 N) binary steps
• O(p) total char-cmp for routing
• O((p/B) + log2 N + (occ/B)) I/Os overall
Base B: more tricky. Note that SA is static.
Page 37
Hybrid Index
Exploit internal memory: sample the suffix array and copy something into memory
[Figure: SA on disk, sampled every s entries; a prefix of each marked suffix is copied into memory M; the binary search for P then runs inside a single segment]
SA + Supra-index: O((p/B) + log2 (N/s) + (occ/B)) I/Os
The parameter s depends on M and influences both performance and space!!
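A toy sketch of the two-level idea, assuming a sampling step s = 4 and copied prefixes of 3 chars (both illustrative); matches straddling a segment boundary are ignored here, which a real supra-index handles by locating a full range rather than a single segment.

```python
# Two-level index: an in-memory supra-index of sampled suffix
# prefixes narrows the search to one on-disk SA segment of s entries.
T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
s, pref = 4, 3                       # sampling step, copied-prefix length

# Supra-index: (copied prefix of the marked suffix, its position in SA).
supra = [(T[SA[j]:SA[j] + pref], j) for j in range(0, len(SA), s)]

def occurrences(P):
    key = P[:pref]
    start = 0
    for prefix, pos in supra:        # in-memory routing, no disk access
        if prefix[:len(key)] <= key:
            start = pos
    seg = SA[start:start + s]        # only one segment is fetched
    return sorted(i for i in seg if T[i:i + len(P)] == P)

print(occurrences("si"))   # -> [3, 6]
```

Only log2(N/s) of the binary-search steps hit the disk, matching the bound on the slide.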
Page 38
The suffix tree [McCreight, ’76]
It is a compacted trie built on all text suffixes
T = abababbc#
[Figure: suffix tree of T with leaves 1..8; edge labels are pairs of text positions, e.g. (5,8); the search for P = ba is a path traversal]
O(N) space; search takes O(p) time, plus O(occ) time to report the occurrences
What about the suffix tree in external memory?
– Unbalanced tree topology
– Dynamicity
– Large space: ~15N (a CPAT tree takes ~5N on average)
Packing?! Searching may still cost Θ(p) I/Os and Θ(occ) I/Os: no (p/B), possibly no (occ/B); mainly static and space costly
Page 39
The String B-tree (An I/O-efficient full-text index !!)
String algorithms and data structures (or, tips and tricks for index design)
Paolo Ferragina
Page 40
The prologue
We are left with many open issues:
– Suffix Array: dynamicity
– Suffix tree: difficult packing and Θ(p) I/Os
– Hybrid: heuristic tuning of the performance
The B-tree is ubiquitous in large-scale applications:
– Atomic keys: integers, reals, ...
– Prefix B-tree: bounded-length keys (≤ 255 chars)
Suffix trees + B-trees? The String B-tree [Ferragina-Grossi, 95]:
– Indexes unbounded-length keys
– Good worst-case I/O-bounds in search and update
– Guaranteed optimal page-fill ratio
Page 41
Some considerations
Strings have arbitrary length:
– A disk page cannot ensure the storage of Θ(B) strings
– M may be unable to store even one single string
String storage:
– Pointers allow fitting Θ(B) string pointers per disk page
– String comparison needs disk accesses and may be expensive
String-pointer organizations seen so far:
– Suffix array: simple, but static and not optimal
– Patricia trie: sophisticated and much more efficient (optimal?)
Recall the problem: Δ is a text collection
– Search(P[1,p]): retrieve all occurrences of P in Δ’s texts
– Update(T[1,t]): insert or delete a text T from Δ
Page 42
1º step: B-tree on string pointers
T = AATCAGCGAATGCTGCTT CTGTTGATGA (positions 1..30)
[Figure: B-tree over the string pointers; the leaf level stores 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23, the upper levels the separators 29 2 26 13 20 25 6 18 3 14 21 23 and 29 13 20 18 3 23]
P = AT: each node is searched by binary search among its ≤ B strings, O((p/B) log2 B) I/Os, over O(logB N) levels
Search(P):
• O((p/B) log2 N) I/Os
• + O(occ/B) I/Os
It is dynamic!! Inserting a text T[1,t] costs O(t (t/B) log2 N) I/Os
Page 43
2º step: The Patricia trie
[Figure: Patricia trie indexing the strings AGAAGA, AGAAGG, AGAC, GCGCAGA, GCGCAGG, GCGCGGA, GCGCGGGA (stored on disk); internal nodes store branching positions (0, 1, 3, 4, 5, 6), arcs store single branching chars, and leaves store the string pointers 1..7; labels such as (1; 1,3) denote substrings of the indexed strings]
Page 44
2º step: The Patricia trie
Space of PT: O(k) for k strings, not O(N)
Two-phase search, e.g. P = GCACGCAC:
– First phase: trace a downward path using only branching positions and chars, with no string access, reaching P’s candidate position
– Second phase: fetch just one string and compare it with P, in O(p/B) I/Os; the mismatch found gives the max LCP with P and hence P’s position
Just one string is checked!!
Page 45
3º step: B-tree + Patricia trie
T = AATCAGCGAATGCTGCTT CTGTTGATGA (positions 1..30)
[Figure: the B-tree of the 1º step with a Patricia trie PT inside every node; P = AT is routed with O(p/B) I/Os per node, over O(logB N) levels]
Search(P):
• O((p/B) logB N) I/Os
• + O(occ/B) I/Os
Insert(T): O(t (t/B) logB N) I/Os
Page 46
4º step: Incremental Search (first case)
[Figure: the PT at level i computes Max_lcp(i) with P; at level i+1 the two adjacent PTs are searched, and so on down to the leaf level]
Search(P): O(logB N) I/Os just to go down to the leaf level
Page 47
4º step: Incremental Search (second case, inductive step)
[Figure: the PT at level i+1 skips the first Max_lcp(i) chars of P and extends the match to Max_lcp(i+1)]
i-th step: O((lcp i+1 – lcp i)/B + 1) I/Os, with no rescanning of pattern chars
Search(P):
• O(p/B + logB N) I/Os
• + O(occ/B) I/Os
Page 48
In summary
String B-tree performance: [Ferragina-Grossi, 95]
– Search(P) takes O(p/B + logB N + occ/B) I/Os
– Update(T) takes O(t logB N) I/Os
– Space is Θ(N/B) disk pages
Using the String B-tree in internal memory:
– Search(P) takes O(p + log2 N + occ) time
– Update(T) takes O(t log2 N) time
– Space is Θ(N) bytes
It is a sort of dynamic suffix array.
Many other applications:
– String sorting [Arge et al., 97]
– Dictionary matching [Ferragina et al., 97]
– Multi-dim string queries [Jagadish et al., 00]
Page 49
Algorithmic Engineering (Are String B-trees appealing in practice ?)
String algorithms and data structures (or, tips and tricks for index design)
Paolo Ferragina
Page 50
Preliminary considerations
Given a String B-tree node, we define:
– S = set of all strings stored at the node
– b = maximum size of S
An interesting property:
– The height H grows as logb N, and does not depend on the collection’s structure
– b is related to the space occupancy of PT, and b < B
– The larger b is, the faster the search and update operations are
Our Goal: squeeze PT as much as possible
Page 51
PT implementation
A node actually contains (let k = |S|):
– PT = the Patricia trie indexing the k strings of S
– The pointers to the k/2 children of the node (3 or 4 bytes each)
– Some auxiliary and bookkeeping information (negligible)
If the strings are binary then PT consists of:
– k leaves, pointing to S’s strings
– (k-1) internal nodes, each storing an integer value
– (2k-1) arcs, each storing one single char
Implementing PT takes: [Ferragina-Grossi, 96]
– 12k bytes, via a pointer-based solution
– 9k bytes, via a proper encoding of the binary-tree structure
Page 52
Some details and results
Experiments have shown that: [Ferragina-Grossi, 96]
– Search(P):
  – takes about 2H disk accesses (as the worst-case bound);
  – is 10 times faster than Suffix Array search;
  – is comparable to Suffix Tree search.
– Insert(T), via a batched insertion:
  – is 5 times faster than UNIX Prefix B-trees;
  – gives a better page-fill ratio than Suffix trees.
Two limitations:
– Space usage of 9N is too much
– The update ops are CPU-bound
![Page 53: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/53.jpg)
An experiment

(Figure: number of I/Os as a function of the archive size, from 1 to 128 Mb, comparing Suffix Array search against String B-tree search.)
![Page 54: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/54.jpg)
A new proposal

Implementing the node:
– String pointers and child pointers in 4 bytes
– Integers in the nodes of PT stored via Continuation Bit
– Experiments showed that 90% of them are very small: 1 byte

How do we implement PT?! It should be space-succinct and allow the basic navigational ops.

Some results on the succinct coding of binary trees:
– Optimal k + o(k) bits and basic navigational ops [Jacobson, 89]
– 2k + o(k) bits and more navigational ops [Munro et al., 99]

Specialties of our context:
– PT is small, about a thousand strings
– Navigational ops = downward traversal
– CPU time is not the only resource: 1 I/O is surely paid
![Page 55: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/55.jpg)
PT‘s topology may be dropped !! [Ferguson, 92]

Take the in-order visit of PT:
– SP[1,k] = array of pointers to S's strings (i.e. PT leaves)
– Lcp[1,k-1] = array of LCPs between strings adjacent in SP

Example (S's strings on disk, k = 6):
SP  = p1 p2 p3 p4 p5 p6
Lcp =   2  4  5  0  2
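The SP/Lcp pair can be built directly from the node's strings in sorted order; a minimal in-memory sketch (Python, with stand-in integers in place of disk pointers — function names are mine, not from the paper):

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def build_sp_lcp(strings):
    """Given the node's strings in sorted order, return the (simulated)
    pointer array SP and the Lcp array of adjacent pairs."""
    sp = list(range(len(strings)))   # stand-ins for the disk pointers p1..pk
    lcps = [lcp(strings[i], strings[i + 1]) for i in range(len(strings) - 1)]
    return sp, lcps
```

For k strings this stores k pointers and k-1 small integers, which is exactly the 5N-bytes-in-practice layout the following slides aim at.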
![Page 56: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/56.jpg)
PT‘s topology may be dropped !! [Ferguson, 92]

The search for P can be simulated on SP and Lcp alone (example: binary strings, Lcp = 2 4 5 0 2):
– Init: x = 1, i = 1
– Forward scan: if P[ Lcp[i]+1 ] = 1 then i++ and x = i; else "jump" past the block of entries whose Lcp is larger
– Check P[lcp+1]: if 0 go left, else go right, until Lcp[i] ≤ lcp
– x is the candidate position; in the example the scan stops at x = 4 with lcp = 3, and the position is correct
![Page 57: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/57.jpg)
In summary

A node contains (let k = |S|):
– A pointer array SP[1,k]
– An integer array Lcp[1,k-1], stored via Continuation Bit

Searching P's position among S's strings:
– 1 I/O to fetch the disk page containing the node
– 2 array scans: O(p+k) char and integer comparisons
– 1 string access to the candidate string: O(p/B) I/Os

Since k is about a thousand strings:
– The I/O to fetch the disk page takes about 5,000 μs
– The two array scans are very fast: about 200 μs (cache prefetching)
– The string access might deploy an "incremental search"

Same I/O-bounds as before, and about 5N bytes of space in practice
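The scan that finds P's rank among the node's strings, reusing the Lcp array to skip characters already known to match, can be sketched as follows (a simplified in-memory rendition; the real node works on a disk page and accesses only one candidate string):

```python
def lcp(a, b):
    """Longest-common-prefix length of strings a and b."""
    i, n = 0, min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1
    return i

def rank(P, strings, lcps):
    """Number of sorted strings smaller than P (P's insertion position).
    lcps[i] = lcp(strings[i], strings[i+1]); matched chars are never re-read."""
    l = lcp(P, strings[0])
    if l == len(P) or (l < len(strings[0]) and P[l] < strings[0][l]):
        return 0                       # P <= strings[0]
    for i in range(1, len(strings)):
        e = lcps[i - 1]
        if e > l:
            continue                   # same mismatch as before: P > strings[i]
        if e < l:
            return i                   # strings[i] already exceeds P at pos e
        l = e + lcp(P[e:], strings[i][e:])   # extend the match from pos e
        if l == len(P) or (l < len(strings[i]) and P[l] < strings[i][l]):
            return i
    return len(strings)
```

Each character of P and each Lcp entry is examined O(1) times, matching the O(p+k) comparison bound on the slide.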
![Page 58: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/58.jpg)
Research Issues

Provide a public implementation of String B-trees
– Refer to Berkeley-DB for the API

Multi-dimensional substring queries: multi-field record search
– May we plug geometric data structures into String B-trees ?

XPath queries: how to index a labeled tree for path queries ?
– e.g. /doc/author/name/*paolo*

Stream of queries, possibly biased: the String B-tree is not optimal
– May we devise a self-adjusting index ? [Sleator-Tarjan, 85]

Cache-oblivious tries: no explicit parameterization on B
– String B-trees are balanced but B-dependent !
![Page 59: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/59.jpg)
Index Construction
(Building a full-text index is a challenging task !)

String algorithms and data structures
(or, tips and tricks for index design)
Paolo Ferragina
![Page 60: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/60.jpg)
Some considerations

We have already shown that the Suffix Array SA and the corresponding LCP array suffice to build the String B-tree.

How do we build the arrays SA and Lcp ?
– In-memory algorithms are inefficient
– Naming + external sorting is efficient but space consuming [Crauser et al., 00]
– A theoretically optimal algorithm exists, but it is complicated and space costly [Ferragina et al., 98]
– There exists an algorithm which is: [BaezaYates et al., 92]
– Theoretically unacceptable: cubic I/O complexity
– Practically very appealing for performance and space occupancy
– Its asymptotics can be improved with some tricks [Crauser et al., 00]
![Page 61: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/61.jpg)
Suffix Array merge (first step)

Fetch in memory the first piece T[1,L] and build SA and Lcp for the suffixes starting at positions 1..L
– Possibly some extra I/Os are needed (e.g. to compare the 1st and 9th suffix)

Example (T = AATCAGCGAATGCTGCTTCTGTTGATGAT on disk, L = 10):
SA  = 1 9 5 2 10 4 7 8 6 3
Lcp =  3 1 1 2 0 1 0 1 0
Both are written to disk, as SAext and Lcpext.

Induction: given SAext and Lcpext for the suffixes starting inside T[1,iL], we extend them to the suffixes starting in T[iL+1,(i+1)L]
– We aim at executing mainly bulk I/Os
![Page 62: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/62.jpg)
Suffix Array merge (inductive step)

SAext = 1 9 5 2 10 4 7 8 6 3      Lcpext = 3 1 1 2 0 1 0 1 0

Induction: fetch in memory the next piece T[iL+1,(i+1)L] and build its SA and Lcp
– SA = 20 13 16 12 15 18 11 14 17

Scan T[1,iL] on disk and compute an in-memory "counting" array C:
– Search within SA the position of each suffix starting in T[1,iL]
– C evolves, e.g.: 0 0 0 0 0 0 0 0 0 0 → 1 0 0 0 0 0 0 0 0 0 → 2 0 0 0 0 0 1 0 0 0 → ... → 7 0 0 2 0 0 1 0 0 0
– This takes O(iL/B) I/Os [actually bulk I/Os]

How do we merge SAext and SA ?
![Page 63: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/63.jpg)
Suffix Array merge (inductive step)

Merge SAext and SA by using the array C, via a disk scan:
– SAext = 1 9 5 2 10 4 7 8 6 3, SA = 20 13 16 12 15 18 11 14 17, C = 7 0 0 2 0 0 1 0 0 0
– yields SAext = 1 9 5 2 10 4 7 20 13 16 8 6 12 15 18 3 11 14 17

The I/O-complexity of the i-th step is:
– Fetching T[iL+1,(i+1)L] takes O(L/B) I/Os (bulk I/Os)
– Building SA and LCP takes practically no I/Os (or a few random ones)
– Computing C via a scan of T[1,iL] takes O(iL/B) I/Os (bulk I/Os)
– Merging SAext[1,iL] and SA[1,L] via C[1,L+1] takes O(iL/B) I/Os (bulk I/Os)

Overall the algorithm executes O(N²/M²) I/Os in practice, mainly bulk I/Os. In the worst case it is a cubic bound !!
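The blockwise construction above can be simulated in memory; a toy Python sketch (function names are mine — the real algorithm keeps SAext on disk, computes C with string comparisons against the fetched block only, and performs mainly bulk I/Os):

```python
import bisect

def suffix_array_incremental(T, L):
    """Build the suffix array of T by processing blocks of L positions:
    rank the new block's suffixes, count where each old suffix falls
    among them (array C), then merge old and new in one pass."""
    sa_ext = []                                    # suffixes processed so far
    for start in range(0, len(T), L):
        block = range(start, min(start + L, len(T)))
        sa_new = sorted(block, key=lambda i: T[i:])  # SA of the new block
        new_sufs = [T[q:] for q in sa_new]
        # C[j] = number of old suffixes falling just before sa_new[j]
        C = [0] * (len(sa_new) + 1)
        for p in sa_ext:
            C[bisect.bisect_left(new_sufs, T[p:])] += 1
        # merge: old suffixes are already sorted, so consume them in order
        merged, it = [], iter(sa_ext)
        for j, s in enumerate(sa_new):
            for _ in range(C[j]):
                merged.append(next(it))
            merged.append(s)
        merged.extend(it)                          # old suffixes after all new
        sa_ext = merged
    return sa_ext
```

Positions here are 0-based (the slides use 1-based positions).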
![Page 64: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/64.jpg)
String Sorting
(Sorting strings is similar to sorting suffixes ?)

String algorithms and data structures
(or, tips and tricks for index design)
Paolo Ferragina
![Page 65: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/65.jpg)
On the nature of string sorting

In internal memory, we know an optimal bound:
– Via a compacted trie we get Θ(K log₂ K + N) time
– The lower bound comes from the "sorting of K elements"

In external memory, we would expect to achieve Θ((K/B) log_{M/B}(K/B) + N/B) I/Os, but:
• String B-trees allow us to achieve O(K log_B K + N/B) I/Os
• Three-way quicksort gets Θ(K log₂ K + N) I/Os [Bentley-Sedgewick, 97]

The situation is more complicated; the complexity depends on:
– whether "breaking" strings into chars is allowed
– the string size relative to B
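Bentley-Sedgewick's three-way quicksort partitions on one character position at a time, so shared prefixes are never re-compared; a compact (non-in-place, for clarity) Python sketch:

```python
def three_way_string_sort(strs):
    """Multikey (three-way radix) quicksort, after Bentley-Sedgewick.
    At depth d, partition on the d-th character; '' marks end-of-string
    and sorts before every real character."""
    def char(s, d):
        return s[d] if d < len(s) else ''

    def sort(a, d):
        if len(a) <= 1:
            return a
        pivot = char(a[0], d)
        lt = [s for s in a if char(s, d) < pivot]
        eq = [s for s in a if char(s, d) == pivot]
        gt = [s for s in a if char(s, d) > pivot]
        # strings equal at depth d share the prefix: recurse one char deeper
        eq = eq if pivot == '' else sort(eq, d + 1)
        return sort(lt, d) + eq + sort(gt, d)

    return sort(list(strs), 0)
```

The total work is O(K log K) character comparisons plus one visit per distinguishing character, matching the Θ(K log₂ K + N) bound quoted above.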
![Page 66: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/66.jpg)
The scenario

Let us define (K = KS + KL ; N = NS + NL):
– KS and NS count the strings shorter than B
– KL and NL count the strings longer than B

If strings are indivisible everywhere (the bound is optimal):
Θ( (NS/B) log_{M/B}(NS/B) + KL log_{M/B} KL + NL/B )   [short + long]

If strings are only indivisible in external memory:
Θ( min{ KS log_M KS , (NS/B) log_{M/B}(NS/B) } + KL log_{M/B} KL + NL/B )   [short + long]

If strings may be chopped into pieces: O(N/B) I/Os
– It is a randomized algorithm [Ferragina-Thorup, 97]
– The average string length should be Ω((log_{M/B}(N/B))² log₂ K)
![Page 67: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/67.jpg)
The randomized algorithm [Ferragina-Thorup, 97]

Example with L = 2 over the K = 6 strings:
1: ababbccbab  2: bbbccaaabb  3: ababbcaabb  4: aabbccbbaa  5: bbbcccccaa  6: abccaabcab

– Hash every length-L substring (L-str) to a name (here: aa→6, ab→1, bb→4, bc→2, ca→5, cb→3, cc→7), so the strings become the name sequences 11231, 42564, 11264, 64746, 42776, 17621
– Sort only the 2K-2 marked L-strs and assign them ranks (aa→1, ab→2, bb→3, ca→4, cb→5, cc→6; bc is unmarked and gets none)
– Forward scan of the table T of named strings, in the current order (1 3 6 2 4 5): copy the lcp of adjacent rows, but leave the mismatch positions unchanged
![Page 68: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/68.jpg)
The randomized algorithm (contd.)

– Backward scan of the table, again copying the lcp of adjacent rows and leaving the mismatch positions unchanged
– Finally, sort the resulting rank table: since ranks respect the order of the marked L-strs, the obtained order is correct for the original strings

See the survey for the details of the example.
![Page 69: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/69.jpg)
Research issues

Close the various gaps
– Long strings in the case of indivisibility in external memory
– A better analysis for the randomized algorithm

Implement all those algorithms

What about cache-oblivious string sorting algorithms ?
– Most known algorithms are based on tries
– Arbitrary string lengths create a lot of problems
– Probably the randomized approach can help in this case too
![Page 70: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/70.jpg)
Compressed Indexes
(Is space overhead the tax to pay for using a full-text index ?)

String algorithms and data structures
(or, tips and tricks for index design)
Paolo Ferragina
![Page 71: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/71.jpg)
Disks are cheaper and cheaper
![Page 72: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/72.jpg)
Why compressing data ?

Compression has two positive effects:
– Space saving
– Performance improvement:
– Better use of the memory levels close to the processor
– Increased disk and memory bandwidth
– Reduced (mechanical) seek time
– CPU speed makes (de)compression "costless" !!

Knuth, in the 3rd volume, says: "Space optimization is closely related to time optimization in a disk memory system"

Well established: "It is more economical to store data in compressed form than uncompressed"

In March 2001 IBM released the Memory eXpansion Technology (MXT), plugged into the eServers x330: double memory at about the same cost and performance.
![Page 73: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/73.jpg)
The scenario

Classical full-text indexes use Θ(N log₂ N) bits of storage, with large constants (from 5 to 25):
– Suffix array: O(p + log₂ N + occ) time
– String B-tree: O(p/B + log_B N + occ/B) I/Os

Succinct suffix trees use N log₂ N + Θ(N) bits of storage [Munro et al., 97...]

The suffix permutation cannot be an arbitrary permutation of {1, 2, ..., N}:
– # binary texts = 2^N « N! = # permutations on {1, 2, ..., N}

The Compact suffix array uses Θ(N) bits of storage [Grossi-Vitter, 00]
– Query time is O(p/log₂ N + occ · log₂^ε N)
– Is Θ(N) bits really needed ?!

May we achieve o(N) bits on compressible texts, as in the case of word-based indexes ?
![Page 74: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/74.jpg)
The problem

Input:
– A constant-sized alphabet Σ
– An arbitrarily long text T[1,N] over Σ

Query on an arbitrary string P[1,p]:
– Count the occurrences of P in T
– Locate the positions of the occurrences of P in T

Example: T = ... +39.050.521232, +39.050.521304, +39.06.5421245, +39.02.342109, +39.012.256312, +39.050.2212764, ...
– count the calls from Rome (+39.06.*)
– locate who called from the CS dept in Pisa (+39.050.22127*)

Aim at exploiting the repetitiveness of the input to squeeze the index !!
Does there exist an "opportunistic index" ?
![Page 75: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/75.jpg)
The FM-index [Ferragina-Manzini, 00]

Bridging data-structure design and compression techniques:
– Suffix array data structure
– Burrows-Wheeler Transform (the engine of the bzip2 compression algorithm, 1994)

The theoretical result:
– Query complexity: O(p + occ · log N) time
– Space occupancy: O(N Hk(T)) + o(N) bits, where Hk(T) is the k-th order empirical entropy; it may be o(1), so the space is o(N) bits if T is compressible

The nice stuff is that this result:
– is independent of the input source, i.e. it holds pointwise on T
– implicitly shows that Suffix Arrays are "compressible"

In practice, the FM-index is much appealing:
– Space close to the best known compressors
– Query time of a few millisecs on hundreds of MBs of text
![Page 76: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/76.jpg)
The BW-Transform

Let us be given the text T = mississippi#. Form all of T's rotations, then sort the rows:

mississippi#            #mississippi
ississippi#m            i#mississipp
ssissippi#mi            ippi#mississ
sissippi#mis            issippi#miss
issippi#miss    sort    ississippi#m
ssippi#missi    the     mississippi#
sippi#missis    rows    pi#mississip
ippi#mississ            ppi#mississi
ppi#mississi            sippi#missis
pi#mississip            sissippi#mis
i#mississipp            ssippi#missi
#mississippi            ssissippi#mi

F = first column, L = last column of the sorted matrix. Every column is a permutation of T; hence, so are F and L.
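The transform can be computed exactly as the slide describes; a direct (quadratic-space, for illustration only) Python sketch:

```python
def bwt(T):
    """Burrows-Wheeler Transform: sort all rotations of T and take the
    last column L. T must end with a unique sentinel, here '#',
    which sorts before every other character."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(rot[-1] for rot in rotations)
```

Production implementations compute L through the suffix array instead of materializing the N rotations.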
![Page 77: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/77.jpg)
BWT is invertible

Take the F and L columns of the sorted-rotation matrix. Two facts make L sufficient to rebuild T:

1. L's chars precede F's chars in T: in every row, L[i] is the character that precedes, in T, the rotation starting at F[i].

2. How do we map L's chars onto F's chars ? We need to distinguish equal chars in F. Take two equal chars of L and rotate their rows: the rows then start with that char, and their relative order is the same in F and in L. Hence, the i-th "c" in L is the i-th "c" in F.

Reconstructing T backward, we stop as soon as we meet "i#" !!
![Page 78: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/78.jpg)
BWT is invertible (contd.)

Reconstruct T backward, using the two properties:
– L's chars precede F's chars in T
– the i-th "c" in L is the i-th "c" in F

Start from the row beginning with # (T = ...#): its L-char is the last char of T before #. Map it onto its copy in F, read that row's L-char, and so on: i, p, p, i, ... The whole of T is rebuilt in O(N) time.
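The backward reconstruction via the two properties (the LF-mapping) can be sketched as follows; here the walk starts at the row whose L-char is '#', i.e. the rotation equal to T itself:

```python
def inverse_bwt(L, sentinel='#'):
    """Invert the BWT in O(N) table lookups: the i-th occurrence of a
    character c in L corresponds to the i-th occurrence of c in F."""
    F = sorted(L)
    first, seen, LF = {}, {}, []
    for i, c in enumerate(F):
        first.setdefault(c, i)          # first row of F starting with c
    for c in L:                          # LF[i]: row of F matching L[i]
        LF.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    row, out = L.index(sentinel), []     # row of the rotation equal to T
    for _ in range(len(L)):
        out.append(L[row])               # conceptually: prepend to T
        row = LF[row]
    return "".join(reversed(out))
```

Each step prepends one character of T, so N steps recover the whole text.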
![Page 79: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/79.jpg)
L is highly compressible

Two observations on the sorted-rotation matrix:
– Equal substrings of T prefix adjacent rows
– Hence close chars in L are "similar": locality !!

Algorithm Bzip:
– Move-to-Front coding of L → L'
– Run-Length coding of L' → L''
– Statistical coding of L'': Arithmetic

Bzip compresses much better than Gzip, but it is slower in (de)compression !!
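The Move-to-Front step exploits exactly this locality: runs of equal chars in L become runs of zeroes, which Run-Length coding then squeezes. A minimal sketch:

```python
def mtf_encode(L, alphabet):
    """Move-to-Front: replace each char by its position in a list that
    moves the just-used char to the front, so locally repeated chars
    map to small integers (runs of equal chars map to zeroes)."""
    table = list(alphabet)        # initial table: the sorted alphabet
    out = []
    for c in L:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))   # move c to the front
    return out
```

On L = ipssm#pissii the output is dominated by small values and zero runs, which is what the subsequent RLE + Arithmetic stages feed on.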
![Page 80: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/80.jpg)
Suffix Array vs. BW-transform

For T = mississippi#, the rows of the sorted-rotation matrix are in the same order as the suffixes of T:

SA = 12 11 8 5 2 1 10 9 7 4 6 3
L  =  i  p s s i m  p i s s i i

Hence L[i] is the character just before the i-th smallest suffix: L[i] = T[ SA[i]-1 ] (wrapping around at position 1).

Full-text searches within the string L ?
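The identity between the two objects can be checked directly (0-based Python sketch; the slides use 1-based positions):

```python
def bwt_from_sa(T):
    """Read the BWT off the suffix array: L[i] = T[SA[i]-1], the char
    preceding the i-th smallest suffix (index -1 wraps to the last char)."""
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    return "".join(T[i - 1] for i in SA)   # Python's T[-1] gives the wrap
```

This is also how real implementations compute L: build SA, then one scan.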
![Page 81: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/81.jpg)
Full-text search in L

All the information needed is L plus a small array C, where C[c] = number of text chars smaller than c. For T = mississippi# (L = ipssm#pissii):
C[#] = 0   C[i] = 1   C[m] = 6   C[p] = 7   C[s] = 9

Search P = si backward, keeping the range [sp,ep] of rows prefixed by the current pattern suffix:
– First step: for c = P[p], the rows starting with c are delimited directly by C
– Inductive step: given sp,ep for P[i+1,p], take c = P[i]; find the first c in L[sp,...] and the last c in L[...,ep]; the L-to-F mapping of these two chars gives the new sp,ep
– At the end, occ = ep-sp+1 (= 2 for P = si)
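The backward search can be sketched in a few lines (Python; the rank query occ(c,i) is implemented here by naively scanning L, whereas the FM-index answers it in constant time with succinct structures; the range is half-open [sp,ep)):

```python
def fm_count(L, P):
    """Count the occurrences of P in the text whose BWT is L,
    via backward search over the C array and rank queries on L."""
    F = sorted(L)
    C = {}                               # C[c] = # text chars smaller than c
    for i, c in enumerate(F):
        C.setdefault(c, i)

    def occ(c, i):                       # rank: number of c's in L[0:i]
        return L[:i].count(c)

    sp, ep = 0, len(L)                   # all rows match the empty suffix
    for c in reversed(P):                # extend the match one char leftward
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)
        ep = C[c] + occ(c, ep)
        if sp >= ep:
            return 0
    return ep - sp
```

Counting thus takes p rank queries and never touches the text itself.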
![Page 82: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/82.jpg)
Locate the occurrences

Counting returns [sp,ep] but not the text positions. Solution: store the text position of one row out of Θ(log N) (in the example, the sampling step is 4: positions 1, 4, 8, 12 of T = mississippi#):
– If a row in [sp,ep] is sampled, its occurrence is listed immediately
– Otherwise, go backward via the L-to-F mapping until a sampled row is met: if the sampled row has text position 4 and we moved 3 steps, the occurrence is at 4 + 3 = 7, ok !!

In theory, the sampling step is set to Θ(log N) to balance space and listing time.
![Page 83: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/83.jpg)
The FM-index in practice

We developed two tools, both encapsulating a compressed copy of the text:
– Tiny index: supports just the counting of the occurrences
– Fat index: supports both count and locate

Collection AP-news, 64Mb:                 Space    Query time
– Tiny index (counting only)               22 %        2 ms
– Fat index (locating one occurrence)      35 %        5 ms
– Grep on gzipped files (zgrep)            37 %    6,000 ms

The Tiny index acts as a lossless fingerprint: existential and counting queries are fast.
![Page 84: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/84.jpg)
Word-based compressed index

T = ...bzip...bzip2unbzip2unbzip...
P = bzip may occur as a word prefix, substring, suffix, or as a full word.

What about the word-based occurrences of P ?
– Search for P as a substring of T, using the FM-index
– For every candidate occurrence, check if it is a word-based one
– ...but this post-processing phase can be very costly.

The FM-index can be adapted to be word-based:
– Preprocess T to form a "digested" text DT
– Build an FM-index over DT
– Transform any word-based query on T into a substring query on DT, and solve it using the FM-index built on DT
![Page 85: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/85.jpg)
The WFM-index

The text is digested via a variant of the Huffman algorithm:
– The symbols of the Huffman tree are the words of T
– The Huffman tree has fan-out 128, so every tree arc is encoded in 7 bits
– Codewords are byte-aligned and tagged: one bit per byte marks whether it is the initial byte of a codeword (e.g. 1 0 0 for a 3-byte codeword), the remaining 7 bits encode the arc

The WFM-index then consists of:
1. The dictionary of words
2. The Huffman tree
3. The FM-index built on the digested text DT

Example: for T = "bzip or not bzip", each word ([bzip], [or], [not], [ ]) gets a codeword; P = bzip is translated into its codeword and searched in DT.

In practice: space ~22 %, word search ~4 ms.
![Page 86: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/86.jpg)
Research issues

Achieve O(occ) time in occurrence retrieval
– O(N Hk(T) · log^ε N) + o(N) bits [Ferragina-Manzini, 01]

Achieve O(occ/B) I/Os in occurrence retrieval
– Known compressed indexes perform random accesses

Fast construction algorithms for Suffix Arrays
– Bzip compression and FM-index construction
– Suffix Tree construction
– Clustering of documents []

Implement the IR-tool: WFM-index + Glimpse
– This improves theoretically on Inverted Lists
![Page 87: String algorithms and data structures (or, tips and tricks for index design)](https://reader036.vdocuments.mx/reader036/viewer/2022070404/56813bb1550346895da4e64a/html5/thumbnails/87.jpg)
The end

"In a few years, we will be able to store everything" [Gray, 99]

Plato (in Phaedrus) suggested that writing would create "forgetfulness in the minds of those who learn to use it" and "the show of wisdom without the reality".

I hope that this will not occur, again !!