search algorithms winter semester 2004/2005 15 nov 2004 5th lecture

HEINZ NIXDORF INSTITUTE
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer

Search Algorithms
Winter Semester 2004/2005
15 Nov 2004
5th Lecture

Christian Schindelhauer
schindel@upb.de


Chapter II
Searching in Compressed Text
15 Nov 2004

Searching in Compressed Text (Overview)

What is Text Compression
– Definition
– The Shannon Bound
– Huffman Codes
– The Kolmogorov Measure

Searching in Non-adaptive Codes
– KMP in Huffman Codes

Searching in Adaptive Codes
– The Lempel-Ziv Codes
– Pattern Matching in Z-Compressed Files
– Adapting Compression for Searching

Ziv-Lempel-Welch (LZW) Codes

From the Ziv-Lempel family
– LZ77, LZSS, LZ78, LZW, LZMW, LZAP

Literature
– LZW: Terry A. Welch: "A Technique for High Performance Data Compression", IEEE Computer vol. 17 no. 6, June 1984, p. 8-19
– LZ77: J. Ziv, A. Lempel: "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, 1977, p. 337-343
– LZ78: J. Ziv, A. Lempel: "Compression of Individual Sequences Via Variable-Rate Coding", IEEE Transactions on Information Theory, 1978, p. 530-536

LZW is known as the Unix command “compress”

Uses:
– TRIES

Trie = “reTRIEval TREE”

The name is taken from “reTRIEval” tree
– for storing/encoding text
– efficient search for equal prefixes

Structure
– Edges are labelled with letters
– Nodes are numbered

Mapping
– Every node encodes a word of the text
– The text of a node can be read on the path from the root to the node
  • Node 1 = “m”
  • Node 6 = “at”
– Inverse direction: every word uniquely points at a node (or at least some prefix points to a leaf)
  • “it” = node 11
  • “manaman” points with “m” to node 1

Encoding of “manamanatapitipitipi”
– 1,2,3,4,5,6,7,8,9,10,11,12 or
– 1,5,4,5,6,7,11,10,11,10,8

Decoding of 5,11,2
– “an”, “it”, “a” = anita
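The numbered trie can be sketched in a few lines of Python. This is only an illustration under an assumption that the slide does not state: nodes are numbered in order of creation while the example text is cut greedily into new words (node 0 is the root). With that assumption the node numbers quoted above are reproduced. The function names are hypothetical.

```python
def build_numbered_trie(text):
    """Cut text into new words; return (children, words) of the resulting trie.

    Assumption: node 0 is the root, new nodes get the numbers 1, 2, ...
    in order of creation.
    """
    children = {0: {}}      # node -> {edge letter: child node}
    words = {0: ""}         # node -> word read on the path from the root
    u = 0
    for letter in text:
        if letter in children[u]:
            u = children[u][letter]          # follow an existing edge
        else:
            v = len(words)                   # next free node number
            children[u][letter] = v
            children[v] = {}
            words[v] = words[u] + letter
            u = 0                            # restart at the root
    return children, words


def decode(node_numbers, words):
    """Concatenate the words of the given nodes."""
    return "".join(words[v] for v in node_numbers)


def longest_prefix_node(word, children):
    """Follow word from the root as long as edges exist; return the node reached."""
    u = 0
    for letter in word:
        if letter not in children[u]:
            break
        u = children[u][letter]
    return u


children, words = build_numbered_trie("manamanatapitipitipi")
print(words[1], words[6], words[11])              # words of nodes 1, 6 and 11
print(decode([5, 11, 2], words))                  # decoding of 5,11,2
print(longest_prefix_node("manaman", children))   # node that "manaman" points at
```

Both encodings listed above decode to the same text because a word may be written with any sequence of nodes whose words concatenate to it.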

How LZW builds a TRIE

LZW
– works bytewise
– starts with the 256-leaf trie with leaves “a”, “b”, ..., numbered with “a”, “b”, ...

LZW-Trie-Builder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      m ← m+1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i+1
  od

Example: nanananananana
(Figure: the start trie with leaves “a”, ..., “n”, ..., “z”; scanned so far: “na”.)


(Figure: the text nanananananana during the scan. After “na” the builder continues with “nan”; the rest of the text is the residual part.)

How LZW produces the encoding

LZW-Encoder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      output (m,u,T[i])
      m ← m+1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i+1
  od
  if u ≠ root(TRIE) then
    output (u)
  fi

The output m is predictable: 256, 257, 258, ...
– Therefore use only output (u,T[i])

start-TRIE = 256-leaf trie with bytes encoded as 0, 1, 2, ..., 255
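The encoder with the shortened output (u,T[i]) can be sketched in Python as follows. This is a sketch, not the lecture's implementation. Assumptions made here: nodes 0..255 of the start trie stand for the characters chr(0)..chr(255), the first new node gets the number 256, the root has the hypothetical id -1, and a trailing phrase that creates no new node is emitted as (u, None).

```python
def lzw_encode(text):
    """Encode text as a list of (node, letter) pairs, following the encoder above."""
    root = -1                                            # hypothetical id of the root
    children = {(root, chr(i)): i for i in range(256)}   # (node, letter) -> child
    m = 255                                              # highest node number so far
    u = root
    code = []
    for letter in text:
        if (u, letter) in children:
            u = children[(u, letter)]        # extend the current phrase
        else:
            code.append((u, letter))         # output (u, T[i])
            m += 1
            children[(u, letter)] = m        # append leaf m below u
            u = root
    if u != root:
        code.append((u, None))               # leftover phrase without a new letter
    return code


for u, letter in lzw_encode("nanananananana"):
    print(u, letter)
```

Each dictionary lookup takes expected constant time, so the loop performs a constant amount of work per input character.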

An Example Encoding

LZW-Encoder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      output (u,T[i])
      m ← m+1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i+1
  od
  if u ≠ root(TRIE) then
    output (u)
  fi

Encoding of m a n a m a n a t a p i t i p i t i p i

Output     New node   Encoded text
(m,a)      256        ma
(n,a)      257        na
(256,n)    258        (ma)n
(a,t)      259        at
(a,p)      260        ap
(i,t)      261        it
(i,p)      262        ip
(261,i)    263        (it)i
(p,i)      264        pi

The Decoder

LZW-Decoder(Code)
  i ← 1
  TRIE ← start-TRIE
  m ← 255
  for i ← 0 to 255 do C(i) = “i” od
  while not end of file do
    (u,c) ← read-next-two-symbols(Code)
    if c exists then
      output (C(u), c)
      m ← m+1
      append leaf m to u with edge label c
      C(m) ← (C(u),c)
    else
      output (C(u))
    fi
  od

If the last string of the code did not produce a new node in the trie, then output the corresponding string.
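A matching decoder can be sketched in Python. It assumes the code is given as a list of (node, letter) pairs, that nodes 0..255 stand for chr(0)..chr(255), and that a final pair (node, None) marks a last phrase that created no new node. These conventions are chosen for this sketch only.

```python
def lzw_decode(code):
    """Decode a list of (node, letter) pairs back into the original text."""
    strings = {i: chr(i) for i in range(256)}   # C(i): text of node i
    m = 255                                     # highest node number so far
    out = []
    for u, letter in code:
        if letter is not None:
            phrase = strings[u] + letter
            m += 1
            strings[m] = phrase                 # C(m) <- C(u) followed by c
            out.append(phrase)
        else:
            out.append(strings[u])              # last phrase, no new node
    return "".join(out)


code = [(ord("m"), "a"), (ord("n"), "a"), (256, "n"), (ord("a"), "t"),
        (ord("a"), "p"), (ord("i"), "t"), (ord("i"), "p"), (261, "i"),
        (ord("p"), "i")]
print(lzw_decode(code))
```

For brevity the sketch stores the full text C(v) of every node. A decoder that must keep memory linear in the code size would store only the parent and the last letter of each node and rebuild the text by walking up to the root.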


Performance of LZW

Encoding can be performed in time O(n)
– where n is the length of the given text

Decoding can be performed in time O(n)
– where n is the length of the uncompressed output

The memory consumption is linear in the size of the compressed code

LZW can be nicely implemented in hardware

There is no software patent
– so it is very popular, see “compress” for UNIX

LZW can be further compressed using Huffman codes
– Every second character is a plain copy from the text!

Search in LZW is difficult
– The encoding is embedded in the text (adaptive encoding)
– For one search in a text there is a linear number of possible encodings of the search pattern (!)

The Algorithm of Amir, Benson & Farach: “Let Sleeping Files Lie”

Ideas
– Build the trie, but do not decode
– Use the KMP-Matcher with the nodes of the LZW-Trie
– Prepare a data structure based on the pattern (of length m)
– Then scan the text and update this data structure

Goal: Running time of O(n + f(m))
– where n is the code length
– f(m) is some small polynomial depending on the pattern length m
– for well compressed codes and f(m) < n it should be faster than decoding and then running a text search

Searching in LZW-Codes: Inside a node

Example: Search for tapioca

abtapiocaab blahblahabarb
– tapioca is “inside” a node
– Then we have found tapioca

For all nodes u of a trie:
– Set Is_inside[u] = 1 if the text of u contains the pattern

Searching in LZW-Codes: Torn apart

Example: Search for tapioca

carasiabrastap io
– Starting somewhere in a node
– Parts are hidden in some other nodes
– The end is the start of another node
– All parts are nodes of the LZW-Trie

Finding the start: longest_prefix
The Suffix of Nodes = Prefix of Patterns

Is the suffix of the node a prefix of the pattern?
– And if yes, how long is it?
– Classify all nodes of the trie

For very long text encoded by a node only the last m letters matter

Can be computed using the KMP-Matcher algorithm while building the trie

Example:
– Pattern: “manamana”
– pamana: the last four letters are the first four of the pattern
– (figure label) length of the suffix of the node which is a prefix of the pattern is 2
– papa: result 0
– mana: result 4
– amanaplanacanalpamana: result 4
– amanaplanacanalpamana + m = amanaplanacanalpamanam
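The quantity longest_prefix can be sketched with the KMP prefix function. The sketch below scans the whole node text for clarity, whereas the slide computes the value incrementally from the parent node while the trie is built. The function names are hypothetical and the pattern is assumed to be non-empty.

```python
def failure_function(pattern):
    """KMP prefix function: pi[q] = length of the longest proper border of pattern[:q]."""
    pi = [0] * (len(pattern) + 1)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q + 1] = k
    return pi


def longest_prefix(node_text, pattern):
    """Length of the longest suffix of node_text that is a prefix of pattern."""
    pi = failure_function(pattern)
    q = 0                                   # length of the currently matched prefix
    for letter in node_text:
        if q == len(pattern):
            q = pi[q]                       # full match: fall back before extending
        while q > 0 and pattern[q] != letter:
            q = pi[q]
        if pattern[q] == letter:
            q += 1
    return q


for text in ["pamana", "papa", "mana", "amanaplanacanalpamana"]:
    print(text, longest_prefix(text, "manamana"))
```

Appending one letter to a node only needs the last state q of its parent, which is why the value can be maintained in amortized constant time per new node.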

Is the node inside of a Pattern?

Find positions where the text of the node is inside the pattern

Several occurrences are possible
– e.g. one letter
– There are at most m(m-1)/2 encodings of such sub-strings
– For every sub-string there is exactly one node that fits

Define table Inside-Node of size O(m^2)
– Inside-Node[start,end] := node that encodes pattern P[start]..P[end]

From Inside-Node[start,end] one can derive Inside-Node[start,end+1] as soon as the corresponding node is created

To quickly find all occurrences use a pointer
– Next-inside-occurrence(start,end) indicates the next position where the substring lies
– It is initialized for start = end with the next occurrence of the letter

Examples:
– (figure) a node text that could be in positions 2-4 or positions 6-8 of the pattern
– anam: result (2,5)
– rorororororo: result (0,0), is not in the pattern

Finding the End: longest_suffix
Prefix of the Node = Suffix of the Pattern

Is the prefix of the node a suffix of the pattern?
– And if yes, does it complete the pattern, if already i letters were found?
– Classify all nodes of the trie

For very long text encoded by a node only the first m letters matter

Since the text is added at the right side this property can be derived from the ancestor

Examples:
– ananimal: here 3 and 1 could be the solution. We take 3, because 1 can be derived from 3 using the technique shown in the KMP-Matcher (applied to the reversed string)
– manamanamana: result 8
– panamacanal: result 0
– manammanaaaaaaaaaaaa + m = manammanaaaaaaaaaaaam
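A direct, non-incremental way to state longest_suffix in Python is shown below. It assumes the pattern “manamana” for the examples above and simply tries all prefix lengths, which is slower than deriving the value from the ancestor as the slide suggests. The function name follows the slide, the implementation is only a sketch.

```python
def longest_suffix(node_text, pattern):
    """Length of the longest prefix of node_text that is a suffix of pattern."""
    for k in range(min(len(node_text), len(pattern)), 0, -1):
        if pattern.endswith(node_text[:k]):
            return k                  # longest candidate first, so this is maximal
    return 0


for text in ["ananimal", "manamanamana", "panamacanal"]:
    print(text, longest_suffix(text, "manamana"))
```

Only the first m letters of the node text are ever inspected, which matches the remark that for very long node texts only the first m letters matter.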

How does it fit?

On the left side we have the maximum prefix of the pattern

On the right side we have the maximum suffix of the pattern

Example: 10 letter pattern pamapamapa
– Left node: panapamapama (8 letter prefix found)
– Right node: pamapana (6 letter suffix found)
– 14 letters?
– Yet the pattern is inside, since the first 8 letters followed by the last 6 letters of the pattern contain the pattern

Solution: Define the prefix-suffix table
– PS-T(p,s) = 1 if the p-letter prefix of P followed by the s-letter suffix of P contains the pattern

Computing the PS-Table in time O(m^3)

For all p and s such that p+s ≥ m compute PS-T[p,s]

Run the KMP-Matcher for pattern P in P[1..p]P[m-s+1..m]
– needs time O(m) for each combination of p and s

Leads to a run time of O(m^3)

Example: 10 letter pattern pamapamapa
– Nodes: xyzpamapama and pamapaxyz
– If the pattern pamapamapa is found in the text pamapamapamapa, then PS-T[8,6] = 1
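The cubic-time construction can be written down almost literally. In this sketch Python's substring test `in` stands in for the KMP-Matcher, and the table is a list of lists indexed from 0 to m. Both choices are assumptions of this illustration.

```python
def ps_table_naive(pattern):
    """table[p][s] is True if pattern occurs in (p-letter prefix) + (s-letter suffix)."""
    m = len(pattern)
    table = [[False] * (m + 1) for _ in range(m + 1)]
    for p in range(m + 1):
        for s in range(m + 1):
            if p + s >= m:                            # shorter strings cannot contain P
                joined = pattern[:p] + pattern[m - s:]
                table[p][s] = pattern in joined       # stand-in for the KMP-Matcher
    return table


table = ps_table_naive("pamapamapa")
print(table[8][6])   # 8-letter prefix followed by 6-letter suffix
```

There are O(m^2) pairs (p,s) and each test scans a string of length at most 2m, which gives the O(m^3) bound.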

Computing the Prefix-Suffix-Table in Time O(m^2) - Preparation

ptr[i,j] = next left position from i where the suffix of P of length j occurs
         = max{k < i | P[m-j+1..m] = P[k..k+j-1] or k = 0}

(Figure: pattern p a m a p a m a p a)

Computing the Prefix-Suffix-Table in time O(m^2): Initialization

Init-ptr(P)
  m ← length(P)
  for i ← 1 to m do ptr[i,0] ← i-1 od
  for j ← 1 to m-1 do
    last ← m-j+1
    i ← ptr[last+1,j-1]-1
    while i > 0 do
      if P[i] = P[last] then
        ptr[last,j] ← i
        last ← i
      fi
      i ← ptr[i+1,j-1]-1
    od
  od

(Figure: pattern p a m a p a m a p a)

Run time: O(m^2)

Computing the Prefix-Suffix-Table in time O(m^2)

Init-PS-T(P)
  m ← length(P)
  ptr ← Init-ptr(P)
  for i ← 1 to m-1 do
    j ← i+1
    while j ≠ 0 do
      PS-T[i,m-j+1] = 1
      j ← ptr[j,m-i]
    od
  od

(Figure: pattern p a m a p a m a p a with the pointers ptr[9,2] and ptr[5,2])
– PS-T[8,2] = 1
– PS-T[8,6] = 1
– PS-T[8,10] = 1
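The pointer-following loop of Init-PS-T can be tried out with a small Python sketch. To keep it short, ptr is evaluated directly from its definition instead of being filled by Init-ptr, so this version does not reach the O(m^2) bound. Positions are 1-based as on the slides, and the function names are hypothetical.

```python
def ptr(pattern, i, j):
    """Largest 1-based k < i with P[k..k+j-1] equal to the j-letter suffix of P, else 0."""
    m = len(pattern)
    suffix = pattern[m - j:]
    for k in range(i - 1, 0, -1):
        if pattern[k - 1:k - 1 + j] == suffix:
            return k
    return 0


def init_ps_table(pattern):
    """Return the set of pairs (p, s) that the Init-PS-T loop marks with 1."""
    m = len(pattern)
    marked = set()
    for i in range(1, m):
        j = i + 1                        # the (m-i)-letter suffix itself starts here
        while j != 0:
            marked.add((i, m - j + 1))
            j = ptr(pattern, j, m - i)   # next occurrence further to the left
    return marked


marked = init_ps_table("pamapamapa")
print(sorted(s for p, s in marked if p == 8))
```

Every marked pair (p,s) has the property that the p-letter prefix followed by the s-letter suffix starts with the pattern: the suffix of length m-p occurs at position j, so P[j..m] begins with the letters that complete P[1..p].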

ABF-LZW-Matcher(LZW-Code C, uncompressed pattern P)
  n ← length(C), m ← length(P)
  Init-PS-T(P)
  longest_prefix[P[1]] ← 1          /* longest prefix of P can be found in node P[1] */
  longest_suffix[P[m]] ← 1          /* longest suffix of P can be found in node P[m] */
  for i ← 1 to m do                 /* only single-character nodes can be inside of P */
    inside_node[i,i] ← P[i]
  od
  Compute-Prefix(P)
  TRIE ← start-TRIE                 /* standard LZW-Trie initialization */
  v ← 255
  prefix ← 0
  for i ← 0 to 255 do
    C(i) = “i”
  od
  for l ← 1 to n do
    (u,c) ← read-next-two-symbols(Code)
    v ← v+1
    Update_DS()                     /* insert new node v into data structure */
    Check_for_Occurrences()         /* check for occurrences of P */
  od

Update Data Structure

Update_DS()
  length[v] ← length[u]+1
  /* omitted: C[v] ← C[u]c */
  is_inside[v] ← is_inside[u]
  if longest_prefix[u] < m and P[longest_prefix[u]+1] = c then
    longest_prefix[v] ← longest_prefix[u]+1
  fi
  if length[u] < m then
    for all entries (start,end) of u in inside_node do
      if P[end+1] = c and end < m then
        inside_node[start,end+1] ← v
        Link new entry of v
      fi
    od
  fi
  if longest_suffix[u] < length[u] or P[length[v]] ≠ c then
    longest_suffix[v] ← longest_suffix[u]
  else
    longest_suffix[v] ← 1+longest_suffix[u]
    if longest_suffix[v] = m then is_inside[v] ← 1 fi
  fi

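Notes:
– The omitted assignment to C[v] is standard LZW code
– If u contains the pattern, so does v
– There is a linked list of u for all positions of inside_node pointing to u
– This occurs at most m^2 times over all rounds

Examples (node u + new letter c = new node v):
– manamm + x = manammx
– manama + n = manaman
– xyzmana + m = xyzmanam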

Check for Occurrences

Check_for_Occurrences()
  if is_inside[v] = 1 then
    report “pattern found at l”
    prefix ← longest_prefix[v]
  else if prefix = 0 then
    prefix ← longest_prefix[v]
  else if prefix + length[v] < m then
    while prefix ≠ 0 and inside-node[prefix+1,prefix+length[v]] ≠ v do
      prefix ← π(prefix)
    od
    if prefix = 0 then prefix ← longest_prefix[v]
    else prefix ← prefix+length[v]
    fi
  else
    suffix ← longest_suffix[v]
    if PS-T[prefix,suffix] = 1 then
      report “pattern found at l”
      prefix ← longest_prefix[v]
    else
      prefix ← longest_prefix[v]
    fi
  fi

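Notes:
– The fallback of prefix in the while loop works like in the KMP-Matcher
– This occurs at most |Σ| m^2 times

(Figure examples: xyzmanamanaxyz; xyzmana followed by man; xyzmana followed by namanaxy)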

Running Time of the Matcher

Initialization needs time O(m^2)

Amortized analysis leads to (additional) time for checking of inner words of O(min{N, |Σ| m^2})
– Every inner word occurs at most |Σ| times
– where N is the length of the uncompressed text
– and n is the length of the compressed text

Run time: O(n + m^2 + min{N, |Σ| m^2})

For small search patterns this is faster than the alternative
– which is: decompress and apply the Boyer-Moore-Matcher

Text Compression Allowing Fast Searching Directly

“A Text Compression Scheme that Allows Fast Searching Directly in the Compressed File”, Udi Manber, ACM Trans. Inf. Systems, Vol. 15, No. 2, 1997, p. 124-136

Idea:
– Do not use LZ-compression or Huffman codes
– Combine some letter pairs (a,b) and encode them into the “free” ASCII space (128-255)
– Let f(a,b) denote the weight of such a pair
– Encode the 128 most frequent pairs into a letter of {128,..,255} each
– Use only non-overlapping pairs from V1 × V2, where V1 and V2 are disjoint
– The sum of the weights f(a,b) gives the compression ratio
– Then one can apply the Boyer-Moore algorithm directly on the code
– since pattern and text will be encoded with the same byte string

Problem: Choosing these sets optimally is NP-complete!

Solution:
– Greedy heuristic (of unclear performance) gives a compression rate of 28-33%

Example

The most common “digraphs”:
– th er on an re he in ed nd ha ...

Encoding: f(th)=128, f(er)=129, f(on)=130, f(an)=131, f(ed)=132

No compression: re, he, in, nd, ha
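The pair encoding and the search in the compressed text can be illustrated with a short Python sketch that uses the five codes from the example. It is a simplification: `in` and `find` on byte strings stand in for the Boyer-Moore matcher, and the sketch ignores the boundary case where the first letter of the pattern is the second half of a pair in the text (or the last letter is the first half), which a complete scheme has to treat separately. The names PAIRS, encode and search are hypothetical.

```python
# Codes from the example above: first letters {t, e, o, a}, second letters {h, r, n, d}.
PAIRS = {b"th": 128, b"er": 129, b"on": 130, b"an": 131, b"ed": 132}


def encode(data):
    """Replace every listed letter pair by its one-byte code, scanning left to right."""
    out = bytearray()
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair in PAIRS:
            out.append(PAIRS[pair])      # two letters become one byte
            i += 2
        else:
            out.append(data[i])          # plain copy of a single letter
            i += 1
    return bytes(out)


def search(pattern, text):
    """Position of the encoded pattern in the encoded text, or -1 if absent."""
    return encode(text).find(encode(pattern))


text = b"the other one"
print(len(text), len(encode(text)))      # sizes before and after encoding
print(search(b"other", text))
```

Because the first letters and the second letters of the pairs come from disjoint sets, two pairs can never overlap in the text, so a pair is encoded the same way wherever it occurs.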

Chapter III
Searching the Web
15 Nov 2004

Problems of Searching the Web

Currently (Nov 2004) more than 8 billion (= 8,000 million) web pages
– 10,000 words cover more than 95% of each text
– many more web pages than words
– Users hardly ever look through more than 40 results

The problem is not to find a pattern, but to find the most important pages

Problems:
– Important pages do not contain the search pattern
  • www.porsche.com does not contain sports car or even car
  • www.google.com does not contain web search engine
  • www.airbus.com does not contain airplane
– Certain pages have nearly every word (dictionary)
– Names are misleading
  • http://www.whitehouse.org/ is not the web-site of the White House
  • www.theonion.com is not about vegetables
– Certain patterns can be found everywhere, e.g. page, web, windows, ...

How to rank Web-pages

The main problem in searching the web is to rank pages by importance

Links are very helpful:
– Links are usually introduced by humans on purpose
– The context of a link gives some clues about the meaning of the web-page
– Pages that many people point to are probably very important
– Most search engines rely on links

Other approach: Ontology of words
– Compare the combination of words with the search word
– Good for comparing text
– Difficult if single word patterns are given

Thanks for your attention
End of 5th lecture

Next lecture: Mo 22 Nov 2004, 11.15 am, FU 116
Next exercise class: Mo 15 Nov 2004, 1.15 pm, F0.530 or We 17 Nov 2004, 1.00 pm, E2.316
