Search Algorithms, Winter Semester 2004/2005, 15 Nov 2004, 5th Lecture
HEINZ NIXDORF INSTITUTE, University of Paderborn
Algorithms and Complexity, Christian Schindelhauer
Search Algorithms, Winter Semester 2004/2005
15 Nov 2004, 5th Lecture
Christian Schindelhauer, schindel@upb.de
Chapter II: Searching in Compressed Text
15 Nov 2004
Searching in Compressed Text (Overview)
What is Text Compression
  – Definition
  – The Shannon Bound
  – Huffman Codes
  – The Kolmogorov Measure
Searching in Non-adaptive Codes
  – KMP in Huffman Codes
Searching in Adaptive Codes
  – The Lempel-Ziv Codes
  – Pattern Matching in Z-Compressed Files
  – Adapting Compression for Searching
Ziv-Lempel-Welch (LZW)-Codes
From the Ziv-Lempel family
  – LZ77, LZSS, LZ78, LZW, LZMW, LZAP
Literature
  – LZW: Terry A. Welch: "A Technique for High Performance Data Compression", IEEE Computer vol. 17 no. 6, June 1984, p. 8-19
  – LZ77: J. Ziv, A. Lempel: "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, p. 337-343
  – LZ78: J. Ziv, A. Lempel: "Compression of Individual Sequences Via Variable-Rate Coding", IEEE Transactions on Information Theory, p. 530-536
Known as the Unix command “compress”
Uses:
  – Tries
Trie = “reTRIEval TREE”
The name is taken from “reTRIEval” tree
  – for storing/encoding text
  – efficient search for equal prefixes
Structure
  – Edges are labelled with letters
  – Nodes are numbered
Mapping
  – Every node encodes a word of the text
  – The text of a node can be read on the path from the root to the node
      • Node 1 = “m”
      • Node 6 = “at”
  – Inverse direction: every word uniquely points at a node (or at least some prefix points to a leaf)
      • “it” = node 11
      • “manaman” points with “m” to node 1
Encoding of “manamanatapitipitipi”
  – 1,2,3,4,5,6,7,8,9,10,11,12 or
  – 1,5,4,5,6,7,11,10,11,10,8
Decoding of 5,11,2
  – “an”, “it”, “a” = anita
[Figure: the example trie with root 0 and nodes 1 to 12; each edge is labelled with one of the letters m, a, n, t, p, i.]
How LZW builds a TRIE
LZW
  – works bytewise
  – starts with the 256-leaf trie with leaves “a”, “b”, ... numbered with “a”, “b”, ...
LZW-Trie-Builder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      m ← m+1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i+1
  od
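[Figure: the start trie, with one leaf below the root for each byte (a, b, c, d, ..., z).]
Example: nanananananana
[Figure: after scanning “na”, a new leaf is appended below node “n” with edge label “a”.]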
Example: nanananananana (continued)
[Figure: the same trie after “na” has been scanned. The builder returns to the root and continues with “nan”; the rest of nanananananana is the residual part.]
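The trie builder can be sketched in Python as follows. This is a minimal sketch rather than the lecture's code: the dictionary-based trie, the names `build_lzw_trie`, `ROOT`, `child` and `label`, and the choice to number new nodes from 256 onwards are assumptions made for illustration.

```python
ROOT = -1  # hypothetical id for the root; the leaves 0..255 are the byte values


def build_lzw_trie(text: bytes):
    """Run the LZW trie builder on `text` and return the texts of the new nodes."""
    # Start trie: one leaf per byte, numbered by the byte value itself.
    child = {(ROOT, b): b for b in range(256)}
    label = {b: bytes([b]) for b in range(256)}  # text spelled by each node
    m = 255  # highest node number in use (assumption: new nodes start at 256)
    u = ROOT
    new_nodes = []
    for c in text:  # iterating over bytes yields ints
        if (u, c) not in child:
            # No edge labelled c under u: append a new leaf and restart at the root.
            m += 1
            child[(u, c)] = m
            label[m] = label[u] + bytes([c])
            new_nodes.append(label[m])
            u = ROOT
        else:
            u = child[(u, c)]
    return new_nodes


print(build_lzw_trie(b"nanananananana"))
# [b'na', b'nan', b'an', b'ana', b'nana']
```

On the example text the new nodes are created in the order na, nan, an, ana, nana, which matches the first steps shown on the slide ("na" scanned, then continue with "nan").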
How LZW produces the encoding
LZW-Encoder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      output (m,u,T[i])
      m ← m+1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i+1
  od
  if u ≠ root(TRIE) then
    output (u)
  fi
The output m is predictable: 256, 257, 258, ...
Therefore use only output(u,T[i])
start-TRIE = 256-leaf trie with bytes encoded as 0,1,2,...,255
An Example Encoding
LZW-Encoder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      output (u,T[i])
      m ← m+1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i+1
  od
  if u ≠ root(TRIE) then
    output (u)
  fi
[Figure: the LZW trie for the example, with the new nodes 256 to 264 below the start leaves m, n, a, i, p.]
Encoding of m a n a m a n a t a p i t i p i t i p i
  Output:    (m,a)  (n,a)  (256,n)  (a,t)  (a,p)  (i,t)  (i,p)  (261,i)  (p,i)
  New node:  256    257    258      259    260    261    262    263      264
  Phrase:    ma     na     (ma)n    at     ap     it     ip     (it)i    pi
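A minimal Python sketch of the encoder with output(u,T[i]). The function name `lzw_encode`, the tuple representation of the output and the root id `-1` are assumptions for illustration; a real implementation would pack the pairs into a bit stream.

```python
def lzw_encode(text: bytes):
    """Return the code as a list of (u, c) pairs, plus a final (u,) if needed."""
    ROOT = -1  # hypothetical root id
    # Start trie: 256 leaves, numbered by their byte value 0..255.
    child = {(ROOT, b): b for b in range(256)}
    m = 255  # new nodes are numbered 256, 257, ... (assumption)
    u = ROOT
    out = []
    for c in text:
        if (u, c) not in child:
            out.append((u, c))     # node reached so far, plus the plain next byte
            m += 1
            child[(u, c)] = m      # append leaf m to u with edge label c
            u = ROOT
        else:
            u = child[(u, c)]
    if u != ROOT:
        out.append((u,))           # leftover match that created no new node
    return out


print(lzw_encode(b"mam"))
# [(109, 97), (109,)]
```

For the text manamanatapitipitipi this sketch yields the nine pairs of the example, with the letters given as byte values (m = 109, a = 97, and so on).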
The Decoder
LZW-Decoder(Code)
  i ← 1
  TRIE ← start-TRIE
  m ← 255
  for i ← 0 to 255 do C(i) = “i” od
  while not end of file do
    (u,c) ← read-next-two-symbols(Code)
    if c exists then
      output (C(u), c)
      m ← m+1
      append leaf m to u with edge label c
      C(m) ← (C(u),c)
    else
      output (C(u))
    fi
  od
If the last string of the code did not produce a new node in the trie, then output the corresponding string
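The decoder can be sketched in the same style. This is an illustrative sketch under assumptions: the code is taken to be a list of `(u, c)` tuples with an optional final `(u,)`, and the name `lzw_decode` is hypothetical. Only the table C of node texts is needed, not the trie edges.

```python
def lzw_decode(code):
    """Decode a list of (u, c) pairs, optionally ending with a single (u,)."""
    # C maps a node number to the text it encodes; the start nodes are the bytes.
    C = {b: bytes([b]) for b in range(256)}
    m = 255
    out = bytearray()
    for item in code:
        if len(item) == 2:         # "c exists"
            u, c = item
            phrase = C[u] + bytes([c])
            out += phrase
            m += 1
            C[m] = phrase          # the new node m encodes C(u) followed by c
        else:                      # last symbol produced no new node
            out += C[item[0]]
    return bytes(out)


print(lzw_decode([(109, 97), (109,)]))
# b'mam'
```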
Performance of LZW
Encoding can be performed in time O(n)
  – where n is the length of the given text
Decoding can be performed in time O(n)
  – where n is the length of the uncompressed output
The memory consumption is linear in the size of the compressed code
LZW can be nicely implemented in hardware
There is no software patent
  – so it is very popular, see “compress” for UNIX
LZW can be further compressed using Huffman codes
  – Every second symbol of the code is a plain copy from the text!
Search in LZW is difficult
  – The encoding is embedded in the text (adaptive encoding)
  – For one search in a text there is a linear number of possible encodings of the search pattern (!)
The Algorithm of Amir, Benson & Farach: “Let Sleeping Files Lie”
Ideas
  – Build the trie, but do not decode
  – Use the KMP-Matcher with the nodes of the LZW-Trie
  – Prepare a data structure based on the pattern of length m
  – Then, scan the text and update this data structure
Goal: Running time of O(n + f(m))
  – where n is the code length
  – f(m) is some small polynomial depending on the pattern length m
  – for well compressed codes and f(m) < n it should be faster than decoding and then running text search
Searching in LZW-Codes: Inside a node
Example: Search for tapioca
[Figure: a sequence of node texts; tapioca lies entirely inside the node “abtapiocaab”.]
  – tapioca is “inside” a node
  – Then we have found tapioca
For all nodes u of a trie:
  – Set Is_inside[u] = 1 if the text of u contains the pattern
Searching in LZW-Codes: Torn apart
Example: Search for tapioca
[Figure: tapioca spread over consecutive nodes: “tap” at the end of one node, “io” in the next, and so on.]
  – Starting somewhere in a node
  – Parts are hidden in some other nodes
  – The end is the start of another node
  – All parts are nodes of the LZW-Trie
Finding the start: longest_prefix
The Suffix of Nodes = Prefix of Patterns
Is the suffix of the node a prefix of the pattern?
  – And if yes, how long is it?
  – Classify all nodes of the trie
For a very long text encoded by a node, only the last m letters matter
Can be computed using the KMP-Matcher algorithm while building the trie
Example:
  – Pattern: “manamana”
  – pamana: the last four letters are the first four of the pattern, result: 4
  – mama: the length of the suffix of the node which is a prefix of the pattern is 2
  – papa: result: 0
  – mana: result: 4
  – amanaplanacanalpamana: result: 4
  – amanaplanacanalpamana + m = amanaplanacanalpamanam: the matched prefix grows by one letter, result: 5
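The definition of longest_prefix can be checked with a direct Python sketch. It is deliberately naive (it tries every length instead of using the KMP-Matcher as the slide suggests), and the function name and signature are assumptions for illustration.

```python
def longest_prefix(node_text: str, pattern: str) -> int:
    """Length of the longest suffix of node_text that is a prefix of pattern."""
    # Only the last m letters of the node text can matter, so start at min(len, m).
    for k in range(min(len(node_text), len(pattern)), 0, -1):
        if node_text.endswith(pattern[:k]):
            return k
    return 0


for text in ["pamana", "mama", "papa", "mana", "amanaplanacanalpamanam"]:
    print(text, longest_prefix(text, "manamana"))
```

In the matcher itself the value for a new node is derived from its parent in constant time; the loop here only restates the definition.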
Is the node inside of a Pattern?
Find positions where the text of the node is inside the pattern
Several occurrences are possible
  – e.g. one letter
  – There are at most m(m-1)/2 encodings of such sub-strings
  – For every sub-string there is exactly one node that fits
Define table Inside-Node of size $O(m^2)$
  – Inside-Node[start,end] := node that encodes the pattern part P[start]..P[end]
From Inside-Node[start,end] one can derive Inside-Node[start,end+1] as soon as the corresponding node is created
To quickly find all occurrences use pointers
  – Next-inside-occurrence(start,end) indicates the next position where the substring lies
  – It is initialized for start=end with the next occurrence of the letter
Example:
  – Pattern: “manamana”
  – ana: this text could be in positions 2-4 or positions 6-8 of the pattern
  – ana + m = anam: result: (2,5)
  – rorororororo: result: (0,0), is not in the pattern
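The positions stored in Inside-Node can be illustrated with a brute-force Python sketch. The helper `inside_positions` is hypothetical; it returns all 1-based (start,end) pairs at once and uses an empty list where the slide writes (0,0), whereas the algorithm keeps them in a table with linked lists.

```python
def inside_positions(node_text: str, pattern: str):
    """All 1-based (start, end) with pattern[start..end] == node_text."""
    k = len(node_text)
    return [
        (s + 1, s + k)
        for s in range(len(pattern) - k + 1)   # empty range if node_text is too long
        if pattern[s:s + k] == node_text
    ]


print(inside_positions("ana", "manamana"))
# [(2, 4), (6, 8)]
print(inside_positions("anam", "manamana"))
# [(2, 5)]
```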
Finding the End: longest_suffix
Prefix of the Node = Suffix of the Pattern
Is the prefix of the node a suffix of the pattern?
  – And if yes, does it complete the pattern, if already i letters were found?
  – Classify all nodes of the trie
For a very long text encoded by a node, only the first m letters matter
Since the text is added at the right side, this property can be derived from the ancestor
Example:
  – Pattern: “manamana”
  – ananimal: here 3 and 1 could be the solution. We take 3, because 1 can be derived from 3 using the technique shown in the KMP-Matcher (applied to the reverse string)
  – manamanamana: result: 8
  – panamacanal: result: 0
  – manammanaaaaaaaaaaaa + m = manammanaaaaaaaaaaaam: the result is the same as for the ancestor
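A matching naive sketch for longest_suffix, again only restating the definition. The function name is an assumption, and the loop replaces the constant-time update from the ancestor that the algorithm uses.

```python
def longest_suffix(node_text: str, pattern: str) -> int:
    """Length of the longest prefix of node_text that is a suffix of pattern."""
    # Only the first m letters of the node text can matter.
    for k in range(min(len(node_text), len(pattern)), 0, -1):
        if node_text.startswith(pattern[-k:]):
            return k
    return 0


for text in ["ananimal", "manamanamana", "panamacanal"]:
    print(text, longest_suffix(text, "manamana"))
```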
How does it fit?
On the left side we have the maximum prefix of the pattern
On the right side we have the maximum suffix of the pattern
Example: 10 letter pattern: pamapamapa
  – Left node: panapamapama, 8 letter prefix found
  – Right node: pamapana, 6 letter suffix found
  – 14 letters?
  – Yet the pattern is inside, though, since the first 8 letters followed by the last 6 letters of the pattern contain the pattern
Solution: Define prefix-suffix-table PS-T(p,s) = 1 if the p-letter prefix of P followed by the s-letter suffix of P contains the pattern
Computing the PS-Table in time $O(m^3)$
For all p and s such that $p+s \ge m$ compute PS-T[p,s]
Run the KMP-Matcher for pattern P in P[1..p]P[m-s+1..m]
  – needs time O(m) for each combination of p and s
Leads to run time of $O(m^3)$
Example: 10 letter pattern: pamapamapa
  – [Figure: left node xyzpamapama, right node pamapaxyz]
  – If pattern pamapamapa is found in text pamapamapamapa then PS-T[8,6] = 1
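The cubic computation can be sketched directly from the definition of PS-T. The sketch below is an assumption-laden illustration: it uses Python's built-in substring test in place of the KMP-Matcher, stores the table in a dictionary keyed by (p, s), and the name `ps_table_naive` is hypothetical.

```python
def ps_table_naive(P: str):
    """PS-T[p, s] = 1 iff the p-letter prefix followed by the s-letter suffix contains P."""
    m = len(P)
    table = {}
    for p in range(1, m + 1):
        for s in range(1, m + 1):
            if p + s >= m:  # shorter concatenations cannot contain P
                candidate = P[:p] + P[m - s:]
                table[p, s] = 1 if P in candidate else 0
    return table


t = ps_table_naive("pamapamapa")
print(t[8, 6], t[8, 2], t[8, 8])
# 1 1 0
```

For the pattern pamapamapa the entry for prefix 8 and suffix 6 is set, as in the example, while prefix 8 with suffix 8 gives pamapamamapamapa, which does not contain the pattern.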
Computing the Prefix-Suffix-Table in Time $O(m^2)$: Preparation
ptr[i,j] = next left position from i where the suffix of P of length j occurs
$\mathrm{ptr}[i,j] = \max\{k < i \mid P[m-j+1..m] = P[k..k+j-1] \text{ or } k = 0\}$
[Figure: the pattern p a m a p a m a p a, repeated five times to illustrate ptr.]
Computing the Prefix-Suffix-Table in time $O(m^2)$: Initialization
Init-ptr(P)
  m ← length(P)
  for i ← 1 to m+1 do ptr[i,0] ← i-1 od
  for j ← 1 to m-1 do
    last ← m-j+1
    i ← ptr[last+1,j-1]-1
    while i > 0 do
      if P[i]=P[last] then
        ptr[last,j] ← i
        last ← i
      fi
      i ← ptr[i+1,j-1]-1
    od
  od
[Figure: the pattern p a m a p a m a p a, repeated five times to illustrate the ptr chains.]
Run time: $O(m^2)$
Computing the Prefix-Suffix-Table in time $O(m^2)$
Init-PS-T(P)
  m ← length(P)
  ptr ← Init-ptr(P)
  for i ← 1 to m-1 do
    j ← i+1
    while j ≠ 0 do
      PS-T[i,m-j+1] = 1
      j ← ptr[j,m-i]
    od
  od
Example: pattern p a m a p a m a p a, i = 8
  – j = 9: PS-T[8,2]=1
  – j = ptr[9,2] = 5: PS-T[8,6]=1
  – j = ptr[5,2] = 1: PS-T[8,10]=1
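Init-ptr and Init-PS-T can be transcribed into Python as a sketch. Several details are assumptions of this transcription, not statements from the slides: the pattern is shifted to 1-based indexing with a dummy character, unset ptr entries default to 0, ptr[m+1,0] is initialized as well because the first round reads it, the inner loop runs while i > 0, and the table is returned as a set of (p, s) pairs.

```python
from collections import defaultdict


def init_ptr(P: str):
    m = len(P)
    p = " " + P                        # dummy at index 0, so p[1..m] is the pattern
    ptr = defaultdict(int)             # assumption: missing entries mean "no occurrence" (0)
    for i in range(1, m + 2):          # includes ptr[m+1, 0], read when j = 1
        ptr[i, 0] = i - 1
    for j in range(1, m):
        last = m - j + 1               # where the suffix of length j starts
        i = ptr[last + 1, j - 1] - 1
        while i > 0:
            if p[i] == p[last]:        # the shorter suffix at i+1 extends one letter to the left
                ptr[last, j] = i
                last = i
            i = ptr[i + 1, j - 1] - 1
    return ptr


def init_ps_t(P: str):
    m = len(P)
    ptr = init_ptr(P)
    ps = set()                         # (p, s) is in ps iff PS-T[p, s] was set to 1
    for i in range(1, m):
        j = i + 1
        while j != 0:
            ps.add((i, m - j + 1))
            j = ptr[j, m - i]
    return ps


ps = init_ps_t("pamapamapa")
print(sorted(s for (p, s) in ps if p == 8))
# [2, 6, 10]
```

For prefix length 8 the loop visits j = 9, 5, 1, following ptr[9,2] = 5 and ptr[5,2] = 1.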
ABF-LZW-Matcher(LZW-Code C, uncompressed pattern P)
  n ← length(C), m ← length(P)
  Init-PS-T(P)
  longest_prefix[P[1]] ← 1        /* longest prefix of P can be found in node P[1] */
  longest_suffix[P[m]] ← 1        /* longest suffix of P can be found in node P[m] */
  for i ← 1 to m do               /* only single-character nodes can be inside of P */
    inside_node[i,i] ← P[i]
  od
  Compute-Prefix(P)
  TRIE ← start-TRIE               /* standard LZW-Trie initialization */
  v ← 255
  prefix ← 0
  for i ← 0 to 255 do
    C(i) = “i”
  od
  for l ← 1 to n do
    (u,c) ← read-next-two-symbols(Code)
    v ← v+1
    Update_DS()                   /* insert new node v into data structure */
    Check_for_Occurrences()       /* check for occurrences of P */
  od
Update Data Structure
Update_DS()
  length[v] ← length[u]+1                       /* standard LZW code */
  /* omitted: C[v] ← C[u]c */
  is_inside[v] ← is_inside[u]                   /* if u contains the pattern, so does v */
  if longest_prefix[u] < m and P[longest_prefix[u]+1] = c then
    longest_prefix[v] ← longest_prefix[u]+1
  fi
  if length[u] < m then
    /* there is a linked list of u for all positions of inside_node pointing to u */
    for all entries (start,end) of u in inside_node do
      if P[end+1] = c and end < m then
        inside_node[start,end+1] ← v            /* this occurs at most $m^2$ times over all rounds */
        link new entry of v
      fi
    od
  fi
  if longest_suffix[u] < length[u] or P[length[v]] ≠ c then
    longest_suffix[v] ← longest_suffix[u]
  else
    longest_suffix[v] ← 1+longest_suffix[u]
    if longest_suffix[v] = m then is_inside[v] ← 1 fi
  fi
[Figure: example node extensions: manamm + x = manammx, manama + n = manaman, xyzmana + m = xyzmanam.]
Check for Occurrences
Check_for_Occurrences()
  if is_inside[v] = 1 then
    return “pattern found at l”
    prefix ← longest_prefix[v]
  else if prefix = 0 then
    prefix ← longest_prefix[v]
  else if prefix + length[v] < m then
    while prefix ≠ 0 and inside-node[prefix+1,prefix+length[v]] ≠ v do
      prefix ← π(prefix)                        /* like in KMP-Matcher */
    od                                          /* this occurs at most $|\Sigma| m^2$ times */
    if prefix = 0 then prefix ← longest_prefix[v]
    else prefix ← prefix+length[v]
    fi
  else
    suffix ← longest_suffix[v]
    if PS-T[prefix,suffix] = 1 then
      return “pattern found at l”
      prefix ← longest_prefix[v]
    else
      prefix ← longest_prefix[v]
    fi
  fi
[Figure: example node sequences: xyzmanamanaxyz (pattern inside one node); xyzmana followed by man; xyzmana followed by namanaxy.]
Running Time of the Matcher
Initialization needs time $O(m^2)$
Amortized analysis leads to (additional) time for checking of inner words of $O(\min\{N, |\Sigma| m^2\})$
  – Every inner word occurs at most $|\Sigma|$ times
  – where N is the length of the uncompressed text
  – and n is the length of the compressed text
Run time: $O(n + m^2 + \min\{N, |\Sigma| m^2\})$
For small search patterns faster than the alternative
  – which is: decompress and apply the Boyer-Moore-Matcher
Text Compression Allowing Fast Searching Directly
“A Text Compression Scheme that Allows Fast Searching Directly in the Compressed File”, Udi Manber, ACM Trans. Inf. Systems, Vol. 15, No. 2, 1997, 124-136
Idea:
  – Do not use LZ-compression or Huffman codes
  – Combine some letter pairs (a,b) and encode them into the “free” ASCII space (128-255)
  – Let f(a,b) denote the weight of such a pair
  – Encode the 128 most frequent pairs into a letter of {128,..,255} each
  – Use only non-overlapping pairs from V1 × V2, where V1 and V2 are disjoint, i.e. V1 ∩ V2 = ∅
  – Sum of weights of f(a,b) gives the compression ratio
  – Then one can apply the Boyer-Moore-Algorithm directly on the code
  – Since pattern and text will be encoded with the same byte string
Problem: Choosing these sets optimally is NP-complete!
Solution:
  – Greedy heuristic (of unclear performance) gives compression rate of 28-33%
Example
The most common “digraphs”:
  – th er on an re he in ed nd ha ...
Encoding: f(th)=128, f(er)=129, f(on)=130, f(an)=131, f(ed)=132
No compression: re, he, in, nd, ha
[Figure: bipartite graph between the letter sets V1 and V2 with the letters t, h, e, r, o, n, i, a, d.]
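The pair encoding of this example can be sketched in Python. The sketch is illustrative only: `CODES` holds just the five digraphs named above (with the code numbers of the example), `encode_digraphs` is a hypothetical helper, and the input is assumed to be plain 7-bit ASCII. Because the first letters {t, e, o, a} and the second letters {h, r, n, d} are disjoint, two encoded pairs can never overlap, so a left-to-right scan is unambiguous.

```python
# Code table from the example: five digraphs mapped into the free range 128..255.
CODES = {"th": 128, "er": 129, "on": 130, "an": 131, "ed": 132}


def encode_digraphs(text: str) -> bytes:
    """Replace every digraph in CODES by its one-byte code, copy all other letters."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CODES:
            out.append(CODES[pair])
            i += 2
        else:
            out.append(ord(text[i]))   # assumes 7-bit ASCII input
            i += 1
    return bytes(out)


code = encode_digraphs("the other")
print(list(code))
# [128, 101, 32, 111, 128, 129]
print(encode_digraphs("other") in code)
# True
```

The second call shows the point of the scheme: the encoded pattern occurs as a byte string in the encoded text, so a plain byte matcher can be run on it. A pattern whose first letter is in V2 or whose last letter is in V1 may be merged with a neighbouring letter in the text; this sketch does not handle that case.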
Chapter III: Searching the Web
15 Nov 2004
Problems of Searching the Web
Currently (Nov 2004) more than 8 billion = 8,000 million web pages
  – 10,000 words cover more than 95% of each text
  – many more web pages than words
  – Users hardly ever look through more than 40 results
The problem is not to find a pattern, but to find the most important pages
Problems:
  – Important pages do not contain the search pattern
      • www.porsche.com does not contain sports car or even car
      • www.google.com does not contain web search engine
      • www.airbus.com does not contain airplane
  – Certain pages have nearly every word (dictionary)
  – Names are misleading
      • http://www.whitehouse.org/ is not the web-site of the white house
      • www.theonion.com is not about vegetables
  – Certain patterns can be found everywhere, e.g. page, web, windows, ...
How to rank Web-pages
The main problem in searching the web is to rank pages by importance
Links are very helpful:
  – Links are usually introduced by humans on purpose
  – The context of the links gives some clues about the meaning of the web page
  – Pages that many people link to are probably very important
  – Most search engines rely on links
Other approach: Ontology of words
  – Compare the combination of words with the search word
  – Good for comparing text
  – Difficult if single word patterns are given
Thanks for your attention
End of 5th lecture
Next lecture: Mo 22 Nov 2004, 11.15 am, FU 116
Next exercise class: Mo 15 Nov 2004, 1.15 pm, F0.530 or We 17 Nov 2004, 1.00 pm, E2.316