Efficient Top-k Algorithms for Fuzzy Search in String Collections
Rares Vernica, Chen Li
Department of Computer Science, University of California, Irvine
First International Workshop on Keyword Search on Structured Data, 2009
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17
Outline
1 Motivation
2 Efficient Top-k Algorithms
   Problem Formulation
   Algorithms Overview
   Top-k Single-Pass Search Algorithm
3 Experimental Evaluation
Need for Approximate String Queries
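| ID | FirstName | LastName | # Movies |
|---|---|---|---|
| 10 | Al | Swartzberg | 1 |
| 11 | Hanna | Wartenegg | 1 |
| 12 | Rik | Swartzwelder | 30 |
| 13 | Joey | Swartzentruber | 1 |
| 14 | Rene | Swartenbroekx | 4 |
| 15 | Arnold | Schwarzenegger | 283 |
| 16 | Luc | Swartenbroeckx | 1 |
| ... | ... | ... | ... |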
Figure: Actor names and popularities
SELECT * FROM Actors
WHERE LastName = 'Shwartzenetrugger'
ORDER BY "# Movies" DESC LIMIT 1;

0 Results
Need for Ranking
| ID | FirstName | LastName | # Movies | ED |
|---|---|---|---|---|
| 10 | Al | Swartzberg | 1 | 8 |
| 11 | Hanna | Wartenegg | 1 | 8 |
| 12 | Rik | Swartzwelder | 30 | 8 |
| 13 | Joey | Swartzentruber | 1 | 4 |
| 14 | Rene | Swartenbroekx | 4 | 9 |
| 15 | Arnold | Schwarzenegger | 283 | 5 |
| 16 | Luc | Swartenbroeckx | 1 | 9 |
| ... | ... | ... | ... | ... |

Figure: Actor names, popularities, and edit distances (ED) to a query string “Shwartzenetrugger”.

Which one result should the system return?
Which value is more important, # Movies or similarity?
Top-k Similar Strings
Given:
- Weighted string collection, e.g., actors’ LastName and # Movies
- Query string, e.g., “Shwartzenetrugger”
- Similarity function, e.g., Edit Distance
- Scoring function (score of a string in terms of similarity and weight), e.g., linear combination of similarity and popularity
- Integer k

Return: the k best strings in terms of overall score with respect to the query string.

Advantages over Range Search:
- Specify k instead of a similarity threshold
- Guaranteed k results; a range search might have 0 results
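The problem statement above can be illustrated with a brute-force baseline that scores every string in the collection. This is only a sketch under stated assumptions: the slides say the scoring function may be a linear combination of similarity and weight but do not give its exact form, so the normalization of edit distance to a similarity in [0, 1], the `alpha` mixing parameter, and the function names are all hypothetical. Weights are assumed to lie in [0, 1].

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def score(query: str, string: str, weight: float, alpha: float = 0.5) -> float:
    """Hypothetical linear score: alpha * similarity + (1 - alpha) * weight."""
    longest = max(len(query), len(string)) or 1
    similarity = 1.0 - edit_distance(query, string) / longest
    return alpha * similarity + (1.0 - alpha) * weight


def top_k_brute_force(collection, query: str, k: int, alpha: float = 0.5):
    """Return the k best (string, weight) pairs by scoring every string."""
    return heapq.nlargest(
        k, collection, key=lambda sw: score(query, sw[0], sw[1], alpha)
    )


if __name__ == "__main__":
    toy = [("ab", 0.80), ("ccd", 0.70), ("cd", 0.60), ("abcd", 0.50), ("bcc", 0.40)]
    print(top_k_brute_force(toy, "bcd", k=2))
```

The baseline computes an edit distance for every string, which is exactly the cost the index-based algorithms in the rest of the talk try to avoid.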
Algorithms Overview
Iterative Range Search · Single-Pass Search · Two-Phase Search
[Figure: diagram of the three algorithms; it contains two “Range Search” component boxes]
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”
q-gram: a substring of length q of a string, e.g., q = 2
Vernica → {Ve, er, rn, ni, ic, ca}
Veronica → {Ve, er, ro, on, ni, ic, ca}
Similar strings share a certain number of grams
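A minimal sketch of the q-gram decomposition shown above, together with the standard count-filter bound that makes "a certain number" precise: two strings within edit distance k share at least max(|s1|, |s2|) - q + 1 - k·q q-grams, because a string of length n has n - q + 1 grams and each edit operation can destroy at most q of them. The function names are illustrative and not taken from the slides.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def shared_grams(a: str, b: str, q: int = 2) -> int:
    """Number of q-grams the two strings have in common (bag intersection)."""
    common = Counter(qgrams(a, q)) & Counter(qgrams(b, q))
    return sum(common.values())


def min_shared_grams(a: str, b: str, q: int, k: int) -> int:
    """Count-filter lower bound for strings within edit distance k.

    A value <= 0 means the bound gives no pruning power for this pair.
    """
    return max(len(a), len(b)) - q + 1 - k * q


if __name__ == "__main__":
    print(qgrams("Vernica"))
    print(qgrams("Veronica"))
    print(shared_grams("Vernica", "Veronica"),
          min_shared_grams("Vernica", "Veronica", q=2, k=1))
```

For “Vernica” and “Veronica”, which are one insertion apart, the shared-gram count meets the bound for k = 1.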
q-gram Inverted List Index
q = 2
| ID | String | Weight |
|---|---|---|
| 1 | ab | 0.80 |
| 2 | ccd | 0.70 |
| 3 | cd | 0.60 |
| 4 | abcd | 0.50 |
| 5 | bcc | 0.40 |

Figure: Dataset

| ab | cc | cd | bc |
|---|---|---|---|
| 1 | 2 | 2 | 4 |
| 4 | 5 | 3 | 5 |
|   |   | 4 |   |

Figure: Gram inverted-list index

Query string: “bcd”
- Identified strings are verified by computing the real similarity.
- Verification is usually an expensive process.
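The index in the figure can be rebuilt from the dataset with a few lines of code. The sketch below is an illustration, not the authors' implementation: `build_index` and `candidates` are hypothetical names, each gram is recorded once per string, and `min_shared` stands in for whatever count threshold the similarity function implies. Candidates returned here would still need verification against the real similarity function.

```python
from collections import defaultdict

# Dataset from the slide: ID -> string.
DATASET = {1: "ab", 2: "ccd", 3: "cd", 4: "abcd", 5: "bcc"}


def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(dataset: dict[int, str], q: int = 2) -> dict[str, list[int]]:
    """Map each gram to the ascending list of IDs of strings containing it."""
    index = defaultdict(list)
    for string_id in sorted(dataset):
        for gram in set(qgrams(dataset[string_id], q)):
            index[gram].append(string_id)
    return dict(index)


def candidates(index, query: str, q: int = 2, min_shared: int = 1) -> dict[int, int]:
    """Count how many query grams each ID shares; keep IDs meeting the threshold."""
    counts = defaultdict(int)
    for gram in set(qgrams(query, q)):
        for string_id in index.get(gram, []):
            counts[string_id] += 1
    return {sid: c for sid, c in counts.items() if c >= min_shared}


if __name__ == "__main__":
    index = build_index(DATASET)
    print(index)
    print(candidates(index, "bcd"))                # every ID on the "bc" or "cd" list
    print(candidates(index, "bcd", min_shared=2))  # IDs sharing both query grams
```

The query “bcd” has the grams bc and cd, so only those two lists are read; the ab and cc lists are never touched.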
Top-k Single-pass Search Algorithm
Setup
- Assign IDs such that ascending order of IDs ≡ descending order of weights
- Sort the IDs on each list in ascending order
- Scan the lists corresponding to the grams in the query, e.g., “bcd”
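The setup steps can be sketched as follows. The helper names `assign_ids` and `query_lists` are hypothetical, and ties between equal weights are broken by input order here, which is an assumption the slides do not address. Because IDs are handed out in descending weight order, reading any list front to back visits strings from heaviest to lightest.

```python
def assign_ids(weighted_strings):
    """Give ID 1 to the heaviest string, ID 2 to the next heaviest, and so on."""
    ranked = sorted(weighted_strings, key=lambda sw: sw[1], reverse=True)
    return {new_id: sw for new_id, sw in enumerate(ranked, start=1)}


def query_lists(records, query: str, q: int = 2):
    """For each gram of the query, the ascending list of IDs whose string contains it."""
    grams = dict.fromkeys(query[i:i + q] for i in range(len(query) - q + 1))
    return {
        gram: [sid for sid in sorted(records) if gram in records[sid][0]]
        for gram in grams
    }


if __name__ == "__main__":
    # The slide's dataset, given in arbitrary order.
    data = [("bcc", 0.40), ("ab", 0.80), ("abcd", 0.50), ("cd", 0.60), ("ccd", 0.70)]
    records = assign_ids(data)
    print(records)
    print(query_lists(records, "bcd"))
```

With the slide's dataset this reproduces the IDs 1 to 5 shown in the dataset figure, and the query “bcd” selects the bc and cd lists.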
Naïve approach: Round-Robin
- Scan all the lists at the same time
- Maintain a list of “open” IDs (IDs that might still appear on some of the lists)
- Store the best k “closed” IDs in a top-k buffer
- Stop when the top-k buffer cannot improve
Heap-based
- Traverse the lists in sorted order using a heap on the top IDs of the lists
Advantages:
1 No need to maintain the list of “open” IDs
2 Skip elements
Figure: Gram inverted lists for query “abcd” (ab: 1, 20, 21, ...; bc: 2, 3, 4, ..., 19, 20, ...; cd: 2, 4, 5, ..., 19, 20, ...)
Figure: Top-k buffer, k = 1
Figure: Min-heap
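The core of the heap-based traversal is a k-way merge of the sorted ID lists. The sketch below shows only that merge, under the assumption that each list is sorted ascending; it deliberately leaves out the skipping of elements and the early-termination test against the top-k buffer, whose details the slides show only as an animation. The function name `merge_counts` is hypothetical.

```python
import heapq


def merge_counts(lists):
    """Yield (string_id, occurrences) in ascending ID order from sorted ID lists.

    The heap holds one cursor per list: (current ID, list index, position).
    All occurrences of the smallest ID are popped together, so an ID is
    complete ("closed") as soon as it leaves the heap and no separate
    bookkeeping of "open" IDs is needed.
    """
    heap = [(lst[0], li, 0) for li, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        current = heap[0][0]
        count = 0
        while heap and heap[0][0] == current:
            _, li, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[li]):
                # Advance this list's cursor to its next ID.
                heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
        yield current, count


if __name__ == "__main__":
    # Lists for the grams ab, bc, cd of the query "abcd" over the small
    # five-string dataset used earlier in the talk.
    ab, bc, cd = [1, 4], [4, 5], [2, 3, 4]
    for string_id, count in merge_counts([ab, bc, cd]):
        print(string_id, count)
```

Since ascending ID order corresponds to descending weight, the merge emits candidates from heaviest to lightest, each with the number of query grams it shares.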
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 35: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/35.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1
→
2
→
2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
1
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 36: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/36.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1 →2
→
2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
12
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 37: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/37.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1 →2 →2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
122
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 38: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/38.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1 →2 →2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
122
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 39: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/39.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain thelist of “open” IDs
2 Skip elements
ab bc cd→1 →2 →2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
22
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 40: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/40.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1 →2 →2→20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
22
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 41: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/41.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1 →2 →2→20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
22
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 42: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/42.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1 →2 →2→20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 43: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/43.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
34
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 44: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/44.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
34
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
![Page 45: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search](https://reader033.vdocuments.mx/reader033/viewer/2022060515/5f88ac667033cd7a8577b9c2/html5/thumbnails/45.jpg)
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
420
Figure:Min-heap
Top-k Single-pass Search Algorithm
Figure: Gram inverted lists for query “abcd” (lists ab, bc, cd)
Figure: Top-k buffer, k = 1 (contents: 2)
Figure: Min-heap (contents: 20)
Algorithms Overview
Figure: Iterative Range Search, Single-Pass Search, and Two-Phase Search (diagram with boxes labeled “Range Search”)
Experimental Setting
Datasets:
IMDB Actor Names ¹
  actor names and the number of movies they played in
  1.2 million actors, average name length 15
  weight is the number of movies (log normalized)
WEB Corpus Word Grams ²
  sequences of English words and their frequency on the Web
  2.4 million sequences, average sequence length 20
  weight is the frequency (log normalized)
Jaccard similarity and normalized edit similarity, q = 3
Index and data are stored in main memory at all times
Implemented in C++ (GNU compiler) on Ubuntu Linux OS
Intel 2.40GHz PC, 2GB RAM
¹ http://www.imdb.com/interfaces
² http://www.ldc.upenn.edu/Catalog LDC2006T13
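For readers unfamiliar with the two similarity functions named above, here is a small sketch with q = 3. The slides do not define the functions, so the definitions below are assumptions: Jaccard is taken over sets of unpadded q-grams, and normalized edit similarity is taken as 1 minus the edit distance divided by the longer string's length. All function names are illustrative.

```python
def qgrams(s: str, q: int = 3) -> set:
    """Set of overlapping substrings of length q (no padding; an assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: str, b: str, q: int = 3) -> float:
    """Jaccard similarity of the two q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:
        # Both strings are shorter than q; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(union)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def normalized_edit_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max length (assumed definition)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```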
Benefits of Skipping Elements
Figure: Average running time for top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass Search (SPS) algorithm and SPS without skipping (SPS*). Axes: Dataset Size (millions), 0.2 to 1.2; Time (ms), 0 to 50.
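The idea of skipping elements can be illustrated with a sketch in the style of a merge-with-skip list traversal. It assumes a fixed minimum count `t` of lists an ID must appear in; in a top-k setting that bound would presumably be derived from the current k-th best score, which is not modeled here. The function name `merge_skip` and the list contents are hypothetical. When the smallest ID appears in fewer than `t` lists, the sketch pops frontiers until it holds `t - 1` lists and then jumps each of them, via binary search, to the new heap top, since no smaller ID can reach the count `t`.

```python
import heapq
from bisect import bisect_left
from typing import List


def merge_skip(lists: List[List[int]], t: int) -> List[int]:
    """Return IDs that occur in at least t of the sorted (strictly increasing) lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top_id = heap[0][0]
        popped = []
        while heap and heap[0][0] == top_id:
            popped.append(heapq.heappop(heap))
        if len(popped) >= t:
            results.append(top_id)
            # Advance each matching list by one element.
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # top_id cannot qualify; hold t - 1 frontiers in total.
        while heap and len(popped) < t - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than t lists still have elements
        target = heap[0][0]
        # Jump every held list to its first element >= target, skipping the rest.
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return results


# Hypothetical inverted lists for the grams "ab", "bc", "cd".
lists = [[1, 2, 20, 21], [2, 3, 19, 20], [2, 4, 5, 19, 20]]
all_three = merge_skip(lists, 3)
at_least_two = merge_skip(lists, 2)
```

With `t = 3`, after ID 2 is reported the frontiers are 20, 3, and 4, and the two smaller frontiers jump directly to 20 without visiting 5 or 19.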
Potential of the Two-Phase Algorithm
Figure: Running time for 3 top-10 queries (Q1, Q2, Q3). WEB Corpus dataset with normalized edit similarity. Two-Phase algorithm with different initial thresholds. Axes: Queries; Time (ms), 0 to 40.
Optimum Initial Threshold for the Two-Phase Algorithm
Figure: Average running time for top-10 queries. WEB Corpus dataset with normalized edit similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) algorithm, and 2PH with the optimum initial threshold (2PH Opt). Axes: Dataset Size (millions), 0.4 to 2.4; Time (ms), 0 to 100.
The Iterative Range Search algorithm was too expensive to be plotted. Its average running time was around 5 s.
Summary
Approximate ranking queries in string collections
Useful when there is a mismatch between query and data
Propose three approaches to solve the problem:
1 Use existing approximate range search algorithms as a “black box”
  Proves to be the most expensive
2 Use particularities of the top-k problem
  Proves to be very efficient
3 Combine (1) and (2) sequentially
  Proves to be slightly more efficient
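Approach (1) can be pictured with the sketch below, which treats a range search as a black box and reruns it with a progressively lower similarity threshold until at least k results are found. Several things here are assumptions made for illustration only: ranking uses similarity alone (the weights used in the talk are ignored), `range_search` is a brute-force stand-in for an index-based algorithm, Jaccard over 2-grams is used as the similarity, and the threshold schedule is arbitrary.

```python
from typing import List, Sequence, Tuple


def gram_set(s: str, q: int = 2) -> set:
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: str, b: str, q: int = 2) -> float:
    ga, gb = gram_set(a, q), gram_set(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else float(a == b)


def range_search(query: str, strings: Sequence[str],
                 threshold: float) -> List[Tuple[float, str]]:
    """Brute-force stand-in for an approximate range search (hypothetical)."""
    scored = [(jaccard(query, s), s) for s in strings]
    return [(sim, s) for sim, s in scored if sim >= threshold]


def iterative_top_k(query: str, strings: Sequence[str], k: int,
                    thresholds: Sequence[float] = (0.8, 0.6, 0.4, 0.2, 0.0)):
    """Lower the threshold until the black-box search returns at least k strings."""
    matches: List[Tuple[float, str]] = []
    rounds = 0
    for threshold in thresholds:
        rounds += 1
        matches = range_search(query, strings, threshold)
        if len(matches) >= k:
            break
    # Highest similarity first; ties broken alphabetically.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return matches[:k], rounds


top, rounds = iterative_top_k("abcd", ["abcd", "abce", "xyz"], k=2)
```

Each round repeats the whole search from scratch, which gives some intuition for why this approach is reported as the most expensive of the three.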
The Flamingo Project
This work is part of The Flamingo Project at UC Irvine
http://flamingo.ics.uci.edu
A Quick Note on Related Work
Fagin et al. [1]
  similarity on multiple numerical attributes
  traverse list of IDs
  one list per attribute
  lists sorted on similarity to that attribute
  lists have different orders of IDs
  all IDs appear on all the lists
Our Setting
  similarity on one string attribute
  traverse list of IDs
  one list per q-gram
  lists sorted on global weight
  lists have the same order of IDs
  a subset of IDs appear on each list
[1] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.