Notes 06: Efficient Fuzzy Search

Professor Chen Li

Department of Computer Science

UC Irvine

CS122B: Projects in Databases and Web Applications Spring 2015

Example: a movie database

Star              Title            Year  Genre
Keanu Reeves      The Matrix       1999  Sci-Fi
Samuel Jackson    Iron Man         2008  Sci-Fi
Schwarzenegger    The Terminator   1984  Sci-Fi
Samuel Jackson    The Man          2006  Crime

Find movies starring "Schwarrzenger" (a misspelled query).

Problem definition: approximate string searches

Query q: Schwarrzenger

Collection of strings s (column Star): Schwarzenger, Samuel Jackson, Keanu Reeves

Search output: strings s that satisfy Sim(q,s) ≤ δ

Sim functions: edit distance, Jaccard coefficient, and cosine similarity

Similarity Functions

Similar to: a domain-specific function that returns a similarity value between two strings

Examples:
  Edit distance
  Hamming distance
  Jaccard similarity
  Soundex
  TF/IDF, BM25, DICE

Edit Distance

A widely used metric to define string similarity

ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2

Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
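
The definition above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch for illustration; the function name `edit_distance` is an assumption and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string costs i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as in the example above
```

The two operations in the example are one substitution (m to n) and one deletion (the final s).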

State-of-the-art: Oracle 10g and older versions

Supported by Oracle Text

  CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

  begin
    ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
  end;
  /

  CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

  SELECT * FROM engdict
  WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine

Microsoft SQL Server

Data cleaning tools available in SQL Server 2005
  Part of Integration Services
  Supports fuzzy lookups
  Uses data flow pipeline of transformations
  Similarity function: tokens with TF/IDF scores

Lucene

Using Levenshtein Distance (Edit Distance)
  Example: roam~0.8
Prefix pruning followed by a scan (Efficiency?)

Outline
  Gram-based approaches
  Trie-based approaches

String Grams

q-grams

For example, the 2-grams of "universal":

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
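
Extracting q-grams amounts to sliding a window of width q over the string. A minimal sketch, with the hypothetical helper name `qgrams`:

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, so "universal" (9 characters) yields 8.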

Inverted lists

Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  ids
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
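
The conversion from strings to gram inverted lists can be sketched as follows. The function name `build_inverted_lists` is an assumption for illustration. Ids are appended in increasing order, so every list comes out sorted in ascending order.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ascending list of ids of the strings containing it."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps each id at most once per gram, even if the gram repeats in s.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            lists[gram].append(sid)
    return dict(lists)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```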

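Main Example

Query: stick, with 2-grams (st,ti,ic,ck)

Data:

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Grams and their inverted lists:

ck  1, 3
ic  0, 1, 2, 4
st  1, 2, 3, 4
ta  4
ti  1, 2, 4
…

Merge the lists of the query's grams: for ed(s,q)≤1, the ids with count >=2 are the candidates.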
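
The count threshold in this example follows from a standard bound: a query of length n has n - q + 1 grams, and one edit operation can destroy at most q of them, so a string within edit distance k must share at least (n - q + 1) - k*q grams with the query. For "stick" with q = 2 and k = 1 this gives 4 - 2 = 2. The sketch below applies that filter and then verifies the candidates; all function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fuzzy_search(strings, query, k, q=2):
    """Return (candidate ids, answer ids) for ed(s, query) <= k."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index[g].append(sid)

    # One edit operation can destroy at most q grams of the query.
    threshold = len(grams(query, q)) - k * q
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the filter cannot prune anything
    else:
        counts = Counter()
        for g in grams(query, q):
            counts.update(index.get(g, ()))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Candidates are only a superset of the answers, so each one is verified.
    answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]
    return candidates, answers


STRINGS = ["rich", "stick", "stich", "stuck", "static"]
candidates, answers = fuzzy_search(STRINGS, "stick", k=1)
print(candidates)                        # [1, 2, 3, 4]
print([STRINGS[sid] for sid in answers])  # ['stick', 'stich', 'stuck']
```

Here "static" shares three grams with the query and survives the filter, but verification rejects it because its edit distance to "stick" exceeds 1.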

Problem definition:

Find elements whose occurrences ≥ T

Merge: each inverted list is sorted in ascending order

Example

T = 4

[Figure: inverted lists of ids in ascending order; the ids shown are 1, 3, 5, 10, 13, 10, 13, 15, 5, 7, 13, 13, 15]

Result: 13

Five Merge Algorithms

HeapMerger [Sarawagi, SIGMOD 2004]
MergeOpt [Sarawagi, SIGMOD 2004]
ScanCount
MergeSkip
DivideSkip

Heap-based Algorithm

Min-heap: push the current element of each list to the heap

Count the # of occurrences of each element by popping equal elements from the heap
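
A minimal sketch of the heap-based merge, using Python's `heapq`. The function name `heap_merge` is hypothetical, and the five lists in `LISTS` are an assumed split of the ids shown in the T = 4 example, since the figure's exact list boundaries did not survive.

```python
import heapq


def heap_merge(lists, T):
    """Return the ids that occur on at least T of the ascending lists."""
    # Each heap entry is (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every entry equal to the smallest id and advance those lists.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Assumed split of the example ids into five lists.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(LISTS, 4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.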

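MergeOpt Algorithm

Long lists: the T-1 longest lists

Short lists: the remaining lists, which are merged first

Binary search: each element from the short lists is looked up in the long lists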

Example of MergeOpt [Sarawagi et al 2004]

Count threshold T ≥ 4

Long Lists: 3
Short Lists: 2

[Figure: inverted lists with ids 1, 3, 5, 10, 13, 10, 13, 15, 5, 7, 13, 13, 15, split into 3 long lists and 2 short lists]
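
The idea behind MergeOpt is that an id occurring at least T times must occur on at least one list outside the T-1 longest ones, so only ids from the short lists need to be considered. The sketch below is a simplified reading of that idea, not the original implementation; `merge_opt` and `contains` are hypothetical names, and the five lists are an assumed split of the ids in the figure.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return the ids that occur on at least T of the ascending lists (T >= 1)."""
    if T < 1:
        raise ValueError("T must be at least 1")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    result = []
    # heapq.merge yields the short-list ids in ascending order,
    # so equal ids are adjacent and groupby can count them.
    for sid, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(contains(lst, sid) for lst in long_lists)
        if count >= T:
            result.append(sid)
    return result


# Assumed split of the example ids into five lists.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(LISTS, 4))  # [13]
```

With T = 4 the three longest lists are the long lists and the two single-element lists are the short lists, matching the 3 and 2 in the example.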

ScanCount Example

Count threshold T ≥ 4

[Figure: the inverted lists (ids 1, 3, 5, 10, 13, 10, 13, 15, 5, 7, 13, 13, 15) are scanned one id at a time; each id increments its entry in an array of # of occurrences, indexed by string id, by 1]

Result: string id 13, whose count reaches 4
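
ScanCount can be sketched with a plain array of counters indexed by string id. The name `scan_count`, the parameter `num_strings`, and the split of the example ids into five lists are assumptions for illustration.

```python
def scan_count(lists, T, num_strings):
    """Return (ids occurring on at least T lists, array of occurrence counts)."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1       # increment by 1
            if counts[sid] == T:   # report each id once, when it first reaches T
                result.append(sid)
    return sorted(result), counts


# Assumed split of the example ids into five lists.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
result, counts = scan_count(LISTS, T=4, num_strings=16)
print(result)         # [13]
print(counts[13:16])  # [4, 0, 2]
```

The algorithm needs no heap and no sorted order, at the price of one counter per string in the collection.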

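MergeSkip algorithm

Min-heap over the current element of each list

Pop T-1 elements from the heap

Jump: in each popped list, move to the first element greater than or equal to the new top of the heap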

Example of MergeSkip

Count threshold T ≥ 4

[Figure: a min-heap over the current ids of the lists; the popped lists jump forward past the skipped ids]
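
The pop-and-jump step can be sketched as below. This is one reading of the slides rather than the authors' code: when the smallest id occurs fewer than T times, T-1 heap entries are popped in total and each popped list jumps, by binary search, to its first id that is greater than or equal to the new heap top. The name `merge_skip` and the example lists are assumptions.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the ids that occur on at least T of the ascending lists (T >= 1)."""
    # Each heap entry is (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists cannot produce an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Too few occurrences: pop more entries until T-1 are out in total.
        # The heap held at least T entries, so at least one remains.
        while len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        target = heap[0][0]
        for _, i, pos in popped:
            # Jump to the first id >= target, skipping everything in between.
            nxt = bisect_left(lists[i], target, pos)
            if nxt < len(lists[i]):
                heapq.heappush(heap, (lists[i][nxt], i, nxt))
    return result


# Assumed example lists, one plausible split of the ids used in the T = 4 example.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(LISTS, 4))  # [13]
```

On these lists the first jump moves three lists directly to 13, so ids 3, 7 and the second occurrences of 5 and 10 are never pushed to the heap.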

Skip is safe

# of occurrences of skipped elements ≤ T-1, so a skipped element cannot be an answer

DivideSkip Algorithm

Long lists: binary search

Short lists: MergeSkip
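
DivideSkip combines the two previous ideas: run MergeSkip on the short lists with a reduced threshold, then complete each count by binary search in the long lists. The sketch below is an illustration under assumptions: `divide_skip`, `merge_skip_counts`, and the parameter `num_long` (the number of long lists, which must be smaller than T) are hypothetical names, and the example lists are an assumed split of the ids used earlier.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip that returns (id, count) for ids on at least T lists (T >= 1)."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append((top, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        target = heap[0][0]
        for _, i, pos in popped:
            nxt = bisect_left(lists[i], target, pos)  # jump to first id >= target
            if nxt < len(lists[i]):
                heapq.heappush(heap, (lists[i][nxt], i, nxt))
    return result


def divide_skip(lists, T, num_long):
    """Return the ids on at least T lists, treating the num_long longest as long lists."""
    if not 0 <= num_long < T:
        raise ValueError("num_long must satisfy 0 <= num_long < T")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]
    result = []
    # An answer must occur at least T - num_long times on the short lists.
    for sid, count in merge_skip_counts(short_lists, T - num_long):
        for lst in long_lists:
            i = bisect_left(lst, sid)  # binary search in a long list
            if i < len(lst) and lst[i] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(LISTS, T=4, num_long=2))  # [13]
```

The answer does not depend on `num_long`; only the running time does, since more long lists mean fewer merged elements but more lookups per candidate.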

How many lists are treated as long lists?

Short lists: merge

Long lists: lookup

Performance (DBLP)

DivideSkip performs best of the five merge algorithms

Trie-Based Approach

Trie Indexing

Strings:
  exam
  example
  exemplar
  exempt
  sample

[Figure: trie over the strings above; each path from the root to a "$" node spells one string]
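
A trie over these strings can be sketched with nested dictionaries of children. The class and function names are assumptions for illustration; the `is_end` flag plays the role of the "$" marker in the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True if a string ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def contains(root, word):
    """Exact lookup: follow one child per character."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(contains(root, "exempt"), contains(root, "exa"))  # True False
print(count_nodes(root))                                # 21
```

Shared prefixes are stored once: the five strings have 31 characters in total but need only 20 character nodes plus the root.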

Active nodes on Trie

Query: “example”

Edit-distance threshold = 2

Prefix     Distance
examp      2
exampl     1
example    0
exempl     2
exempla    2
sample     2

[Figure: the trie with each active node labeled by its distance]

Initialization

Q = ε

Prefix  Distance
ε       0
e       1
ex      2
s       1
sa      2

Initial active nodes: all nodes within depth δ

Incremental Algorithm

Return leaf nodes as answers.
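
The incremental computation can be sketched as follows: the active nodes for the empty query are all nodes within depth δ, and each typed character derives the next active set from the previous one. This is a simplified illustration and not the authors' exact algorithm; it relaxes every transition (deletion, match or substitution, insertion) and keeps the minimum distance per node. All names are hypothetical.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix
        self.children = {}
        self.is_end = False


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
        node.is_end = True
    return root


def relax(active, node, dist, delta):
    """Record node at distance dist, then its descendants via insertions."""
    if dist > delta or active.get(node, delta + 1) <= dist:
        return
    active[node] = dist
    for child in node.children.values():
        relax(active, child, dist + 1, delta)


def initial_active_nodes(root, delta):
    active = {}
    relax(active, root, 0, delta)  # all nodes within depth delta
    return active


def extend(active, ch, delta):
    """Active nodes after the user types one more character ch."""
    new_active = {}
    for node, dist in active.items():
        relax(new_active, node, dist + 1, delta)  # ch is deleted from the query
        for label, child in node.children.items():
            # Match costs nothing, substitution costs 1.
            relax(new_active, child, dist + (label != ch), delta)
    return new_active


def active_nodes(root, query, delta):
    active = initial_active_nodes(root, delta)
    for ch in query:
        active = extend(active, ch, delta)
    return active


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
final = active_nodes(root, "example", delta=2)
print(sorted((n.prefix, d) for n, d in final.items()))
print(sorted(n.prefix for n in final if n.is_end))  # ['example', 'sample']
```

Because `extend` only needs the previous active set, the search can be updated after every keystroke instead of restarting from the root.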

Good and bad

Advantages:
  Trie size is small
  Can do search as the user types

Disadvantages:
  Works for edit distance only

References

1. Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu, ICDE 2008

2. Efficient Interactive Fuzzy Keyword Search, Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng, WWW 2009