Chen Li (李晨)

Search As You Type

Joint work with colleagues at UCI and Tsinghua.

http://www.uci.edu/

Demos

http://www.cs.stanford.edu/ “Search” box: try “garcia molina”, try “garcia monila”

http://directory.uci.edu/: try “venkatasubramanian”

http://psearch.ics.uci.edu/

http://fr.ics.uci.edu/haiti/

http://www.miamiherald.com/news/americas/haiti/connect/

http://ipubmed.ics.uci.edu/

Traditional Keyword Search

Too many results!

No result!

Complicated and still no result!

Interactive Fuzzy Keyword Search

What’s new?

Search on apple.com

Query: “itune”

Missing result!

Query: “itunes music”

Challenge: performance!

< 100 ms: server processing, network, JavaScript, etc.

Requirement for high query throughput:
  20 queries per second (QPS): 50 ms/query (at most)
  100 QPS: 10 ms/query

Other challenges: ranking, space requirements, …

Two Features (Focus of this talk)

Fuzzy search: finding results with approximate keywords

Full-text search: finding results that contain the query keywords (not necessarily adjacently)


Edit Distance

ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2

s1: v e n k a t s u b r a m a n i a n

s2: w e n k a t s u b r a m a n i a n

ed(s1, s2) = 1
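The definition of ed(s1, s2) can be checked with the standard dynamic-programming recurrence. The sketch below is a minimal illustration, not code from the talk; the function name edit_distance is an assumption made here.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed part of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# The pair from the slide differs only in its first letter.
print(edit_distance("venkatsubramanian", "wenkatsubramanian"))  # prints 1
```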

Problem Setting

Data
  R: a set of records
  W: a set of distinct words

Query
  Q = {p1, p2, …, pl}: a set of prefixes
  δ: edit-distance threshold

Query result
  RQ: a set of records such that each record has all query prefixes or their similar forms

Feature 1: Fuzzy Search

Formulation

Record strings: venkatasubramanian, carey, jain, nicolau, smith

Query: wenkatsubra

Find strings with a prefix similar to a query keyword.

Do it incrementally!

Observation

Strings = {exam, example, exemplar, exempt, sample}

Edit-distance threshold δ = 2

Similar prefixes for Q’ = exampl

Prefix    Distance
exam      2
examp     1
exampl    0
example   1
exemp     2
exempt    2
exempl    1
exempla   2
sampl     2

Similar prefixes for Q = example

Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

Operations that take entries for Q’ to entries for Q (figure annotations): delete e, match e, replace e with a

Q’ = exampl, Q = example
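The two prefix tables for exampl and example can be reproduced by brute force: enumerate every prefix of every string and keep those within δ of the query. This is only a check of the observation, not the incremental method of the talk; the names edit_distance and similar_prefixes are assumptions made here.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar_prefixes(strings, query, delta):
    """Map every prefix of the strings within distance delta of query to its distance."""
    prefixes = {s[:k] for s in strings for k in range(len(s) + 1)}
    result = {}
    for p in prefixes:
        d = edit_distance(p, query)
        if d <= delta:
            result[p] = d
    return result


STRINGS = ["exam", "example", "exemplar", "exempt", "sample"]
for q in ("exampl", "example"):
    print(q, sorted(similar_prefixes(STRINGS, q, 2).items()))
```

Recomputing every prefix for every keystroke is what the incremental approach avoids: most entries for example are one operation away from entries for exampl.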

Trie Indexing

Computing the set of active nodes ΦQ
  Initialization
  Incremental step

[Figure: trie over the strings exam, example, exemplar, exempt, and sample; $ marks the end of a string.]

Active nodes for Q = example

Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

Initialization: Q = ε

[Figure: the same trie, with each active node annotated by its distance.]

Initializing Φε with all nodes within a depth of δ

Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2
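A minimal sketch of the initialization step, assuming a dictionary-based trie and using prefix strings to identify nodes (both are choices made here for illustration, not details from the talk). Every node at depth at most δ is active for the empty query, with its depth as its distance.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # plays the role of the $ marker


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def initial_active_nodes(root, delta):
    """Return {prefix: depth} for every trie node at depth <= delta."""
    active = {"": 0}
    frontier = [("", root)]
    for depth in range(1, delta + 1):
        next_frontier = []
        for prefix, node in frontier:
            for ch, child in node.children.items():
                active[prefix + ch] = depth
                next_frontier.append((prefix + ch, child))
        frontier = next_frontier
    return active


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(initial_active_nodes(root, 2).items()))
# prints [('', 0), ('e', 1), ('ex', 2), ('s', 1), ('sa', 2)]
```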

Incremental Algorithm: Overview

Access the leaf nodes of the active nodes as answers.

Incremental Computation: Example Q = e

[Figure: the same trie, with each active node annotated by its distance.]

Active nodes for Q = ε

Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2

Candidate entries for Q = e, derived from each active node (base) of Q = ε

Prefix    # Op    Base    Op
ε         1       ε       del e
s         1       ε       sub e/s
e         0       ε       mat e
ex        1       ε       ins x
exa       2       ε       ins xa
exe       2       ε       ins xe
e         2       e       del e
ex        2       e       sub e/x
ex        3       ex      del e
exa       3       ex      sub e/a
exe       2       ex      mat e
s         2       s       del e
sa        2       s       sub e/a
sa        3       sa      del e

Active nodes for Q = e (smallest # Op per prefix, at most δ = 2)

Prefix    Distance
ε         1
e         0
ex        1
exa       2
exe       2
s         1
sa        2

Incremental Computation: Algorithm

Incremental computation from ΦQ’ to ΦQ

add(ΦQ, <n, d>) has effect only if there exists no active node in ΦQ with the same n and smaller d

FOR EACH <n, d> FROM ΦQ’
    Deletion:
        add(ΦQ, <n, d+1>)
    Substitution:
        FOR EACH n’ FROM non-matching children of n
            add(ΦQ, <n’, d+1>)
    Match:
        add(ΦQ, <m, d>)    (m is the matching child of n)
        Insertion:
            FOR EACH m’ FROM descendants of m
                add(ΦQ, <m’, d+x>)    (x is the distance from m’ to m)
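The pseudocode above can be sketched in Python as follows. The class and function names (TrieNode, add, descendants, step) are assumptions made for this illustration, and entries with a distance above δ are dropped, which the pseudocode leaves implicit.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix  # kept only to make results readable
        self.children = {}    # character -> TrieNode


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
    return root


def add(active, node, d, delta):
    """Record <node, d> unless d exceeds delta or a smaller-or-equal distance exists."""
    if d <= delta and d < active.get(node, delta + 1):
        active[node] = d


def descendants(node, max_depth):
    """Yield (descendant, depth below node) for depths 1..max_depth."""
    stack = [(node, 0)]
    while stack:
        cur, depth = stack.pop()
        if depth >= max_depth:
            continue
        for child in cur.children.values():
            yield child, depth + 1
            stack.append((child, depth + 1))


def initial_active_nodes(root, delta):
    active = {root: 0}
    for node, depth in descendants(root, delta):
        active[node] = depth
    return active


def step(prev_active, ch, delta):
    """Compute the active nodes of Q = Q' + ch from the active nodes of Q'."""
    active = {}
    for n, d in prev_active.items():
        add(active, n, d + 1, delta)                # deletion of ch
        for c, child in n.children.items():
            if c != ch:
                add(active, child, d + 1, delta)    # substitution
        m = n.children.get(ch)
        if m is not None:
            add(active, m, d, delta)                # match
            for m2, x in descendants(m, delta - d):
                add(active, m2, d + x, delta)       # insertion
    return active


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
active = initial_active_nodes(root, 2)
for ch in "example":
    active = step(active, ch, 2)
print(sorted((n.prefix, d) for n, d in active.items()))
```

Each keystroke touches only the previous active nodes and their nearby children, so the cost per character depends on the size of the active set rather than on the size of the trie.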

Algorithm Details

Feature 2: Full-text Search

Find answers with query keywords

Not necessarily adjacently

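Multi-Prefix Intersection

Q = vldb li

ID    Record
1     Li data…
2     data…
3     data Lin…
4     Lu Lin Luis…
5     Liu…
6     VLDB Lin data…
7     VLDB…
8     Li VLDB…

Answers

6     VLDB Lin data…
8     Li VLDB…

[Figure: trie over the keywords data, li, lin, liu, lu, luis, and vldb, with an inverted list at each leaf.]

Keyword    Inverted list
data       1 2 3 6
li         1 8
lin        3 4 6
liu        5
lu         4
luis       4
vldb       6 7 8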

Multi-Prefix Intersection: Method 1

Q = vldb li

[Figure: the same records and trie with inverted lists.]

Union of the inverted lists under each prefix

li      1 3 4 5 6 8
vldb    6 7 8

Intersection: 6 8

Space cost: inverted index

Time cost: union + intersection

More efficient intersection approaches…
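Method 1 can be sketched as a union per prefix followed by an intersection. The sketch below is an illustration under assumptions: it uses only the keywords visible in the record table (the elided text after each "…" is ignored), and it scans a dictionary of keywords instead of walking the trie. The names RECORDS, build_inverted_index, prefix_union, and search are made up here.

```python
RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_inverted_index(records):
    """Map each lowercased keyword to the set of record IDs containing it."""
    index = {}
    for rid, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(rid)
    return index


def prefix_union(index, prefix):
    """Union of the inverted lists of all keywords starting with prefix."""
    result = set()
    for word, ids in index.items():
        if word.startswith(prefix):
            result |= ids
    return result


def search(index, prefixes):
    """Records that contain, for every prefix, some keyword starting with it."""
    lists = [prefix_union(index, p) for p in prefixes]
    return sorted(set.intersection(*lists)) if lists else []


index = build_inverted_index(RECORDS)
print(sorted(prefix_union(index, "li")))  # prints [1, 3, 4, 5, 6, 8]
print(search(index, ["vldb", "li"]))      # prints [6, 8]
```

The union for a short prefix such as li can be large, which is the time cost the slide attributes to this method.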

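Multi-Prefix Intersection: Method 2

Q = vldb li

Keyword IDs (leaf order in the trie): data = 1, li = 2, lin = 3, liu = 4, lu = 5, luis = 6, vldb = 7

ID    Record             Forward list
1     Li data…           1 2
2     data…              1
3     data Lin…          1 3
4     Lu Lin Luis…       3 5 6
5     Liu…               4
6     VLDB Lin data…     1 3 7
7     VLDB…              7
8     Li VLDB…           2 7

[Figure: the same trie, with each node labeled by its range of keyword IDs and each leaf carrying its inverted list.]

Trie node            Keyword-ID range
root                 [1, 7]
d, da, dat, data     [1, 1]
l                    [2, 6]
li                   [2, 4]
lin                  [3, 3]
liu                  [4, 4]
lu                   [5, 6]
lui, luis            [6, 6]
v, vl, vld, vldb     [7, 7]

Read each record on the inverted list of vldb (6 7 8), then verify it by probing its forward list for a keyword ID in [2, 4], the range of li

6     VLDB Lin data…     1 3 7
8     Li VLDB…           2 7

Space cost: inverted + forward index

Time cost: probing forward lists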
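The read-and-probe idea of Method 2 can be sketched as below. This is an illustration under assumptions: keyword IDs are assigned in sorted order (which reproduces data = 1 through vldb = 7), only the visible keywords of each record are indexed, the first query term is treated as a complete keyword for simplicity, and a sorted keyword list stands in for the trie when computing ID ranges. All names are made up here.

```python
from bisect import bisect_left

RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_indexes(records):
    """Return (sorted keywords, inverted lists, forward lists of keyword IDs)."""
    words = sorted({w for text in records.values() for w in text.lower().split()})
    word_id = {w: i + 1 for i, w in enumerate(words)}
    inverted = {w: [] for w in words}
    forward = {}
    for rid in sorted(records):
        tokens = set(records[rid].lower().split())
        forward[rid] = sorted(word_id[w] for w in tokens)
        for w in tokens:
            inverted[w].append(rid)
    return words, inverted, forward


def prefix_range(words, prefix):
    """Keyword-ID range [lo, hi] of the keywords starting with prefix, or None."""
    lo = bisect_left(words, prefix)
    hi = bisect_left(words, prefix + "\uffff")
    return (lo + 1, hi) if lo < hi else None


def has_id_in_range(sorted_ids, lo, hi):
    """Probe a forward list with binary search for an ID within [lo, hi]."""
    i = bisect_left(sorted_ids, lo)
    return i < len(sorted_ids) and sorted_ids[i] <= hi


def search(records, keyword, prefix):
    words, inverted, forward = build_indexes(records)
    rng = prefix_range(words, prefix)
    if rng is None or keyword not in inverted:
        return []
    lo, hi = rng
    # Read each record on the keyword's inverted list and verify it.
    return [rid for rid in inverted[keyword] if has_id_in_range(forward[rid], lo, hi)]


print(search(RECORDS, "vldb", "li"))  # prints [6, 8]
```

Only the short list of vldb is traversed; the longer union for li is never materialized, at the price of storing forward lists.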

Traversing inverted lists incrementally

Compute and cache only needed answers

For subsequent queries, compute the answers:
  from the cached answers
  from resuming previously terminated computation

[Figure: for Q = cs co, the traversal list is the inverted list of cs, and computing along it yields the cached answers of cs co. For Q = cs conf, the cached answers of cs co are verified and the computation continues, yielding the cached answers of cs conf.]

Experimental Results

Computing similar prefixes

Multi-prefix intersection

Time scalability

Index scalability

Conclusions

New data-access paradigm: search as you type

Many interesting and challenging problems.

http://tastier.ics.uci.edu/
