Chen Li (李晨)


Upload: niabi

Post on 24-Feb-2016



TRANSCRIPT

Page 1: Chen Li ( 李晨 )

Chen Li (李晨)

Search As You Type

Joint work with colleagues at UCI and Tsinghua.

Page 2: Chen Li ( 李晨 )

Demos

http://www.cs.stanford.edu/ "Search" box: try "garcia molina", then try "garcia monila"

http://directory.uci.edu/: try "venkatasubramanian"

http://psearch.ics.uci.edu/

http://fr.ics.uci.edu/haiti/

http://www.miamiherald.com/news/americas/haiti/connect/

http://ipubmed.ics.uci.edu/

Page 3: Chen Li ( 李晨 )

Traditional Keyword Search

Too many results!

No result!

Complicated, and still no result!

Page 4: Chen Li ( 李晨 )

Interactive Fuzzy Keyword Search

Page 5: Chen Li ( 李晨 )

What's new?

Search on apple.com

Query: "itune"

Missing result!

Query: "itunes music"

Page 6: Chen Li ( 李晨 )

Challenge: performance!

< 100 ms: server processing, network, JavaScript, etc.

Requirement for high query throughput:
20 queries per second (QPS): at most 50 ms/query
100 QPS: at most 10 ms/query

Other challenges: ranking, space requirements, …
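The per-query budgets on this page follow from dividing one second by the target throughput. The sketch below assumes (our simplification, not stated on the slide) a single worker that processes queries one at a time; the function name is hypothetical.

```python
def max_ms_per_query(qps: float) -> float:
    """Upper bound on per-query time for one worker handling queries serially."""
    return 1000.0 / qps


for qps in (20, 100):
    print(f"{qps} QPS: at most {max_ms_per_query(qps):.0f} ms/query")
```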

Page 7: Chen Li ( 李晨 )

Two Features (Focus of this talk)

Fuzzy search: finding results with approximate keywords

Full-text: finding results with query keywords (not necessarily adjacently)

Page 8: Chen Li ( 李晨 )

Edit Distance

ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2

s1: v e n k a t s u b r a m a n i a n

s2: w e n k a t s u b r a m a n i a n

ed(s1, s2) = 1
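The definition above can be evaluated with the standard dynamic program over the two strings. This is a minimal sketch of that textbook method, not code from the talk; the function name is our own.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the first i-1 characters of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(
                prev[j] + 1,               # delete c1
                cur[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


print(edit_distance("venkatsubramanian", "wenkatsubramanian"))
```

The call uses the two strings from the slide and yields the distance of 1 shown there (one substitution, v to w).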

Page 9: Chen Li ( 李晨 )

Problem Setting

Data
R: a set of records
W: a set of distinct words

Query
Q = {p1, p2, …, pl}: a set of prefixes
δ: edit-distance threshold

Query result
RQ: the set of records such that each record has all query prefixes or their similar forms
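To make the problem setting concrete, here is a brute-force reference that scans every record. It is only a statement of the semantics under our reading of "similar forms" (a query prefix p is satisfied by a word w if some prefix of w is within edit distance δ of p); the record contents and function names are hypothetical, and the index-based methods in the rest of the talk exist precisely to avoid this scan.

```python
def prefix_distance(word: str, p: str) -> int:
    """Smallest edit distance between p and any prefix of word."""
    # row[j] = ed(p[:i], word[:j]); after the last row, take the best column.
    row = list(range(len(word) + 1))
    for i, pc in enumerate(p, start=1):
        new = [i]
        for j, wc in enumerate(word, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (pc != wc)))
        row = new
    return min(row)


def query_result(records: dict, query: list, delta: int) -> list:
    """IDs of records in which every query prefix has a similar word prefix."""
    result = []
    for rid, text in records.items():
        words = text.lower().split()
        if all(any(prefix_distance(w, p) <= delta for w in words) for p in query):
            result.append(rid)
    return result


# Hypothetical records, for illustration only.
R = {1: "li data", 6: "vldb lin data", 7: "vldb", 8: "li vldb"}
print(query_result(R, ["vldb", "li"], delta=0))
print(query_result(R, ["vldv"], delta=1))
```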

Page 10: Chen Li ( 李晨 )

Feature 1: Fuzzy Search

Page 11: Chen Li ( 李晨 )

Formulation

Find strings with a prefix similar to a query keyword. Do it incrementally!

Record strings: venkatasubramanian, carey, jain, nicolau, smith

Query: wenkatsubra

Page 12: Chen Li ( 李晨 )

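Observation

Strings = {exam, example, exemplar, exempt, sample}
Edit-distance threshold δ = 2

Q' = exampl

Prefix    Distance
exam      2
examp     1
exampl    0
example   1
exemp     2
exempt    2
exempl    1
exempla   2
sampl     2

Q = example

Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

Transitions from Q' to Q (appending e):
examp (1) → examp (2): delete e
exampl (0) → exampl (1): delete e
exampl (0) → example (0): match e
exempl (1) → exempl (2): delete e
exempl (1) → exempla (2): replace e with a
sampl (2) → sample (2): match e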

Page 13: Chen Li ( 李晨 )

Trie Indexing

Computing set of active nodes ΦQ
Initialization
Incremental step

[Figure: trie over the strings exam, example, exemplar, exempt, sample, with $ marking the end of each string; each active node is annotated with its distance.]

Active nodes for Q = example

Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2
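The trie in the figure can be sketched with nested dictionaries, using "$" as the end-of-string marker as the slide does. This is an illustrative layout rather than the authors' data structure; `build_trie`, `find`, and `words_below` are hypothetical helper names.

```python
END = "$"  # end-of-string marker, as in the figure


def build_trie(words):
    """Nested-dict trie: each key is one character, END marks a complete string."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def find(root, prefix):
    """Return the trie node reached by following prefix, or None if absent."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return None
    return node


def words_below(node, prefix=""):
    """All complete strings in the subtree of node, i.e. its leaf nodes."""
    out = []
    for ch, child in sorted(node.items()):
        if ch == END:
            out.append(prefix)
        else:
            out.extend(words_below(child, prefix + ch))
    return out


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(words_below(find(trie, "exem"), "exem"))
```

Every trie node corresponds to one prefix, so a set of active nodes is the same thing as the prefix/distance table above, and the strings below an active node are the candidate completions.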

Page 14: Chen Li ( 李晨 )

Initialization: Q = ε

Initializing Φε with all nodes within a depth of δ

[Figure: the same trie, with the nodes at depth 0, 1, and 2 annotated with distances 0, 1, and 2.]

Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2

Page 15: Chen Li ( 李晨 )

Incremental Algorithm: Overview

Access their leaf nodes as answers.

Page 16: Chen Li ( 李晨 )

Incremental Computation: Example

Q = e

Active nodes for Q = ε

Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2

Candidates derived from each active node of ε (# Op is the resulting distance; Base is the active node it is derived from)

Prefix    # Op    Base    Op
ε         1       ε       del e
s         1       ε       sub e/s
e         0       ε       mat e
ex        1       ε       ins x
exa       2       ε       ins xa
exe       2       ε       ins xe
e         2       e       del e
ex        2       e       sub e/x
ex        3       ex      del e
exa       3       ex      sub e/a
exe       2       ex      mat e
s         2       s       del e
sa        2       s       sub e/a
sa        3       sa      del e

Active nodes for Q = e (for each prefix, keep the smallest # Op that is at most δ = 2)

Prefix    Distance
ε         1
e         0
ex        1
exa       2
exe       2
s         1
sa        2

Page 17: Chen Li ( 李晨 )

Incremental Computation: Algorithm

Algorithm Details

Incremental computation from ΦQ' to ΦQ

add(ΦQ, <n, d>) has effect only if there exists no active node in ΦQ with the same n and smaller d

FOR EACH <n, d> FROM ΦQ'
    Deletion: add(ΦQ, <n, d+1>)
    Substitution: FOR EACH n' FROM non-matching children of n
        add(ΦQ, <n', d+1>)
    Match: add(ΦQ, <m, d>) (m is the matching child of n)
    Insertion: FOR EACH m' FROM descendants of m
        add(ΦQ, <m', d+x>) (x is the distance from m' to m)
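The initialization and incremental step can be sketched as follows. This is our own minimal rendering of the pseudocode, not the authors' implementation; it additionally assumes that `add` discards any candidate whose distance exceeds δ, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self, prefix=""):
        self.prefix = prefix   # string spelled by the path from the root
        self.children = {}     # character -> Node


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node(node.prefix + ch)
            node = node.children[ch]
    return root


def descendants(node, max_depth):
    """Yield (descendant, depth below node) for depths 1..max_depth."""
    stack = [(node, 0)]
    while stack:
        cur, depth = stack.pop()
        if depth > 0:
            yield cur, depth
        if depth < max_depth:
            stack.extend((c, depth + 1) for c in cur.children.values())


def add(active, node, d, delta):
    """Keep <node, d> only if d <= delta and no smaller distance is recorded."""
    if d <= delta and d < active.get(node, delta + 1):
        active[node] = d


def initial_active(root, delta):
    """Active nodes for the empty query: all nodes within depth delta."""
    active = {root: 0}
    for node, depth in descendants(root, delta):
        active[node] = depth
    return active


def step(prev, ch, delta):
    """Active nodes for Q = Q' + ch, computed from the active nodes of Q'."""
    cur = {}
    for n, d in prev.items():
        add(cur, n, d + 1, delta)                      # deletion of ch
        for c, child in n.children.items():
            if c != ch:
                add(cur, child, d + 1, delta)          # substitution
            else:
                add(cur, child, d, delta)              # match
                for m2, x in descendants(child, delta - d):
                    add(cur, m2, d + x, delta)         # insertion below the match
    return cur


def as_table(active):
    return {node.prefix: d for node, d in active.items()}


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
active = initial_active(root, 2)
for ch in "example":
    active = step(active, ch, 2)
print(sorted(as_table(active).items()))
```

Run on the five strings from the Observation page with δ = 2, the loop reproduces the prefix/distance table shown for Q = example. Each keystroke touches only the previous active nodes and their nearby descendants, which is what makes the computation incremental.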

Page 18: Chen Li ( 李晨 )

Feature 2: Full-text search

Find answers with query keywords
Not necessarily adjacently

Page 19: Chen Li ( 李晨 )

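Multi-Prefix Intersection

Q = vldb li

ID    Record
1     Li data…
2     data…
3     data Lin…
4     Lu Lin Luis…
5     Liu…
6     VLDB Lin data…
7     VLDB…
8     Li VLDB…

Answers:
6     VLDB Lin data…
8     Li VLDB…

[Figure: trie over the words data, li, lin, liu, lu, luis, vldb, with an inverted list of record IDs at each word: data {1, 2, 3, 6}, li {1, 8}, lin {3, 4, 6}, liu {5}, lu {4}, luis {4}, vldb {6, 7, 8}.]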

Page 20: Chen Li ( 李晨 )

Multi-Prefix Intersection: Method 1

Q = vldb li

Records and trie with inverted lists: same as on Page 19.

Union of the inverted lists under each prefix:
li:     1 3 4 5 6 8
vldb:   6 7 8

Intersection: 6 8

Space cost: inverted index
Time cost: union + intersection

More efficient intersection approaches…
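Method 1 can be sketched as a union per prefix followed by an intersection. In this illustration the records contain only the words visible on the slide (the elided "…" parts are ignored), and scanning the whole vocabulary stands in for walking the trie subtree of each prefix; all names are hypothetical.

```python
RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_inverted_index(records):
    """Map each distinct word to the set of record IDs containing it."""
    index = {}
    for rid, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(rid)
    return index


def prefix_union(index, prefix):
    """Union of the inverted lists of all words starting with prefix."""
    ids = set()
    for word, posting in index.items():
        if word.startswith(prefix):
            ids |= posting
    return ids


def search(index, prefixes):
    """Intersect the per-prefix unions, smallest first (needs >= 1 prefix)."""
    unions = sorted((prefix_union(index, p) for p in prefixes), key=len)
    result = unions[0]
    for u in unions[1:]:
        result = result & u
    return sorted(result)


INDEX = build_inverted_index(RECORDS)
print(sorted(prefix_union(INDEX, "li")))
print(search(INDEX, ["vldb", "li"]))
```

The cost is dominated by materializing the union for a short prefix such as "li", which is the motivation for the more efficient approaches the slide alludes to.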

Page 21: Chen Li ( 李晨 )

Multi-Prefix Intersection: Method 2

ID    Record            Forward list
1     Li data…          1 2
2     data…             1
3     data Lin…         1 3
4     Lu Lin Luis…      3 5 6
5     Liu…              4
6     VLDB Lin data…    1 3 7
7     VLDB…             7
8     Li VLDB…          2 7

Word IDs (trie leaves, in alphabetical order): data 1, li 2, lin 3, liu 4, lu 5, luis 6, vldb 7

[Figure: the same trie, each node annotated with the range of word IDs in its subtree: root [1, 7]; d, da, dat, data [1, 1]; l [2, 6]; li [2, 4]; lin [3, 3]; liu [4, 4]; lu [5, 6]; lui, luis [6, 6]; v, vl, vld, vldb [7, 7].]

Q = vldb li

Read each record on the inverted list of vldb (6, 7, 8), then verify/probe its forward list for a word ID in [2, 4], the range of li:
6     VLDB Lin data…    1 3 7
8     Li VLDB…          2 7

Space cost: inverted + forward index
Time cost: probing forward lists
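Method 2 can be sketched with word IDs assigned in alphabetical order, so that every prefix corresponds to a contiguous ID range and a forward list can be probed by binary search. The records contain only the words visible on the slide, a sorted vocabulary replaces the trie, and the first query prefix is always the one traversed; these are our simplifications, and all names are hypothetical.

```python
from bisect import bisect_left

RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_indexes(records):
    """Return (sorted vocabulary, inverted lists by word ID, forward lists)."""
    vocab = sorted({w for t in records.values() for w in t.lower().split()})
    word_id = {w: i for i, w in enumerate(vocab, start=1)}
    inverted, forward = {}, {}
    for rid, text in records.items():
        ids = sorted({word_id[w] for w in text.lower().split()})
        forward[rid] = ids
        for i in ids:
            inverted.setdefault(i, []).append(rid)
    return vocab, inverted, forward


def id_range(vocab, prefix):
    """1-based inclusive range of word IDs starting with prefix (empty if lo > hi)."""
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\uffff")
    return lo + 1, hi


def has_id_in_range(forward_list, lo, hi):
    """Probe a sorted forward list for any word ID in [lo, hi]."""
    pos = bisect_left(forward_list, lo)
    return pos < len(forward_list) and forward_list[pos] <= hi


def search(vocab, inverted, forward, prefixes):
    """Traverse the lists of the first prefix; probe forward lists for the rest."""
    lo, hi = id_range(vocab, prefixes[0])
    candidates = sorted({r for w in range(lo, hi + 1) for r in inverted[w]})
    ranges = [id_range(vocab, p) for p in prefixes[1:]]
    return [r for r in candidates
            if all(has_id_in_range(forward[r], a, b) for a, b in ranges)]


VOCAB, INVERTED, FORWARD = build_indexes(RECORDS)
print(id_range(VOCAB, "li"))
print(search(VOCAB, INVERTED, FORWARD, ["vldb", "li"]))
```

Compared with Method 1, no union is built for "li": each candidate from the vldb list costs one binary search in its own forward list.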

Page 22: Chen Li ( 李晨 )

Traversing inverted lists incrementally

Compute and cache only needed answers

For subsequent queries, compute the answers:
from the cached answers
from resuming previously terminated computation

[Figure: for Q = cs co, the traversal list (the inverted list of cs) is scanned to compute the cached answers of cs co; for Q = cs conf, the cached answers are verified and further answers are computed, giving the cached answers of cs conf.]
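One way to read this page is as a session that remembers the previous query, its answers, and how far the traversal list was scanned. The sketch below is our interpretation under stated assumptions: exact prefix matching only, a fixed number k of needed answers, and hypothetical records; the class and method names are not from the talk.

```python
class IncrementalSession:
    def __init__(self, records, k):
        self.records = {rid: t.lower().split() for rid, t in records.items()}
        self.k = k            # number of answers needed per query
        self.prev = None      # previous query (list of prefixes)
        self.cached = []      # answers of the previous query
        self.traversal = []   # record IDs matching the first keyword
        self.pos = 0          # how far the traversal list has been scanned

    def _matches(self, rid, prefixes):
        words = self.records[rid]
        return all(any(w.startswith(p) for w in words) for p in prefixes)

    def _extends(self, query):
        """True if query only lengthens the last keyword of the previous query."""
        p = self.prev
        return (p is not None and len(p) == len(query)
                and p[:-1] == query[:-1] and query[-1].startswith(p[-1]))

    def search(self, query):
        if self._extends(query):
            # Verify cached answers; records scanned earlier that failed the
            # shorter query cannot match the longer one, so they are skipped.
            answers = [r for r in self.cached if self._matches(r, query)]
        else:
            answers = []
            self.traversal = [r for r in sorted(self.records)
                              if self._matches(r, query[:1])]
            self.pos = 0
        # Resume the previously terminated traversal until k answers are found.
        while len(answers) < self.k and self.pos < len(self.traversal):
            rid = self.traversal[self.pos]
            self.pos += 1
            if self._matches(rid, query):
                answers.append(rid)
        self.prev, self.cached = list(query), answers
        return answers


# Hypothetical records, for illustration only.
DOCS = {1: "cs conference", 2: "cs course", 3: "cs colloquium",
        4: "cs conf deadline", 5: "ee conference"}
session = IncrementalSession(DOCS, k=2)
print(session.search(["cs", "co"]))
print(session.search(["cs", "conf"]))
```

The second call never rescans records 1 and 2: record 1 is verified from the cache, record 2 is dropped, and the traversal resumes at record 3.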

Page 23: Chen Li ( 李晨 )

Experimental Results

Computing similar prefixes

Page 24: Chen Li ( 李晨 )

Multi-prefix intersection

Page 25: Chen Li ( 李晨 )

Time Scalability

Page 26: Chen Li ( 李晨 )

Index scalability

Page 27: Chen Li ( 李晨 )

Conclusions

New data-access paradigm: search as you type

Many interesting and challenging problems.

http://tastier.ics.uci.edu/