Chen Li (李晨)
Search As You Type
Joint work with colleagues at UCI and Tsinghua.
Demos
http://www.cs.stanford.edu/ “Search” box: try “garcia molina”, then try “garcia monila”
http://directory.uci.edu/: try “venkatasubramanian”
http://psearch.ics.uci.edu/
http://fr.ics.uci.edu/haiti/
http://www.miamiherald.com/news/americas/haiti/connect/
http://ipubmed.ics.uci.edu/
Too many results!
Traditional Keyword Search
No result!
Complicated and still no result!
Interactive Fuzzy Keyword Search
What’s new?
Search on apple.com
Query: “itune”
Missing result!
Query: “itunes music”
Challenge: performance!
< 100 ms: server processing, network, JavaScript, etc.
Requirement for high query throughput
20 queries per second (QPS): 50 ms/query (at most)
100 QPS: 10 ms/query
Other challenges: ranking, space requirements, …
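The throughput figures above follow from simple arithmetic. A minimal sketch, assuming a single worker that handles queries strictly one at a time (an assumption made here for illustration; the slides do not describe the server setup), and using a hypothetical helper name:

```python
def max_ms_per_query(qps):
    """Upper bound on per-query time (ms) when one worker serves `qps` queries per second."""
    return 1000.0 / qps


for qps in (20, 100):
    print(f"{qps} QPS -> at most {max_ms_per_query(qps):.0f} ms/query")
```

With more workers running in parallel the per-query budget would grow proportionally, so the numbers on the slide are the strictest case.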
Two Features (Focus of this talk)
Fuzzy Search: finding results with approximate keywords
Full-text: find results with query keywords (not necessarily adjacently)
Edit Distance
ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
s1: v e n k a t s u b r a m a n i a n
s2: w e n k a t s u b r a m a n i a n
ed(s1, s2) = 1
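The definition above can be computed with the standard dynamic-programming recurrence. This is a textbook sketch for illustration, not code from the talk; the function name is an assumption.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the first i-1 letters of s1 and the first j letters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning i letters into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("venkatsubramanian", "wenkatsubramanian"))  # 1 (substitute v with w)
```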
Problem Setting
Data
R: a set of records
W: a set of distinct words
Query
Q = {p1, p2, …, pl}: a set of prefixes
δ: edit-distance threshold
Query result
RQ: a set of records such that each record has all query prefixes or their similar forms
Feature 1: Fuzzy Search
Formulation
Record strings: venkatasubramanian, carey, jain, nicolau, smith
Query: wenkatsubra
Find strings with a prefix similar to a query keyword
Do it incrementally!
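Observation
Strings = {exam, example, exemplar, exempt, sample}
Edit-distance threshold δ = 2

Q’ = exampl
Prefix    Distance
exam      2
examp     1
exampl    0
example   1
exemp     2
exempt    2
exempl    1
exempla   2
sampl     2

Q = example
Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

From Q’ to Q (new letter e):
examp (1), delete e: examp (2)
exampl (0), delete e: exampl (1)
exampl (0), match e: example (0)
exempl (1), delete e: exempl (2)
exempl (1), replace e with a: exempla (2)
sampl (2), match e: sample (2)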
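The two prefix tables in this observation can be reproduced by brute force: enumerate every prefix of every string and keep those within δ of the query. This is only a checking sketch with hypothetical function names. It is not the incremental method the talk develops, and it recomputes everything for each query.

```python
def edit_distance(s1, s2):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def similar_prefixes(strings, query, delta):
    """Map each prefix of the strings to its edit distance from query, if at most delta."""
    prefixes = {s[:i] for s in strings for i in range(len(s) + 1)}
    result = {}
    for p in prefixes:
        d = edit_distance(query, p)
        if d <= delta:
            result[p] = d
    return result


STRINGS = ["exam", "example", "exemplar", "exempt", "sample"]
for q in ("exampl", "example"):
    print(q, sorted(similar_prefixes(STRINGS, q, 2).items()))
```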
Trie Indexing
Computing set of active nodes ΦQ
Initialization
Incremental step
[Figure: trie over the strings exam, example, exemplar, exempt, sample ($ marks the end of a string), with the active nodes for Q = example labeled with their distances]
Active nodes for Q = example
Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2
Initialization Q = ε
Initializing Φε with all nodes within a depth of δ
[Figure: the same trie, with the active nodes for Q = ε labeled with their distances]
Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2
Incremental Algorithm: Overview
Access the leaf nodes of the active nodes as answers.
Incremental Computation: Example Q = e
[Figure: the same trie, shown with the active nodes for Q = ε and for Q = e labeled with their distances]
Active nodes for Q = ε
Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2
Prefix    # Op    Base    Op
ε         1       ε       del e
s         1       ε       sub e/s
e         0       ε       mat e
ex        1       ε       ins x
exa       2       ε       ins xa
exe       2       ε       ins xe
e         2       e       del e
ex        2       e       sub e/x
ex        3       ex      del e
exa       3       ex      sub e/a
exe       2       ex      mat e
s         2       s       del e
sa        2       s       sub e/a
sa        3       sa      del e
Active nodes for Q = e
Prefix    Distance
ε         1
s         1
e         0
ex        1
exa       2
exe       2
sa        2
Incremental Computation: Algorithm
Incremental computation from ΦQ’ to ΦQ
add(ΦQ, <n, d>) has effect only if d ≤ δ and there exists no active node in ΦQ with the same n and smaller d
FOR EACH <n, d> FROM ΦQ’
  Deletion
    add(ΦQ, <n, d+1>)
  Substitution
    FOR EACH n’ FROM non-matching children of n
      add(ΦQ, <n’, d+1>)
  Match
    add(ΦQ, <m, d>) (m is the matching child of n)
  Insertion
    FOR EACH m’ FROM descendants of m
      add(ΦQ, <m’, d+x>) (x is the distance from m’ to m)
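The initialization and the incremental step can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the class and function names are assumptions, and end-of-string is stored as a flag on the node rather than as a separate $ child.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix   # string spelled by the path from the root
        self.children = {}     # letter -> TrieNode
        self.is_word = False   # True if a complete string ends here


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
        node.is_word = True
    return root


def add(active, node, d, delta):
    """Keep <node, d> only if d <= delta and no smaller distance is stored."""
    if d <= delta and (node not in active or d < active[node]):
        active[node] = d


def add_with_descendants(active, node, d, delta):
    """Add node at distance d, and each descendant x levels below at d + x (insertions)."""
    add(active, node, d, delta)
    if d < delta:
        for child in node.children.values():
            add_with_descendants(active, child, d + 1, delta)


def initial_active_nodes(root, delta):
    """Active nodes for the empty query: all nodes within depth delta."""
    active = {}
    add_with_descendants(active, root, 0, delta)
    return active


def extend(prev_active, letter, delta):
    """Compute the active nodes of Q = Q' + letter from those of Q'."""
    active = {}
    for node, d in prev_active.items():
        add(active, node, d + 1, delta)                        # deletion
        for ch, child in node.children.items():
            if ch == letter:
                add_with_descendants(active, child, d, delta)  # match, then insertions
            else:
                add(active, child, d + 1, delta)               # substitution
    return active


def active_nodes(root, query, delta):
    active = initial_active_nodes(root, delta)
    for letter in query:
        active = extend(active, letter, delta)
    return active


def distances(active):
    return {node.prefix: d for node, d in active.items()}


def words_below(node):
    if node.is_word:
        yield node.prefix
    for child in node.children.values():
        yield from words_below(child)


def answers(active):
    """Complete strings stored at or below any active node."""
    return sorted({w for node in active for w in words_below(node)})


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(distances(active_nodes(root, "e", 2)))
print(answers(active_nodes(root, "example", 2)))  # ['example', 'exemplar', 'sample']
```

In an interactive setting only `extend` would run per keystroke, with the previous active-node set kept on the server or in a cache between requests.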
Algorithm Details
Feature 2: Full-text search
Find answers with query keywords
Not necessarily adjacently
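Multi-Prefix Intersection
Q = vldb li
ID    Record
1     Li data…
2     data…
3     data Lin…
4     Lu Lin Luis…
5     Liu…
6     VLDB Lin data…
7     VLDB…
8     Li VLDB…
Answers:
6     VLDB Lin data…
8     Li VLDB…
[Figure: trie over the words data, li, lin, liu, lu, luis, vldb; each leaf ($) stores the inverted list of record IDs for its word]
Inverted lists
data: 1 2 3 6
li: 1 8
lin: 3 4 6
liu: 5
lu: 4
luis: 4
vldb: 6 7 8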
Multi-Prefix Intersection: Method 1
[Figure: the records (IDs 1 to 8) and the trie with inverted lists at its leaves]
Q = vldb li
li: 1 3 4 5 6 8
vldb: 6 7 8
Intersection: 6 8
Space cost: inverted index
Time cost: union + intersection
More efficient intersection approaches…
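Method 1 can be sketched as a union of inverted lists per prefix followed by an intersection across prefixes. The sketch below is an assumption-laden illustration: it uses only the words visible on the slide (the records are truncated with “…”), scans the vocabulary instead of walking a trie, and uses hypothetical function names.

```python
from collections import defaultdict

# Only the words visible on the slide; the real records continue after "…".
RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_inverted_index(records):
    """Map each lowercased word to the set of record IDs containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for word in text.lower().split():
            index[word].add(rid)
    return index


def prefix_union(index, prefix):
    """Union of the inverted lists of all words starting with prefix."""
    ids = set()
    for word, rids in index.items():
        if word.startswith(prefix):
            ids |= rids
    return ids


def search(index, prefixes):
    """Records that contain, for every query prefix, some word with that prefix."""
    lists = [prefix_union(index, p) for p in prefixes]
    if not lists:
        return []
    return sorted(set.intersection(*lists))


index = build_inverted_index(RECORDS)
print(sorted(prefix_union(index, "li")))  # [1, 3, 4, 5, 6, 8]
print(search(index, ["vldb", "li"]))      # [6, 8]
```

The union step is what makes short prefixes expensive here, since a prefix such as "l" would pull in almost every list.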
Multi-Prefix Intersection: Method 2
[Figure: the same trie; each leaf has a word ID (data 1, li 2, lin 3, liu 4, lu 5, luis 6, vldb 7) and each node has the range [min, max] of word IDs in its subtree, e.g. [1, 7] at the root, [2, 6] at l, [2, 4] at li, [5, 6] at lu, [7, 7] at vldb]
Forward lists (record ID: word IDs)
1: 1 2
2: 1
3: 1 3
4: 3 5 6
5: 4
6: 1 3 7
7: 7
8: 2 7
Q = vldb li
Inverted list of vldb: 6 7 8
Word-ID range of li: [2, 4]
Read each record ID on the inverted list, then verify/probe its forward list against the range
6 VLDB Lin data… (forward list 1 3 7)
8 Li VLDB… (forward list 2 7)
Space cost: inverted + forward index
Time cost: probing forward lists
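Method 2 can be sketched with sorted word IDs, one forward list per record, and a binary-search probe. As before this is a hedged illustration: the records hold only the words visible on the slide, the word-ID range is found by scanning a sorted vocabulary rather than read off a trie node, and all function names are assumptions.

```python
from bisect import bisect_left

# Only the words visible on the slide; the real records continue after "…".
RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_indexes(records):
    """Return (sorted vocabulary, forward lists, inverted lists) using 1-based word IDs."""
    vocab = sorted({w for text in records.values() for w in text.lower().split()})
    word_id = {w: i for i, w in enumerate(vocab, start=1)}
    forward = {rid: sorted({word_id[w] for w in text.lower().split()})
               for rid, text in records.items()}
    inverted = {}
    for rid in sorted(records):
        for wid in forward[rid]:
            inverted.setdefault(wid, []).append(rid)
    return vocab, forward, inverted


def prefix_range(vocab, prefix):
    """Inclusive range of word IDs whose words start with prefix, or None."""
    lo = bisect_left(vocab, prefix)
    hi = lo
    while hi < len(vocab) and vocab[hi].startswith(prefix):
        hi += 1
    return (lo + 1, hi) if hi > lo else None


def probe(forward_list, lo, hi):
    """True if the sorted forward list contains a word ID in [lo, hi]."""
    i = bisect_left(forward_list, lo)
    return i < len(forward_list) and forward_list[i] <= hi


def search(vocab, forward, inverted, word, prefix):
    """Read the inverted list of a complete word and probe each record for the prefix."""
    word_range = prefix_range(vocab, word)
    rng = prefix_range(vocab, prefix)
    if word_range is None or rng is None or vocab[word_range[0] - 1] != word:
        return []
    return [rid for rid in inverted[word_range[0]] if probe(forward[rid], *rng)]


vocab, forward, inverted = build_indexes(RECORDS)
print(prefix_range(vocab, "li"))                       # (2, 4)
print(search(vocab, forward, inverted, "vldb", "li"))  # [6, 8]
```

Because similar words get contiguous IDs, a prefix becomes a single range, so each probe is one binary search no matter how many words share the prefix.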
Traversing inverted lists incrementally
Compute and cache only needed answers
For subsequent queries, compute the answers:
from the cached answers
from resuming previously terminated computation
Q = cs co
cached answers of cs co
traversal list: inverted list of cs
compute
Q = cs conf
Verify
cached answers of cs conf
Compute
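One way to read the cs co / cs conf example is sketched below: the traversal stops once k answers are found, the answers and the stopping position are cached, and an extended query first verifies the cached answers and then resumes the traversal. The records, the value of k, and every name in the code are hypothetical, and the sketch assumes the new prefix extends the previous one, so records already rejected never need to be revisited.

```python
from dataclasses import dataclass

# Hypothetical toy records: record ID -> words.
RECORDS = {
    1: ["cs", "conference"],
    2: ["cs", "course"],
    3: ["cs", "colloquium"],
    4: ["ee", "conference"],
    5: ["cs", "conf"],
    6: ["cs", "confocal"],
}


@dataclass
class CachedQuery:
    answers: list   # answers found so far
    position: int   # next index to read on the traversal list


def has_prefix(record_id, prefix):
    return any(w.startswith(prefix) for w in RECORDS[record_id])


def search(traversal_list, prefix, k, previous=None):
    """Find up to k records on traversal_list with a word starting with prefix.

    If previous is given, prefix must extend the prefix used for previous.
    """
    if previous is None:
        answers, position = [], 0
    else:
        # Verify: keep cached answers that still match the longer prefix.
        answers = [r for r in previous.answers if has_prefix(r, prefix)]
        position = previous.position
    # Compute: resume the traversal where the previous query stopped.
    while len(answers) < k and position < len(traversal_list):
        record_id = traversal_list[position]
        position += 1
        if has_prefix(record_id, prefix):
            answers.append(record_id)
    return CachedQuery(answers, position)


cs_list = [rid for rid in sorted(RECORDS) if "cs" in RECORDS[rid]]  # inverted list of cs
first = search(cs_list, "co", k=2)            # Q = cs co
second = search(cs_list, "conf", k=2, previous=first)  # Q = cs conf
print(first.answers, second.answers)          # [1, 2] [1, 5]
```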
Experimental Results
Computing similar prefixes
Multi-prefix intersection
Time scalability
Index scalability
Conclusions
New data-access paradigm: search as you type
Many interesting and challenging problems.
http://tastier.ics.uci.edu/