david novak and michal batko }w !#$%&'()+,-./012345
TRANSCRIPT
Metric Index: An Efficient and Scalable Solution forSimilarity Search
David Novak and Michal Batko
Masaryk UniversityBrno, Czech Republic
}w���������� ������������� !"#$%&'()+,-./012345<yA|August 29, 2009
SISAP 2009, Prague
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 1 / 18
Outline of the Talk
1 Motivation and ObjectivesSimilarity Indexing for Metric Spaces
2 Metric IndexOne-level M-IndexMulti-level M-IndexDynamic M-IndexM-Index Search PrinciplesApproximate Strategy for M-Index
3 Experimental Evaluation
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 2 / 18
Similarity Indexing in Metric Spaces
Efficient and scalable metric-based index is still an issue
15 years of research:
principles of space partitioning, search space pruning, filtering
fundamental memory-based structures
advanced index and search solutions
approximate similarity search
distributed index structures
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 3 / 18
Metric Index
Intentions of M-Index:
synergically employ practically all known metric principlesof space pruning and filtering
fixed building costs (static set of reference points)
use well-established efficient structures to factually store data
indexing based on mapping to real domain (use B+-tree)
efficient precise and approximation similarity search
straightforward way to distribute the structure
Notation and assumptions:
Metric space M is a pair M = (D, d), where D is a set and d is atotal function d : D ×D −→ [0, 1) satisfying standard metric spaceconditions.
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 4 / 18
Metric Index
Intentions of M-Index:
synergically employ practically all known metric principlesof space pruning and filtering
fixed building costs (static set of reference points)
use well-established efficient structures to factually store data
indexing based on mapping to real domain (use B+-tree)
efficient precise and approximation similarity search
straightforward way to distribute the structure
Notation and assumptions:
Metric space M is a pair M = (D, d), where D is a set and d is atotal function d : D ×D −→ [0, 1) satisfying standard metric spaceconditions.
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 4 / 18
Preliminaries: iDistance
qr
*c *c0 c
0
p2
p1
2 3
C2
C1
p
0C
2*c 3*c0 c
C
C2
C1
p
p
p
1
0 2
0
(iDistance)
(b)(a)
iDist(o) = d(pi , o) + i · c
indexing technique for vector spaces
application of object-pivot distance constraint
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 5 / 18
M-Index Level One
p3
p0
p1
p2
p3C3
C2
C1
C0
C3
C2
C1
C0
p0
p1
p2
0 431 2
rq
index based purely on metric principles
Voronoi partitioning
double-pivot distance constraint for search-space pruning
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 6 / 18
Multi-level M-Index (1)
Having a fixed set of n pivots {p0, p1, . . . , pn−1} and an object o ∈ D, let
(·)o : {0, 1, . . . , n − 1} −→ {0, 1, . . . , n − 1}
be a permutation of indexes such that
d(p(0)o , o) ≤ d(p(1)o , o) ≤ · · · ≤ d(p(n−1)o , o).
l-level M-Index uses l-prefix of the pivot permutation, 1 ≤ l ≤ n.
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 7 / 18
Multi-level M-Index (2)
p3
p0
p1
p2
p3
p2
p0
p1
C3,2
C2,1
C1,2
C0,3
0,1C
C3,0
1,3C
C1,0
C2,3C3,1
C3
C2
C1
C0
in l-level M-Index, ∀o ∈ D : o ∈ C(0)o ,...,(l−1)o
repetitive application of double-pivot distance constraint
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 8 / 18
Multi-level M-Index Mapping
p1
C1,2C1,0
1,3C
......0 87654 16
key l(o) =
= d(p(0)o , o) +l−1∑i=0
(i)o · n(l−1−i)
integral part of the key
identification of the cluster
fractional part of the key
position within the cluster
size of the key l domain: nl
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 9 / 18
M-Index with Dynamic Level
0C 1C0,2C
l=2
l=3
l=1
0,1C
lkey
0,2,3C0,2,1C 0,2,C n nC ,0,1 nC ,0,2 nC ,0,n−1
...... ...
......
0 n3
C0,n
CnC
n,0
nC nC,1 ,n−1
establish a maximum level 1 ≤ lmax ≤ n
slightly modify the key l formula
analogous to extensible hashing + object-pivot distance
physically store the data according to the key
in a B+-tree or similar structure
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 10 / 18
M-Index with Dynamic Level
0C 1C0,2C
l=2
l=3
l=1
0,1C
lkey
0,2,3C0,2,1C 0,2,C n nC ,0,1 nC ,0,2 nC ,0,n−1
...... ...
......
0 n3
C0,n
CnC
n,0
nC nC,1 ,n−1
establish a maximum level 1 ≤ lmax ≤ n
slightly modify the key l formula
analogous to extensible hashing + object-pivot distance
physically store the data according to the key
in a B+-tree or similar structure
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 10 / 18
M-Index Search Principles
Search space pruning principles for R(q, r) query
double-pivot distance constraint
skip accessing of cluster Ci if
d(pi , q)− d(p(0)q, q) > 2 · r
due to recursive Voronoi partitioningapply l-times for Ci0,...,il−1
range-pivot distance constraint
for leaf-level clusters Cp,∗ store min. and max. distance
rmax = max{d(p, o)|o ∈ Cp,∗}
skip accessing of cluster Cp,∗ if
d(p, q) + r < rmin or d(p, q)− r > rmax
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 11 / 18
M-Index Search Principles
Search space pruning principles for R(q, r) query
double-pivot distance constraint
skip accessing of cluster Ci if
d(pi , q)− d(p(0)q, q) > 2 · r
due to recursive Voronoi partitioningapply l-times for Ci0,...,il−1
range-pivot distance constraint
for leaf-level clusters Cp,∗ store min. and max. distance
rmax = max{d(p, o)|o ∈ Cp,∗}
skip accessing of cluster Cp,∗ if
d(p, q) + r < rmin or d(p, q)− r > rmax
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 11 / 18
M-Index Search Principles (cont.)
object-pivot distance constraint
the M-Index key contains the object-pivot distanceidentify interval of keys in cluster Ci0,...,il−1
[d(pi0 , q)− r , d(pi0 , q) + r ]
pivot filtering
store distances d(p0, o), . . . , d(pn−1, o) together with object oskip computation of d(q, o) at query time if
maxi∈{0,...,n−1}
|d(pi , q)− d(pi , o)| > r .
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 12 / 18
M-Index Search Principles (cont.)
object-pivot distance constraint
the M-Index key contains the object-pivot distanceidentify interval of keys in cluster Ci0,...,il−1
[d(pi0 , q)− r , d(pi0 , q) + r ]
pivot filtering
store distances d(p0, o), . . . , d(pn−1, o) together with object oskip computation of d(q, o) at query time if
maxi∈{0,...,n−1}
|d(pi , q)− d(pi , o)| > r .
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 12 / 18
Approximate Strategy for M-Index
p3
p2
p0
p1
C3,2
C2,1
C1,2
C0,3
0,1C
q
C3,0
1,3C
C1,0
C2,3C3,1
0.2
0.3
0.25
0.5
Determine the order in which to visitindividual clusters
priority queue of clusters
heuristic which analyzes distancesd(p0, q), d(p1, q), . . . , d(pn−1, q)
each cluster is assigned a penalty
distance of the cluster from q
penalty(Ci0,...,il−1) =
l−1∑j=0
max{
d(pij , q)− d(p(j)q , q), 0}
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 13 / 18
Experimental Evaluation
200,000 objects from CoPhIR dataset
combination of five MPEG-7 descriptorsaltogether 280 dimensions, intrinsic dimensionality: 13
measure I/O costs (page reads and objects accessed), computationalcosts, response times
compare with iDistance, PM-tree (same implementation platform)
Distance computations performed during index construction (n = 20)
dataset size 20,000 80,000 140,000 200,000
M-Index 400,000 1,600,000 2,800,000 4,000,000
PM-Tree 1,205,538 6,299,207 11,627,729 16,897,996
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 14 / 18
Precise Strategy: Number of objects accessed
Data volume accessed for kNN(q, 50)
20000
40000
60000
80000
100000
0 10 20 30 40 50# of pivots
M−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
# of
acc
esse
d ob
ject
s
0
dataset size: 100,000
dynamic M-Index: lmax = 5
0 20000 40000 60000 80000
100000 120000 140000 160000
0
# ac
cess
ed o
bjec
ts
dataset size
Accessed objects k=30
PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
50000 100000 150000 200000
20 pivots
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 15 / 18
Precise Strategy: Number of objects accessed
Data volume accessed for kNN(q, 50)
20000
40000
60000
80000
100000
0 10 20 30 40 50# of pivots
M−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
# of
acc
esse
d ob
ject
s
0
dataset size: 100,000
dynamic M-Index: lmax = 5
0 20000 40000 60000 80000
100000 120000 140000 160000
0
# ac
cess
ed o
bjec
ts
dataset size
Accessed objects k=30
PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
50000 100000 150000 200000
20 pivots
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 15 / 18
Precise Strategy: I/O costs, response times
PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
0 5000
10000 15000 20000 25000 30000 35000
0
# of
blo
ck r
eads
50000 100000 150000dataset size
I/O costs
200000
higher fragmentation with moreM-Index levels
PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
0 200 400 600 800
1000 1200 1400 1600
0
Response times
50000 100000 150000 200000dataset size
avg
resp
onse
tim
e [m
s]
lower computational costs formore M-Index levels
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 16 / 18
Precise Strategy: I/O costs, response times
PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
0 5000
10000 15000 20000 25000 30000 35000
0
# of
blo
ck r
eads
50000 100000 150000dataset size
I/O costs
200000
higher fragmentation with moreM-Index levels
PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
0 200 400 600 800
1000 1200 1400 1600
0
Response times
50000 100000 150000 200000dataset size
avg
resp
onse
tim
e [m
s]lower computational costs formore M-Index levels
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 16 / 18
Approximate Strategy
0
20
40
60
80
100
0 5 10 15 20 25 30 35 40maximum percentage of accessed dat [%]
Recall for approximate kNN(q,30)
M−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
reca
ll [%
]
PM−Tree
dataset of 100,000 objects
PM−TreeM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
0
20
40
60
80
100
0 50000 100000dataset size
Recall for approximate kNN(q,30)
reca
ll [%
]
150000 200000
algorithm accesses 10,000 objects
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 17 / 18
Approximate Strategy
0
20
40
60
80
100
0 5 10 15 20 25 30 35 40maximum percentage of accessed dat [%]
Recall for approximate kNN(q,30)
M−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
reca
ll [%
]
PM−Tree
dataset of 100,000 objects
PM−TreeM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index
0
20
40
60
80
100
0 50000 100000dataset size
Recall for approximate kNN(q,30)
reca
ll [%
] 150000 200000
algorithm accesses 10,000 objects
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 17 / 18
Comparison with Purely Approximate Approaches
0
20
40
60
80
100
0 1% 3% 5%% database accessed
ideal cluster orderM−IndexSpearman footruleSpearman RhoPP−Index, z=1,000
Recall for kNN(q,30)
aver
age
reca
ll [%
]
2% 4%
dataset size: 1,000,000
32 pivots (n = 32)
dynamic M-Index withlmax = 6
Spearman Footrule modified to use permutation prefixes
Induced Footrule Distance [Amato, Savino: Approximate SimilaritySearch in Metric Spaces using Inverted Files]
Spearman Rho used in [Chavez, Figueroa, Navarro: EffectiveProximity Retrieval by Ordering Permutations]
PP-Index has a similar structure as M-Index
access subtrees with at least z = 1,000 objects
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 18 / 18
Comparison with Purely Approximate Approaches
0
20
40
60
80
100
0 1% 3% 5%% database accessed
ideal cluster orderM−IndexSpearman footruleSpearman RhoPP−Index, z=1,000
Recall for kNN(q,30)
aver
age
reca
ll [%
]
2% 4%
dataset size: 1,000,000
32 pivots (n = 32)
dynamic M-Index withlmax = 6
Spearman Footrule modified to use permutation prefixes
Induced Footrule Distance [Amato, Savino: Approximate SimilaritySearch in Metric Spaces using Inverted Files]
Spearman Rho used in [Chavez, Figueroa, Navarro: EffectiveProximity Retrieval by Ordering Permutations]
PP-Index has a similar structure as M-Index
access subtrees with at least z = 1,000 objects
Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 18 / 18