david novak and michal batko }w !#$%&'()+,-./012345

Metric Index: An Efficient and Scalable Solution forSimilarity Search

David Novak and Michal Batko

Masaryk UniversityBrno, Czech Republic

}w�� !"#$%&'()+,-./012345<yA|August 29, 2009

SISAP 2009, Prague

Novak, Batko (Masaryk University) Metric Index (M-Index) August 29, 2009 1 / 18

Outline of the Talk

1 Motivation and ObjectivesSimilarity Indexing for Metric Spaces

2 Metric IndexOne-level M-IndexMulti-level M-IndexDynamic M-IndexM-Index Search PrinciplesApproximate Strategy for M-Index

3 Experimental Evaluation


Similarity Indexing in Metric Spaces

Efficient and scalable metric-based index is still an issue

15 years of research:

principles of space partitioning, search space pruning, filtering

fundamental memory-based structures

advanced index and search solutions

approximate similarity search

distributed index structures


Metric Index

Intentions of M-Index:

synergically employ practically all known metric principlesof space pruning and filtering

fixed building costs (static set of reference points)

use well-established efficient structures to factually store data

indexing based on mapping to real domain (use B+-tree)

efficient precise and approximation similarity search

straightforward way to distribute the structure

Notation and assumptions:

Metric space M is a pair M = (D, d), where D is a set and d is atotal function d : D ×D −→ [0, 1) satisfying standard metric spaceconditions.


Preliminaries: iDistance

qr

*c *c0 c

0

p2

p1

2 3

C2

C1

p

0C

2*c 3*c0 c

C

C2

C1

p

p

p

1

0 2

0

(iDistance)

(b)(a)

iDist(o) = d(pi , o) + i · c

indexing technique for vector spaces

application of object-pivot distance constraint


M-Index Level One

p3

p0

p1

p2

p3C3

C2

C1

C0

C3

C2

C1

C0

p0

p1

p2

0 431 2

rq

index based purely on metric principles

Voronoi partitioning

double-pivot distance constraint for search-space pruning


Multi-level M-Index (1)

Having a fixed set of n pivots {p0, p1, . . . , pn−1} and an object o ∈ D, let

(·)o : {0, 1, . . . , n − 1} −→ {0, 1, . . . , n − 1}

be a permutation of indexes such that

d(p(0)o , o) ≤ d(p(1)o , o) ≤ · · · ≤ d(p(n−1)o , o).

l-level M-Index uses l-prefix of the pivot permutation, 1 ≤ l ≤ n.


Multi-level M-Index (2)

p3

p0

p1

p2

p3

p2

p0

p1

C3,2

C2,1

C1,2

C0,3

0,1C

C3,0

1,3C

C1,0

C2,3C3,1

C3

C2

C1

C0

in l-level M-Index, ∀o ∈ D : o ∈ C(0)o ,...,(l−1)o

repetitive application of double-pivot distance constraint


Multi-level M-Index Mapping

p1

C1,2C1,0

1,3C

......0 87654 16

key l(o) =

= d(p(0)o , o) +l−1∑i=0

(i)o · n(l−1−i)

integral part of the key

identification of the cluster

fractional part of the key

position within the cluster

size of the key l domain: nl


M-Index with Dynamic Level

0C 1C0,2C

l=2

l=3

l=1

0,1C

lkey

0,2,3C0,2,1C 0,2,C n nC ,0,1 nC ,0,2 nC ,0,n−1

...... ...

......

0 n3

C0,n

CnC

n,0

nC nC,1 ,n−1

establish a maximum level 1 ≤ lmax ≤ n

slightly modify the key l formula

analogous to extensible hashing + object-pivot distance

physically store the data according to the key

in a B+-tree or similar structure


M-Index Search Principles

Search space pruning principles for R(q, r) query

double-pivot distance constraint

skip accessing of cluster Ci if

d(pi , q)− d(p(0)q, q) > 2 · r

due to recursive Voronoi partitioningapply l-times for Ci0,...,il−1

range-pivot distance constraint

for leaf-level clusters Cp,∗ store min. and max. distance

rmax = max{d(p, o)|o ∈ Cp,∗}

skip accessing of cluster Cp,∗ if

d(p, q) + r < rmin or d(p, q)− r > rmax


M-Index Search Principles (cont.)

object-pivot distance constraint

the M-Index key contains the object-pivot distanceidentify interval of keys in cluster Ci0,...,il−1

[d(pi0 , q)− r , d(pi0 , q) + r ]

pivot filtering

store distances d(p0, o), . . . , d(pn−1, o) together with object oskip computation of d(q, o) at query time if

maxi∈{0,...,n−1}

|d(pi , q)− d(pi , o)| > r .


Approximate Strategy for M-Index

p3

p2

p0

p1

C3,2

C2,1

C1,2

C0,3

0,1C

q

C3,0

1,3C

C1,0

C2,3C3,1

0.2

0.3

0.25

0.5

Determine the order in which to visitindividual clusters

priority queue of clusters

heuristic which analyzes distancesd(p0, q), d(p1, q), . . . , d(pn−1, q)

each cluster is assigned a penalty

distance of the cluster from q

penalty(Ci0,...,il−1) =

l−1∑j=0

max{

d(pij , q)− d(p(j)q , q), 0}


Experimental Evaluation

200,000 objects from CoPhIR dataset

combination of five MPEG-7 descriptorsaltogether 280 dimensions, intrinsic dimensionality: 13

measure I/O costs (page reads and objects accessed), computationalcosts, response times

compare with iDistance, PM-tree (same implementation platform)

Distance computations performed during index construction (n = 20)

dataset size 20,000 80,000 140,000 200,000

M-Index 400,000 1,600,000 2,800,000 4,000,000

PM-Tree 1,205,538 6,299,207 11,627,729 16,897,996


Precise Strategy: Number of objects accessed

Data volume accessed for kNN(q, 50)

20000

40000

60000

80000

100000

0 10 20 30 40 50# of pivots

M−Index level 1M−Index level 2M−Index level 3Dynamic M−Index

# of

acc

esse

d ob

ject

s

0

dataset size: 100,000

dynamic M-Index: lmax = 5

0 20000 40000 60000 80000

100000 120000 140000 160000

0

# ac

cess

ed o

bjec

ts

dataset size

Accessed objects k=30

PM−TreeiDistanceM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index

50000 100000 150000 200000

20 pivots


Precise Strategy: I/O costs, response times


0 5000

10000 15000 20000 25000 30000 35000

0

# of

blo

ck r

eads

50000 100000 150000dataset size

I/O costs

200000

higher fragmentation with moreM-Index levels


0 200 400 600 800

1000 1200 1400 1600

0

Response times

50000 100000 150000 200000dataset size

avg

resp

onse

tim

e [m

s]

lower computational costs formore M-Index levels


Precise Strategy: I/O costs, response times


0 5000

10000 15000 20000 25000 30000 35000

0

# of

blo

ck r

eads

50000 100000 150000dataset size

I/O costs

200000

higher fragmentation with moreM-Index levels


0 200 400 600 800

1000 1200 1400 1600

0

Response times

50000 100000 150000 200000dataset size

avg

resp

onse

tim

e [m

s]lower computational costs formore M-Index levels


Approximate Strategy

0

20

40

60

80

100

0 5 10 15 20 25 30 35 40maximum percentage of accessed dat [%]

Recall for approximate kNN(q,30)


reca

ll [%

]

PM−Tree

dataset of 100,000 objects

PM−TreeM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index

0

20

40

60

80

100

0 50000 100000dataset size


reca

ll [%

]

150000 200000

algorithm accesses 10,000 objects


Approximate Strategy

0

20

40

60

80

100

0 5 10 15 20 25 30 35 40maximum percentage of accessed dat [%]



reca

ll [%

]

PM−Tree

dataset of 100,000 objects

PM−TreeM−Index level 1M−Index level 2M−Index level 3Dynamic M−Index

0

20

40

60

80

100

0 50000 100000dataset size


reca

ll [%

] 150000 200000

algorithm accesses 10,000 objects


Comparison with Purely Approximate Approaches

0

20

40

60

80

100

0 1% 3% 5%% database accessed

ideal cluster orderM−IndexSpearman footruleSpearman RhoPP−Index, z=1,000

Recall for kNN(q,30)

aver

age

reca

ll [%

]

2% 4%

dataset size: 1,000,000

32 pivots (n = 32)

dynamic M-Index withlmax = 6

Spearman Footrule modified to use permutation prefixes

Induced Footrule Distance [Amato, Savino: Approximate SimilaritySearch in Metric Spaces using Inverted Files]

Spearman Rho used in [Chavez, Figueroa, Navarro: EffectiveProximity Retrieval by Ordering Permutations]

PP-Index has a similar structure as M-Index

access subtrees with at least z = 1,000 objects


david novak and michal batko }w !#$%&'()+,-./012345

Documents