
Multimedia Tools and Applications, 21, 9–33, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

D-Index: Distance Searching Index for Metric Data Sets

VLASTISLAV DOHNAL xdohnal@fi.muni.cz
Masaryk University Brno, Czech Republic

CLAUDIO GENNARO [email protected]
PASQUALE SAVINO [email protected]
ISI-CNR, Via Moruzzi, 1, 56124, Pisa, Italy

PAVEL ZEZULA zezula@fi.muni.cz
Masaryk University Brno, Czech Republic

Abstract. In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval. We propose a new index structure, called D-Index, that combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. We have qualitatively analyzed D-Index and verified its properties on an actual implementation. We have also compared D-Index with other index structures and demonstrated its superiority on several real-life data sets. Contrary to tree organizations, the D-Index structure is suitable for dynamic environments with a high rate of delete/insert operations.

Keywords: metric spaces, similarity search, index structures, performance evaluation

1. Introduction

The concept of similarity searching based on relative distances between a query and database objects has become essential for a number of application areas, e.g. data mining, signal processing, geographic databases, information retrieval, or computational biology. These areas usually exploit data types such as sets, strings, vectors, or complex structures that can be exemplified by XML documents. Intuitively, the problem is to find similar objects with respect to a query object according to a domain-specific distance measure. The problem can be formalized by the mathematical notion of the metric space, so the data elements are assumed to be objects from a metric space where only pairwise distances between the objects can be determined.

The growing need to deal with large, possibly distributed, archives requires indexing support to speed up retrieval. An interesting survey of indexes built on the metric distance paradigm can be found in [4]. The common assumption is that the costs to build and to maintain an index structure are much less important compared to the costs needed to execute a query. Though this is certainly true for prevalently static collections of data,


such as those occurring in traditional information retrieval, some other applications may need to deal with dynamic objects that are permanently subject to change as a function of time. In such environments, updates are inevitable and the costs of update are becoming a matter of concern.

One important characteristic of distance-based index structures is that, depending on the specific data and distance function, the performance can be either I/O or CPU bound. For example, let us consider two different applications: in the first one, the similarity of text strings of one hundred characters is computed by using the edit distance, which has a quadratic computational complexity; in the second application, one-hundred-dimensional vectors are compared by the inner product, with a linear computational complexity. Since the text objects are four times shorter than the vectors and the distance computation on the vectors is one hundred times faster than the edit distance, the minimization of I/O accesses for the vectors is much more important than for the text strings. On the contrary, the CPU costs are more significant for computing the distance of text strings than for the distance of vectors. In general, the strategy of indexes based on distances should be able to minimize both of the cost components.

An index structure supporting similarity retrieval should be able to execute similarity range queries with any search radius. However, a significant group of applications needs retrieval of objects which are in a very near vicinity of the query object, e.g., copy (replica) detection, cluster analysis, DNA mutations, corrupted signals after their transmission over noisy channels, or typing (spelling) errors. Thus in some cases, special attention may be warranted.

In this article, we propose a similarity search structure, called D-Index, that is able to reduce both the I/O and CPU costs. The organization stores data objects in buckets with direct access to avoid hierarchical bucket dependencies. Such organization also results in a very efficient insertion and deletion of objects. Though similarity constraints of queries can be defined arbitrarily, the structure is extremely efficient for queries searching for very close objects.

A short preliminary version of this article is available in [8]. The major extensions concern generic search algorithms, an extensive experimental evaluation, and comparison with other index structures. The rest of the article is organized as follows. In Section 2, we define the general principles of the D-Index. The system structure, algorithms, and design issues are specified in Section 3, while experimental evaluation results can be found in Section 4. The article concludes in Section 5.

2. Search strategies

A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d, a non-negative and symmetric function which satisfies the triangle inequality d(x, y) ≤ d(x, z) + d(z, y), ∀ x, y, z ∈ D. Without any loss of generality, we assume that the distance between any pair of objects never exceeds the maximum distance d+. In general, we study the following problem: Given a set X ⊆ D in the metric space M, pre-process or structure the elements of X so that similarity queries can be answered efficiently.


For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is, the set {x ∈ X | d(q, x) ≤ r}. A nearest neighbor query retrieves the k closest elements to q, that is, a set A ⊆ X such that |A| = k and ∀ x ∈ A, y ∈ X − A, d(q, x) ≤ d(q, y).
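The two query types can be stated as a naive linear scan, useful as a correctness baseline against which any index is measured (a minimal sketch; the function names are ours, not from the paper):

```python
import heapq

def range_query(X, d, q, r):
    """Return all objects of X within distance r of the query object q."""
    return [x for x in X if d(q, x) <= r]

def knn_query(X, d, q, k):
    """Return the k objects of X closest to q (ties broken arbitrarily)."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

# Toy metric space: integers under absolute difference.
d = lambda a, b: abs(a - b)
X = [1, 4, 7, 10, 13]
print(range_query(X, d, q=6, r=2))  # [4, 7]
print(knn_query(X, d, q=6, k=2))    # [7, 4]
```

Both scans compute |X| distances per query; the whole point of the D-Index is to answer the same queries while evaluating d far less often.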

According to the survey [4], existing approaches to search general metric spaces can be classified into two basic categories:

Clustering algorithms. The idea is to divide, often recursively, the search space into partitions (groups, sets, clusters) characterized by some representative information.

Pivot-based algorithms. The second trend is based on choosing t distinguished elements from D, storing for each database element x distances to these t pivots, and using such information for speeding up retrieval.

Though the number of existing search algorithms is impressive, see [4], the search costs are still high, and there is no algorithm that is able to outperform the others in all situations. Furthermore, most of the algorithms are implemented as main-memory structures, and thus do not scale up well to deal with large data files.

The access structure we propose in this paper, called D-Index, synergistically combines more principles into a single system in order to minimize the amount of accessed data, as well as the number of distance computations. In particular, we define a new technique to recursively cluster data in separable partitions of data blocks and we combine it with pivot-based strategies to decrease the I/O costs. In the following, we first formally specify the underlying principles of our clustering and pivoting techniques and then we present examples to illustrate the ideas behind the definitions.

2.1. Clustering through separable partitioning

To achieve our objectives, we base our partitioning principles on a mapping function, which we call the ρ-split function, where ρ is a real number constrained as 0 ≤ ρ < d+. In order to gradually explain the concept of ρ-split functions, we first define a first-order ρ-split function and its properties.

Definition 1. Given a metric space (D, d), a first-order ρ-split function s^{1,ρ} is the mapping s^{1,ρ}: D → {0, 1, −}, such that for arbitrary different objects x, y ∈ D:

– s^{1,ρ}(x) = 0 ∧ s^{1,ρ}(y) = 1 ⇒ d(x, y) > 2ρ (separable property);
– ρ2 ≥ ρ1 ∧ s^{1,ρ2}(x) ≠ − ∧ s^{1,ρ1}(y) = − ⇒ d(x, y) > ρ2 − ρ1 (symmetry property).

In other words, the ρ-split function assigns to each object of the space D one of the symbols 0, 1, or −. Moreover, the function must satisfy the separable and symmetry properties. The meaning and the importance of these properties will be clarified later.

We can generalize the ρ-split function by concatenating n first-order ρ-split functions with the purpose of obtaining a split function of order n.

Definition 2. Given n first-order ρ-split functions s^{1,ρ}_1, . . . , s^{1,ρ}_n in the metric space (D, d), a ρ-split function of order n, s^{n,ρ} = (s^{1,ρ}_1, s^{1,ρ}_2, . . . , s^{1,ρ}_n): D → {0, 1, −}^n, is the mapping such that for arbitrary different objects x, y ∈ D:

– ∀i s^{1,ρ}_i(x) ≠ − ∧ ∀j s^{1,ρ}_j(y) ≠ − ∧ s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x, y) > 2ρ (separable property);
– ρ2 ≥ ρ1 ∧ ∀i s^{1,ρ2}_i(x) ≠ − ∧ ∃j s^{1,ρ1}_j(y) = − ⇒ d(x, y) > ρ2 − ρ1 (symmetry property).

An obvious consequence of the ρ-split function definitions, useful for our purposes, is that by combining n ρ-split functions of order 1, s^{1,ρ}_1, . . . , s^{1,ρ}_n, which have the separable and symmetric properties, we obtain a ρ-split function of order n, s^{n,ρ}, which also demonstrates the separable and symmetric properties. We often refer to the number of symbols generated by s^{n,ρ}, that is, the parameter n, as the order of the ρ-split function. In order to obtain an addressing scheme, we need another function that transforms the ρ-split strings into integers, which we define as follows.

Definition 3. Given a string b = (b1, . . . , bn) of n elements 0, 1, or −, the function ⟨·⟩: {0, 1, −}^n → [0..2^n] is specified as:

  ⟨b⟩ = [b1, b2, . . . , bn]_2 = Σ_{j=1..n} 2^{j−1} b_j,  if ∀j b_j ≠ −
  ⟨b⟩ = 2^n,  otherwise

When all the elements are different from '−', the function ⟨b⟩ simply translates the string b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise the function returns 2^n.

By means of the ρ-split function and the ⟨·⟩ operator we can assign an integer number i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊂ D into 2^n + 1 disjoint sub-sets.
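Definition 3 can be captured in a few lines (a sketch; note that, per the 2^{j−1} weights, the binary interpretation takes b1 as the least significant bit):

```python
def bucket_index(b):
    """Map a split string over {'0', '1', '-'} to an integer in [0, 2**n].

    Any string containing '-' maps to 2**n, the exclusion set; the
    remaining strings are read as binary numbers with b[0] as the
    least significant bit (weight 2**(j-1) for the j-th element).
    """
    n = len(b)
    if '-' in b:
        return 2 ** n
    return sum(2 ** j for j, bit in enumerate(b) if bit == '1')

print(bucket_index('01'))  # 2  (b1=0, b2=1 -> 0*1 + 1*2)
print(bucket_index('10'))  # 1
print(bucket_index('-1'))  # 4  (exclusion set, 2^2)
```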

Assuming a ρ-split function s^{n,ρ}, we use the capital letter S^{n,ρ} to designate partitions (sub-sets) of objects from X which the function can produce. More precisely, given a ρ-split function s^{n,ρ} and a set of objects X, we define S^{n,ρ}_{[i]}(X) = {x ∈ X | ⟨s^{n,ρ}(x)⟩ = i}. We call the first 2^n sets the separable sets. The last set, that is, the set of objects for which the function ⟨s^{n,ρ}(x)⟩ evaluates to 2^n, is called the exclusion set.

For illustration, given two split functions of order 1, s^{1,ρ}_1 and s^{1,ρ}_2, a ρ-split function of order 2 produces strings as a concatenation of the strings returned by the functions s^{1,ρ}_1 and s^{1,ρ}_2.

The combination of the ρ-split functions can also be seen from the perspective of the object data set. Considering again the illustrative example above, the resulting function of order 2 generates an exclusion set as the union of the exclusion sets of the two first-order split functions, i.e. S^{1,ρ}_{1[2]}(D) ∪ S^{1,ρ}_{2[2]}(D). Moreover, the four separable sets, which are given by the combination of the separable sets of the original split functions, are determined as follows: {S^{1,ρ}_{1[0]}(D) ∩ S^{1,ρ}_{2[0]}(D), S^{1,ρ}_{1[0]}(D) ∩ S^{1,ρ}_{2[1]}(D), S^{1,ρ}_{1[1]}(D) ∩ S^{1,ρ}_{2[0]}(D), S^{1,ρ}_{1[1]}(D) ∩ S^{1,ρ}_{2[1]}(D)}.

The separable property of the ρ-split function allows us to partition a set of objects X into sub-sets S^{n,ρ}_{[i]}(X), so that the distance from an object in one sub-set to another object in a different sub-set is more than 2ρ, which is true for all i < 2^n. We say that such a disjoint separation of sub-sets, or partitioning, is separable up to 2ρ. This property will be used during retrieval, since a range query with radius ≤ ρ requires access to only one of the separable sets and, possibly, the exclusion set. For convenience, we denote a set of separable


partitions {S^{n,ρ}_{[0]}(D), . . . , S^{n,ρ}_{[2^n−1]}(D)} as {S^{n,ρ}_{[·]}(D)}. The last partition, S^{n,ρ}_{[2^n]}(D), is the exclusion set.

When the ρ parameter changes, the separable partitions shrink or widen correspondingly. The symmetric property guarantees a uniform reaction in all the partitions. This property is essential in the similarity search phase, as detailed in Sections 3.2 and 3.3.

2.2. Pivot-based filtering

In general, the pivot-based algorithms can be viewed as a mapping F from the original metric space (D, d) to a t-dimensional vector space with the L∞ distance. The mapping assumes a set T = {p1, p2, . . . , pt} of objects from D, called pivots, and for each database object o, the mapping determines its characteristic (feature) vector as F(o) = (d(o, p1), d(o, p2), . . . , d(o, pt)). We designate the new metric space as MT = (R^t, L∞). At search time, we compute for a query object q the query feature vector F(q) = (d(q, p1), d(q, p2), . . . , d(q, pt)) and discard, for the search radius r, an object o if

L∞(F(o), F(q)) > r (1)

In other words, the object o can be discarded if for some pivot pi,

|d(q, pi) − d(o, pi)| > r (2)

Due to the triangle inequality, the mapping F is contractive, that is, all discarded objects do not belong to the result set. However, some not-discarded objects may not be relevant and must be verified through the original distance function d(·). For more details see, for example, [3].
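The filter of Eqs. (1) and (2) is straightforward to sketch (a minimal illustration; the function names are ours):

```python
def feature_vector(o, pivots, d):
    """F(o): the vector of distances from object o to each pivot."""
    return [d(o, p) for p in pivots]

def can_discard(fo, fq, r):
    """Discard o if L_inf(F(o), F(q)) > r, i.e. some pivot alone
    proves o lies outside the query ball of radius r (Eq. 2)."""
    return max(abs(a - b) for a, b in zip(fo, fq)) > r

# Toy example: points on a line under absolute difference.
d = lambda a, b: abs(a - b)
pivots = [0, 100]
fo = feature_vector(42, pivots, d)  # [42, 58]
fq = feature_vector(50, pivots, d)  # [50, 50]
print(can_discard(fo, fq, r=5))     # True: |42 - 50| = 8 > 5
print(can_discard(fo, fq, r=10))    # False: candidate, verify with d
```

The feature vectors are computed once at insertion time, so each discarded object costs only t cheap subtractions instead of one evaluation of d.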

2.3. Illustration of the idea

Before we specify the structure and the insertion/search algorithms of D-Index, we illustrate the ρ-split clustering and pivoting techniques by simple examples.

Though several different types of first-order ρ-split functions are proposed, analyzed, and evaluated in [7], the ball partitioning split (bps), originally proposed in [12] under the name excluded middle partitioning, provided the smallest exclusion set. For this reason, we also apply this approach, which can be characterized as follows.

The ball partitioning ρ-split function bps^ρ(x, xv) uses one object xv ∈ D and the medium distance dm to partition the data file into three subsets BPS^{1,ρ}_{[0]}, BPS^{1,ρ}_{[1]} and BPS^{1,ρ}_{[2]}; the result of the following function gives the index of the set to which the object x belongs (see figure 1).

  bps^{1,ρ}(x) = 0,  if d(x, xv) ≤ dm − ρ
                 1,  if d(x, xv) > dm + ρ
                 −,  otherwise        (3)


Figure 1. The excluded middle partitioning.

Note that the median distance dm is relative to xv and it is defined so that the number of objects with distances smaller than dm is the same as the number of objects with distances larger than dm. For more details see [11].
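The bps function of Eq. (3) is easy to render in code (a sketch; here dm is estimated as the median of distances from xv to a sample, as the text describes):

```python
import statistics

def make_bps(xv, dm, rho, d):
    """Build the ball-partitioning rho-split function of Eq. (3)."""
    def bps(x):
        dist = d(x, xv)
        if dist <= dm - rho:
            return '0'   # inside the shrunken ball
        if dist > dm + rho:
            return '1'   # outside the enlarged ball
        return '-'       # excluded middle of width 2*rho
    return bps

d = lambda a, b: abs(a - b)
sample = [1, 3, 5, 7, 9, 11, 13]
xv = 0
dm = statistics.median(d(x, xv) for x in sample)  # 7
bps = make_bps(xv, dm, rho=1.5, d=d)
print([bps(x) for x in sample])  # ['0', '0', '0', '-', '1', '1', '1']
```

Note how the separable property shows up: any object labeled '0' (distance ≤ 5.5 from xv) and any object labeled '1' (distance > 8.5) are necessarily more than 2ρ = 3 apart.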

In order to see that the bps^ρ-split function behaves in the desired way, it is sufficient to prove that the separable and symmetric properties hold.

Separable property. We have to prove that ∀ x0 ∈ BPS^{1,ρ}_{[0]}, x1 ∈ BPS^{1,ρ}_{[1]}: d(x0, x1) > 2ρ. If we consider Definition 1 of the split function, we can write d(x0, xv) ≤ dm − ρ and d(x1, xv) > dm + ρ. Since the triangle inequality among x0, x1, xv holds, we get d(x0, x1) + d(x0, xv) ≥ d(x1, xv). If we combine these inequalities and simplify the expression, we get d(x0, x1) > 2ρ.

Symmetric property. We have to prove that ∀ x0 ∈ BPS^{1,ρ2}_{[0]}, x1 ∈ BPS^{1,ρ2}_{[1]}, y ∈ BPS^{1,ρ1}_{[2]}: d(x0, y) > ρ2 − ρ1 and d(x1, y) > ρ2 − ρ1, with ρ2 ≥ ρ1. We prove it only for the object x0 because the proof for x1 is analogous. Since y ∈ BPS^{1,ρ1}_{[2]}, then d(y, xv) > dm − ρ1. Moreover, since x0 ∈ BPS^{1,ρ2}_{[0]}, then dm − ρ2 ≥ d(x0, xv). By summing both sides of the above inequalities we obtain d(y, xv) − d(x0, xv) > ρ2 − ρ1. Finally, from the triangle inequality we have that d(x0, y) ≥ d(y, xv) − d(x0, xv) and then d(x0, y) > ρ2 − ρ1.

As explained in Section 2.1, once we have defined a set of first-order ρ-split functions, it is possible to combine them in order to obtain a function which generates more partitions. This idea is depicted in figure 2, where two bps-split functions on the two-dimensional space are used. The domain D, represented by the grey square, is divided into four regions {S^{2,ρ}_{[·]}(D)} = {S^{2,ρ}_{[0]}(D), S^{2,ρ}_{[1]}(D), S^{2,ρ}_{[2]}(D), S^{2,ρ}_{[3]}(D)}, corresponding to the separable partitions. The partition S^{2,ρ}_{[4]}(D), the exclusion set, is represented by the brighter region and it is formed by the union of the exclusion sets resulting from the two splits.


Figure 2. Clustering through partitioning in the two-dimensional space.

Figure 3. Example of pivots behavior.

In figure 3, the basic principle of the pivoting technique is illustrated, where q is the query object, r the query radius, and pi a pivot. Provided the distance between any object and pi is known, the gray area represents the region of objects x that do not belong to the query result. This can easily be decided without actually computing the distance between q and x, by using the triangle inequalities d(x, pi) + d(x, q) ≥ d(pi, q) and d(pi, q) + d(x, q) ≥ d(x, pi), and respecting the query radius r. Naturally, by using more pivots, we can improve the probability of excluding an object without actually computing its distance with respect to q.

3. System structure and algorithms

In this section we specify the structure of the Distance Searching Index, D-Index, and discuss the problems related to the definition of ρ-split functions and the choice of reference objects.


Then we specify the insertion algorithm and outline the principle of searching. Finally, we define the generic similarity range and nearest neighbor algorithms.

3.1. Storage architecture

The basic idea of the D-Index is to create a multilevel storage and retrieval structure that uses several ρ-split functions, one for each level, to create an array of buckets for storing objects. On the first level, we use a ρ-split function for separating objects of the whole data set. For any other level, objects mapped to the exclusion bucket of the previous level are the candidates for storage in separable buckets of this level. Finally, the exclusion bucket of the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting that the ρ-split functions of individual levels use the same ρ. Moreover, split functions can have different orders, typically decreasing with the level, allowing the D-Index structure to have levels with a different number of buckets. More precisely, the D-Index structure can be defined as follows.

Definition 4. Given a set of objects X, the h-level distance searching index DI^ρ(X, m1, m2, . . . , mh), with the buckets at each level separable up to 2ρ, is determined by h independent ρ-split functions s^{mi,ρ}_i (i = 1, 2, . . . , h), which generate:

  Exclusion buckets:  E_1 = S^{m1,ρ}_{1[2^{m1}]}(X);  E_i = S^{mi,ρ}_{i[2^{mi}]}(E_{i−1}) for i > 1

  Separable buckets:  {B_{1,0}, B_{1,1}, . . . , B_{1,2^{m1}−1}} = {S^{m1,ρ}_{1[·]}(X)};
                      {B_{i,0}, B_{i,1}, . . . , B_{i,2^{mi}−1}} = {S^{mi,ρ}_{i[·]}(E_{i−1})} for i > 1

From the structure point of view, the buckets can be seen as the following two-dimensional array consisting of 1 + Σ_{i=1..h} 2^{mi} elements:

  B_{1,0}, B_{1,1}, . . . , B_{1,2^{m1}−1}
  B_{2,0}, B_{2,1}, . . . , B_{2,2^{m2}−1}
  . . .
  B_{h,0}, B_{h,1}, . . . , B_{h,2^{mh}−1}, E_h

All separable buckets are included, but only the E_h exclusion bucket is present; the exclusion buckets E_i with i < h are recursively re-partitioned on level i + 1. Then, for each row (i.e. D-Index level) i, the 2^{mi} buckets are separable up to 2ρ, thus we are sure that there do not exist two buckets at the same level i both containing relevant objects for any similarity range query with radius rx ≤ ρ. Naturally, there is a tradeoff between the number of separable buckets and the amount of objects in the exclusion bucket: the more separable buckets there are, the greater the number of objects in the exclusion bucket is. However, the set of objects in the exclusion bucket can recursively be re-partitioned, with a possibly different number of bits (mi) and certainly different ρ-split functions.


3.2. Insertion and search strategies

In order to complete our description of the D-Index, we present an insertion algorithm and a sketch of a simplified search algorithm. Advanced search techniques are described in the next section.

Insertion of an object x ∈ X into DI^ρ(X, m1, m2, . . . , mh) proceeds according to the following algorithm.

Algorithm 1. Insertion

  for i = 1 to h
    if ⟨s^{mi,ρ}_i(x)⟩ < 2^{mi}
      x ↦ B_{i,⟨s^{mi,ρ}_i(x)⟩};
      exit;
    end if
  end for
  x ↦ E_h;

Starting from the first level, Algorithm 1 tries to accommodate x in a separable bucket. If a suitable bucket exists, the object is stored in this bucket. If it fails for all levels, the object x is placed in the exclusion bucket E_h. In any case, the insertion algorithm determines exactly one bucket to store the object.
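Algorithm 1 translates directly into code. The sketch below is ours, not the paper's implementation: split functions are assumed to be callables returning strings over {'0', '1', '-'}, and bucket_index is the ⟨·⟩ operator of Definition 3:

```python
class DIndex:
    """Minimal h-level D-Index skeleton holding objects in bucket lists."""

    def __init__(self, split_functions):
        # split_functions[i] maps an object to a string over {'0','1','-'}.
        self.splits = split_functions
        self.levels = [{} for _ in split_functions]  # separable buckets per level
        self.exclusion = []                          # E_h, global exclusion bucket

    @staticmethod
    def bucket_index(b):
        """The <.> operator of Definition 3 (b[0] is the least significant bit)."""
        n = len(b)
        if '-' in b:
            return 2 ** n
        return sum(2 ** j for j, bit in enumerate(b) if bit == '1')

    def insert(self, x):
        """Algorithm 1: store x in the first separable bucket that accepts it."""
        for level, split in enumerate(self.splits):
            s = split(x)
            i = self.bucket_index(s)
            if i < 2 ** len(s):              # separable bucket found
                self.levels[level].setdefault(i, []).append(x)
                return (level, i)
        self.exclusion.append(x)             # fell through every level
        return ('exclusion', None)

# One-level example with a bps split (pivot 0, dm = 7, rho = 2):
d = lambda a, b: abs(a - b)
split = lambda x: '0' if d(x, 0) <= 5 else ('1' if d(x, 0) > 9 else '-')
idx = DIndex([split])
print(idx.insert(3))   # (0, 0)
print(idx.insert(12))  # (0, 1)
print(idx.insert(7))   # ('exclusion', None)  -- excluded middle, no deeper level
```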

Given a query region Q = R(q, rq) with q ∈ D and rq ≤ ρ, a simple algorithm can execute the query as follows.

Algorithm 2. Search

  for i = 1 to h
    return all objects x such that x ∈ Q ∩ B_{i,⟨s^{mi,0}_i(q)⟩};
  end for
  return all objects x such that x ∈ Q ∩ E_h;

The function ⟨s^{mi,0}_i(q)⟩ always gives a value smaller than 2^{mi}, because ρ = 0. Consequently, one separable bucket on each level i is determined. Objects from the query response set cannot be in any other separable bucket on the level i, because rq is not greater than ρ and the buckets are separable up to 2ρ. However, some of them can be in the exclusion zone, and Algorithm 2 assumes that exclusion buckets are always accessed. That is why all levels are considered, and also the exclusion bucket E_h is accessed. The execution of Algorithm 2 requires h + 1 bucket accesses, which forms the upper bound of a more sophisticated algorithm described in the next section.

3.3. Generic search algorithms

The search Algorithm 2 requires access to one bucket at each level of the D-Index, plus the exclusion bucket. In the following two situations, however, the number of accesses can


even be reduced:

– if the query region is contained in the exclusion partition of the level i, then the query cannot have objects in the separable buckets of this level, and only the next level, if it exists, must be considered;

– if the query region is contained in a separable partition of the level i, the following levels, as well as the exclusion bucket for i = h, need not be accessed, thus the search terminates on this level.

Another drawback of the simple algorithm is that it works only for search radii up to ρ. However, with additional computational effort, queries with rq > ρ can also be executed. Indeed, queries with rq > ρ can be executed by evaluating the split function s^{rq−ρ}. In case s^{rq−ρ} returns a string without any '−', the result is contained in a single bucket (namely B_{⟨s^{rq−ρ}⟩}) plus, possibly, the exclusion bucket.

Let us now consider that the string returned contains at least one '−'. We indicate this string as (b1, . . . , bn) with bi ∈ {0, 1, −}. In case there is only one bi = '−', we must access all buckets B whose index is obtained by substituting the '−' in s^{rq−ρ} with 0 and 1. In the most general case we must substitute all the '−' in (b1, . . . , bn) with zeros and ones and generate all possible combinations. A simple example of this concept is illustrated in figure 4.

In order to define an algorithm for this process, we need some additional terms and notation.

Definition 5. We define an extended exclusive OR bit operator, ⊗, which is based on the following truth table:

  bit1:  0  0  0  1  1  1  −  −  −
  bit2:  0  1  −  0  1  −  0  1  −
  ⊗:     0  1  0  1  0  0  0  0  0

Figure 4. Example of use of the function G.

Page 11: D-Index: Distance Searching Index for Metric Data Sets · D-INDEX: DISTANCE SEARCHING INDEX FOR METRIC DATA SETS 11 For a query object q ∈ D, two fundamental similarity queries

D-INDEX: DISTANCE SEARCHING INDEX FOR METRIC DATA SETS 19

Notice that the operator ⊗ can be used bit-wise on two strings of the same length and that it always returns a standard binary number (i.e., it does not contain any '−'). Consequently, ⟨s1 ⊗ s2⟩ < 2^n is always true for strings s1 and s2 of length n (see Definition 3).

Definition 6. Given a string s of length n, G(s) denotes a sub-set of Ω_n = {0, 1, . . . , 2^n − 1}, such that all elements ei ∈ Ω_n for which ⟨s ⊗ ei⟩ ≠ 0 are eliminated (interpreting ei as a binary string of length n).

Observe that G(− − · · · −) = Ω_n, and that the cardinality of the set is 2 raised to the number of '−' elements; that is, G generates all partition identifications in which the symbol '−' is alternatively substituted by zeros and ones. In fact, we can use G(·) to generate, from a string returned by a split function, the set of partitions that need to be accessed to execute the query. As an example, let us consider n = 2 as in figure 4. If s^{2,rq−ρ} = (−1) then we must access buckets B1 and B3. In such a case G(−1) = {1, 3}. If s^{2,rq−ρ} = (−−) then we must access all buckets, as G(−−) = {0, 1, 2, 3} indicates.

We only apply the function G when the search radius rq is greater than the parameter ρ. We first evaluate the split function using rq − ρ (which is greater than 0). Next, the function G is applied to the resulting string. In this way, G generates the partitions that intersect the query region.

In the following, we specify how such ideas are applied in the D-Index to implement generic range and nearest neighbor algorithms.

3.3.1. Range queries. Given a query region Q = R(q, rq) with q ∈ D and rq ≤ d+, an advanced algorithm can execute the similarity range query as follows.

Algorithm 3. Range search

  01. for i = 1 to h
  02.   if ⟨s^{mi,ρ+rq}_i(q)⟩ < 2^{mi} then (exclusive containment in a separable bucket)
  03.     return all objects x such that x ∈ Q ∩ B_{i,⟨s^{mi,ρ+rq}_i(q)⟩}; exit;
  04.   end if
  05.   if rq ≤ ρ then (search radius up to ρ)
  06.     if ⟨s^{mi,ρ−rq}_i(q)⟩ < 2^{mi} then (not exclusive containment in an exclusion bucket)
  07.       return all objects x such that x ∈ Q ∩ B_{i,⟨s^{mi,ρ−rq}_i(q)⟩};
  08.     end if
  09.   else (search radius greater than ρ)
  10.     let {l1, l2, . . . , lk} = G(s^{mi,rq−ρ}_i(q))
  11.     return all objects x such that x ∈ Q ∩ B_{i,l1} or x ∈ Q ∩ B_{i,l2} or . . . or x ∈ Q ∩ B_{i,lk};
  12.   end if
  13. end for
  14. return all objects x such that x ∈ Q ∩ E_h;


In general, Algorithm 3 considers all D-Index levels and eventually also accesses the global exclusion bucket. However, due to the symmetry property, the test on line 02 can discover the exclusive containment of the query region in a separable bucket and terminate the search earlier. Otherwise, the algorithm proceeds according to the size of the query radius. If it is not greater than ρ (line 05), there are two possibilities. If the test on line 06 is satisfied, one separable bucket is accessed. Otherwise no separable bucket is accessed on this level, because the query region is, from this level's point of view, exclusively in the exclusion zone. Provided the search radius is greater than ρ, more separable buckets are accessed on a specific level, as defined by lines 10 and 11. Unless terminated earlier, the algorithm accesses the exclusion bucket at line 14.

3.3.2. Nearest neighbor queries. The task of the nearest neighbor search is to retrieve the k closest elements from X to q ∈ D, respecting the metric M. Specifically, it retrieves a set A ⊆ X such that |A| = k and ∀ x ∈ A, y ∈ X − A, d(q, x) ≤ d(q, y). In case of ties, we are satisfied with any set of k elements complying with the condition. For convenience, we designate the distance to the k-th nearest neighbor as dk (dk ≤ d+).

The general strategy of Algorithm 4 works as follows. At the beginning (line 01), we assume that the response set A is empty and dk = d+. The algorithm then proceeds with an optimistic strategy (lines 02 through 13), assuming that the k-th nearest neighbor is at distance maximally ρ. If this fails, see the test on line 14, additional search steps are performed to find the correct result. Specifically, the first phase of the algorithm determines the buckets (maximally one separable bucket at each level) by using the range query strategy with radius r = min{dk, ρ}. In this way, the most promising buckets are accessed and their identifications are remembered to avoid multiple bucket accesses. If dk ≤ ρ, the search terminates successfully and A contains the result. Otherwise, additional bucket accesses are performed (lines 15 through 22) using the strategy of Algorithm 3 for radii greater than ρ, ignoring the buckets already accessed in the first (optimistic) phase.

Algorithm 4. Nearest neighbor search

01. A = ∅, dk = d+; (initialization)
02. for i = 1 to h (first - optimistic - phase)
03.   r = min{dk, ρ};
04.   if 〈s_i^{mi, ρ+r}(q)〉 < 2^mi then (exclusive containment in a separable bucket)
05.     access bucket B_{i, 〈s_i^{mi, ρ+r}(q)〉}; update A and dk;
06.     if dk ≤ ρ then exit; (the response set is determined)
07.   else
08.     if 〈s_i^{mi, ρ−r}(q)〉 < 2^mi then (not exclusive containment in an exclusion bucket)
09.       access bucket B_{i, 〈s_i^{mi, 0}(q)〉}; update A and dk;
10.     end if
11.   end if
12. end for
13. access bucket E_h; update A and dk;
14. if dk > ρ then (second phase - if needed)
15.   for i = 1 to h
16.     if 〈s_i^{mi, ρ+dk}(q)〉 < 2^mi then (exclusive containment in a separable bucket)
17.       access bucket B_{i, 〈s_i^{mi, ρ+dk}(q)〉} if not already accessed; update A and dk; exit;
18.     else
19.       let {b1, b2, . . . , bk} = G(s_i^{mi, dk−ρ}(q))
20.       access buckets B_{i,b1}, B_{i,b2}, . . . , B_{i,bk} if not accessed; update A and dk;
21.     end if
22.   end for
23. end if

The output of the algorithm is the result set A and the distance to the last (k-th) nearest neighbor dk. Whenever a bucket is accessed, the response set A and the distance to the k-th nearest neighbor must be updated. In particular, the new response set is determined as the result of the nearest neighbor query over the union of objects from the accessed bucket and the previous response set A. The value of dk, which is the current search radius in the given bucket, is adjusted according to the new response set. Notice that the pivots are used during the bucket accesses, as explained in the next subsection.
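The "update A and dk" step at lines 05, 09, 13, 17, and 20 can be kept incremental with a max-heap over the current k best candidates. The helper below is an illustrative sketch, not taken from the paper's implementation; the name update_candidates is hypothetical.

```python
import heapq


def update_candidates(heap, k, q, bucket, d):
    """Merge a newly accessed bucket into the k-NN candidate set.

    heap is a max-heap of (negative distance, object) pairs holding the
    current best k candidates. Returns the updated dk, i.e. the distance
    to the current k-th nearest neighbor (infinity, the stand-in for d+,
    while fewer than k candidates are known)."""
    for x in bucket:
        dist = d(q, x)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, x))
        elif dist < -heap[0][0]:
            # x is closer than the current k-th candidate: replace it
            heapq.heapreplace(heap, (-dist, x))
    return float('inf') if len(heap) < k else -heap[0][0]
```

Each bucket access thus costs O(|bucket| log k) bookkeeping, and the shrinking dk returned after every access is exactly the radius used in the tests of the second phase.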

3.4. Design issues

In order to apply the D-Index, the ρ-split functions must be designed, which obviously requires specification of reference (pivot) objects. An important feature of the D-Index is that the same reference objects used in the ρ-split functions for the partitioning into buckets are also used as pivots for the internal management of buckets.

3.4.1. Choosing references. The problem of choosing reference objects (pivots) is important for any search technique in the general metric space, because all such algorithms need, directly or indirectly, some “anchors” for partitioning and search pruning. It is well known that the way in which pivots are selected can affect the performance of such algorithms. This has been recognized and demonstrated by several researchers, e.g. [11] or [1], but specific suggestions for choosing good reference objects are rather vague. In this article, we use the same sets of reference objects both for the separable partitioning and the pivot-based filtering.

Recently, the problem was systematically studied in [2], and several strategies for selecting pivots have been proposed and tested. The generic conclusion is that the mean µMT of the distance distribution in the feature space MT should be as high as possible, which is formalized by the hypothesis that the set of pivots T1 = {p1, p2, . . . , pt} is better than the set T2 = {p′1, p′2, . . . , p′t} if

µMT1 > µMT2   (4)

However, the problem is how to find the mean for a given pivot set T. An estimate of such a quantity is proposed in [2] and computed as follows:


– at random, choose l pairs of elements {(o1, o′1), (o2, o′2), . . . , (ol, o′l)} from D;

– for all pairs of elements, compute their distances in the feature space determined by the pivot set T;

– compute µMT as the mean of these distances.
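The three steps above can be sketched as follows, assuming (consistently with the rest of the paper) that the feature space MT maps an object o to the vector (d(o, p1), . . . , d(o, pt)) and that vectors are compared by L∞; the function name is hypothetical.

```python
def mu_estimate(pairs, pivots, d):
    """Estimate the mean distance mu_MT in the feature space induced by
    the pivot set T: each object o is mapped to the vector of its
    distances to the pivots, and the vectors are compared by L-infinity."""
    total = 0.0
    for o1, o2 in pairs:
        # L-infinity distance between the two feature vectors
        total += max(abs(d(o1, p) - d(o2, p)) for p in pivots)
    return total / len(pairs)
```

The estimate only needs l · t distance computations per evaluated pivot set, which is what makes the selection strategies below affordable.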

We apply the incremental selection strategy, which was found in [2] to be the most suitable for real-world metric spaces. It works as follows. At the beginning, a set T = {p1} of one element is chosen from a sample of m database elements, such that the pivot p1 has the maximum µMT value. Then, a second pivot p2 is chosen from another sample of m elements of the database, creating a new set T = {p1, p2} for fixed p1, maximizing µMT. The third pivot p3 is chosen from another sample of m elements of the database, creating another set T = {p1, p2, p3} for fixed p1, p2, maximizing µMT. The process is repeated until t pivots are determined.

In order to verify the capability of the incremental (INC) strategy, we have also tried to select pivots at random (RAN) and with a modified incremental strategy, called the middle search strategy (MID). The MID strategy combines the original INC approach with the following idea: the set of pivots T1 = {p1, p2, . . . , pt} is better than the set T2 = {p′1, p′2, . . . , p′t} if |µT1 − dm| < |µT2 − dm|, where dm represents the average global distance in a given data set and µT is the average distance in the set T. In principle, this strategy supports choices of pivots with high µMT but also tries to keep distances between pairs of pivots close to dm. The hypothesis is that pairs of pivots that are too close or too distant are not suitable for good partitioning of metric spaces.
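A compact sketch of the INC strategy under the same assumptions as above (L∞ feature-space distances estimated over sampled pairs); all names are hypothetical, and the candidate-sample and pair counts m and l are the paper's parameters, not prescribed values.

```python
import random


def incremental_pivots(sample_pool, t, m, l, d, seed=0):
    """Incremental (INC) pivot selection sketch: at each step, draw m
    fresh candidates and keep the one that maximizes the estimated
    mean mu_MT of the feature-space distance distribution."""
    rng = random.Random(seed)

    def mu(pivots, pairs):
        # mean L-infinity distance between pivot-distance vectors
        return sum(max(abs(d(a, p) - d(b, p)) for p in pivots)
                   for a, b in pairs) / len(pairs)

    # a fixed sample of l object pairs is reused for all estimates
    pairs = [(rng.choice(sample_pool), rng.choice(sample_pool))
             for _ in range(l)]
    pivots = []
    for _ in range(t):
        candidates = rng.sample(sample_pool, m)
        best = max(candidates, key=lambda c: mu(pivots + [c], pairs))
        pivots.append(best)
    return pivots
```

The MID variant would simply change the key function, scoring a candidate by |µT − dm| (smaller is better) instead of by µMT alone.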

3.4.2. Elastic bucket implementation. Since a uniform bucket occupation is difficult to guarantee, a bucket consists of a header plus a dynamic list of fixed-size blocks of capacity that is sufficient to accommodate any object from the processed file. The header is characterized by a set of pivots, and it contains the distances to these pivots from all objects stored in the bucket. Furthermore, objects are sorted according to their distances to the first pivot p1 to form a non-decreasing sequence. In order to cope with dynamic data, a block overflow is solved by splitting.

Given a query object q and search radius r, a bucket evaluation proceeds in the following steps:

– ignore all blocks containing objects with distances to p1 outside the interval [d(q, p1) − r, d(q, p1) + r];

– for each remaining block, if L∞(F(oi), F(q)) > r for all oi, then ignore the block;

– access the remaining blocks and, for all oi with L∞(F(oi), F(q)) ≤ r, compute d(q, oi). If d(q, oi) ≤ r, then oi qualifies.

In this way, the number of accessed blocks and the number of necessary distance computations are minimized.
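Assuming F(o) denotes the vector of distances from o to the bucket pivots and blocks keep objects ordered by the distance to p1, the three filtering steps can be sketched as follows (names are illustrative):

```python
def search_bucket(q, r, blocks, pivots, d):
    """Evaluate a range query inside one bucket. Each block is a non-empty
    list of (object, feature) entries, where feature[i] = d(object,
    pivots[i]); blocks are ordered by non-decreasing distance to the first
    pivot. Returns the qualifying objects and the number of exact distance
    computations spent."""
    q_feat = [d(q, p) for p in pivots]
    low, high = q_feat[0] - r, q_feat[0] + r
    result, computations = [], 0
    for block in blocks:
        # step 1: prune blocks via the interval on the first pivot
        if block[-1][1][0] < low or block[0][1][0] > high:
            continue
        # step 2: skip the block if no object passes the L-infinity test
        survivors = [o for o, feat in block
                     if max(abs(f - g) for f, g in zip(feat, q_feat)) <= r]
        if not survivors:
            continue
        # step 3: exact distances only for objects passing the filter
        for o in survivors:
            computations += 1
            if d(q, o) <= r:
                result.append(o)
    return result, computations
```

Steps 1 and 2 use only the pre-computed pivot distances stored in the header, so the expensive metric d is evaluated solely in step 3.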

3.5. Properties of D-Index

On a qualitative level, the most important properties of the D-Index can be summarized as follows:


– An object is typically inserted at the cost of one block access, because objects in buckets are sorted. Since blocks are of fixed length, a block can overflow, and a split of an overloaded block costs another block access. The number of distance computations between the reference (pivot) objects and an inserted object varies from m1 to ∑_{i=1}^{h} mi. For an object inserted on level j, ∑_{i=1}^{j} mi distance computations are needed.

– For all response sets such that the distance to the most dissimilar object does not exceed ρ, the number of bucket accesses is maximally h + 1. The smaller the search radius, the more efficient the search can be.

– For r = 0, i.e. the exact match, a successful search is typically solved with one block access; an unsuccessful search usually does not require any access, though very skewed distance distributions may result in slightly more accesses.

– For search radii greater than ρ, the number of bucket accesses is higher than h + 1, and for very large search radii, the query evaluation can eventually access all buckets.

– The number of accessed blocks in a bucket depends on the search radius and the distance distribution with respect to the first pivot.

– The number of distance computations between the query object and pivots is determined by the highest level of an accessed bucket. Provided the level is j, the number of distance computations is ∑_{i=1}^{j} mi, with the upper bound ∑_{i=1}^{h} mi for any kind of query.

– The number of distance computations between the query object and data objects stored in accessed buckets depends on the search radii and the number of pre-computed distances to pivots.

From the quantitative point of view, we investigate the performance of the D-Index in the next section.

4. Performance evaluation and comparison

We have implemented the D-Index and conducted numerous experiments to verify its properties on the following three metric data sets:

VEC 45-dimensional vectors of image color features compared by the quadratic distance measure respecting correlations between individual colors.

URL sets of URL addresses attended by users during work sessions with the Masaryk University information system. The distance measure used was based on set similarity, defined as the fraction of the cardinalities of intersection and union (Jaccard coefficient).

STR sentences of a Czech national corpus compared by the edit distance measure, which counts the minimum number of insertions, deletions, or substitutions to transform one string into the other.
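For illustration, the URL and STR measures can be written down directly as a minimal sketch (the VEC quadratic form distance is omitted because its color-correlation matrix is not given here):

```python
def jaccard_distance(a, b):
    """URL sets: one minus the Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def edit_distance(s, t):
    """STR sentences: minimum number of insertions, deletions, or
    substitutions (Levenshtein distance), via dynamic programming
    over two rows of the edit matrix."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]
```

Both functions satisfy the metric postulates, which is what the D-Index requires of the distance measure.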

We have chosen these data sets not only to demonstrate the wide range of applicability of the D-Index, but also to show its performance over data with significantly different and realistic data (distance) distributions. For illustration, see figure 5 for the distance densities of all our data sets. Notice the practically normal distribution of VEC, the very discrete distribution of URL, and the skewed distribution of STR.


Figure 5. Distance densities for VEC, URL, and STR.

In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations and block reads. We deliberately ignore the L∞ distance measures used by the D-Index for the pivot-based filtering, because according to our tests, the costs to compute such distances are several orders of magnitude smaller than the costs needed to compute any of the distance functions of our experiments. All presented cost values are averages obtained by executing queries for 50 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors.

4.1. D-Index performance evaluation

In order to test the basic properties of the D-Index, we have considered about 11,000 objects for each of our data sets VEC, URL, and STR. Notice that the number of objects used does not affect the significance of the results, since the cost measures (i.e., the number of block reads and the number of distance computations) are not influenced by the system cache. Moreover, as shown in the following, performance scalability is linear with the data set size. First, we have tested the efficiency of the incremental (INC), the random (RAN), and the middle (MID) strategies to select good reference objects. For each set of the selected reference objects and each specific data set, we have built D-Index organizations of pre-selected structures. In Table 1, we summarize the characteristics of the resulting D-Index organizations, separately for the individual data sets.

We measured the search efficiency for range queries by changing the search radius to retrieve maximally 20% of the data, that is, about 2000 objects. The results, presented in graphs

Table 1. D-Index organization parameters.

Set h No. of buckets No. of blocks Block size ρ

VEC 9 21 2600 1 KB 1200

URL 9 21 228 13 KB 0.225

STR 9 21 289 6 KB 14


Figure 6. Search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 7. Search efficiency in the number of block reads for VEC, URL, and STR.

for all three data sets, are available in figure 6 for the distance computations and in figure 7 for the block reads.

The general conclusion is that the incremental method proved to be the most suitable for choosing reference objects for the D-Index. Except for very few search cases, the method was systematically better and can certainly be recommended. For the sentences, the middle technique was nearly as good as the incremental, and for the vectors, the differences in performance of all three tested methods were the least evident. An interesting observation is that for vectors with a distribution close to uniform, even the randomly selected reference points worked quite well. However, when the distance distribution was significantly different from the uniform, the performance with the randomly selected reference points was poor.

Notice also some differences between the execution costs quantified in block reads and those quantified in distance computations. The general trends are nonetheless similar, and the incremental method is the best. However, the different cost quantities are not equally important for all the data sets. For example, the average time to compute a quadratic form distance between two vectors was much less than 90 µsec, while the average time spent computing a distance between two sentences was about 4.5 msec. This implies that the reduction of block reads for the vectors contributed significantly to the total search time, while for the sentences it was not important at all.


Figure 8. Nearest neighbor search on sentences using different D-Index structures.

Figure 9. Range search on sentences using different D-Index structures.

The objective of the second group of tests was to investigate the relationship between the D-Index structure and the search efficiency. We have tested this hypothesis on the data set STR, using ρ = 1, 14, 25, 60. The results of the experiments for the nearest neighbor search are reported in figure 8, while the performance for range queries is summarized in figure 9. It is quite obvious that a very small ρ can only be suitable when a range search with a small radius is used. However, this was not typical for our data, where the distance to a nearest neighbor was rather high; see in figure 8 the relatively high costs to get the nearest neighbor. This implies that higher values of ρ, such as 14, are preferable. However, there are limits to increasing this value, since the structure with ρ = 25 was only exceptionally better than the one with ρ = 14, and the performance with ρ = 60 was rather poor. The experiments demonstrate that a proper choice of the structure can significantly influence the performance. The selection of optimized parameters is still an open research issue that we plan to investigate in the near future.

In the last group of tests, we analyzed the influence of the block size on system performance. We have experimentally studied the problem on the VEC data set, both for the nearest neighbor and the range search. The results are summarized in figure 10.

In order to obtain an objective comparison, we express the cost as relative block reads, i.e., the percentage of blocks that it was necessary to read to execute a query.


Figure 10. Search efficiency versus the block size.

We can conclude from the experiments that the search with smaller blocks requires reading a smaller fraction of the entire data structure. Notice that the number of distance computations for different block sizes is practically constant and depends only on the query. However, a small block size implies a large number of blocks; this means that in case the access cost to a block is significant (e.g., if blocks are allocated randomly), then large block sizes may become preferable. However, when the blocks are stored in a continuous storage space, small blocks can be preferable, because an optimized strategy for reading a set of disk pages, such as that proposed in [10], can be applied. For a static file, we can easily achieve such a storage allocation of the D-Index through a simple reorganization.

4.2. D-Index performance comparison

We have also compared the performance of the D-Index with other index structures under the same workload. In particular, we considered the M-tree1 [5] and the sequential organization, SEQ, because, according to [4], these are the only types of index structures for metric data that use disk memories to store objects. In order to maximize the objectivity of the comparison, all three types of index structures were implemented using the GIST package [9], and the actual performance is measured in terms of the number of distance comparisons and I/O operations; the basic access unit of disk memory for each data set was the same for all index structures.

The main objective of these experiments was to compare the similarity search efficiency of the D-Index with the other organizations. We have also considered the space efficiency, the costs to build an index, and the scalability to process growing data collections.

4.2.1. Search efficiency for different data sets. We have built the D-Index, M-tree, and SEQ for all the data sets VEC, URL, and STR. The M-tree was built by using the bulk load package and the Min-Max splitting strategy. This approach was found in [5] to be the most efficient way to construct the tree. We have measured the average performance over 50 different query objects, considering numerous similarity range and nearest neighbor queries. The


Figure 11. Comparison of the range search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 12. Comparison of the range search efficiency in the number of block reads for VEC, URL, and STR.

Figure 13. Comparison of the nearest neighbor search efficiency in the number of distance computations for VEC, URL, and STR.

results for the range search are shown in figures 11 and 12, while the performance for the nearest neighbor search is presented in figures 13 and 14.

For all tested queries, i.e. retrieving subsets of up to 20% of the database, the M-tree and the D-Index always needed fewer distance computations than the sequential scan. However, this was not true for the block accesses, where even the sequential scan was typically


Figure 14. Comparison of the nearest neighbor search efficiency in the number of block reads for VEC, URL, and STR.

more efficient than the M-tree. In general, the number of block reads of the M-tree was significantly higher than for the D-Index. Only for the URL sets, when retrieving large subsets, did the number of block reads of the D-Index exceed the SEQ. In this situation, the M-tree accessed nearly three times more blocks than the D-Index or SEQ. On the other hand, the M-tree and the D-Index required many fewer distance computations than the SEQ for this search case.

Figure 12 demonstrates another interesting observation: to run the exact match query, i.e. a range search with rq = 0, the D-Index only needs to access one block. As a comparison, see (in the same figure) the number of block reads for the M-tree: they are one half of the SEQ for the vectors, equal to the SEQ for the URL sets, and even three times more than the SEQ for the sentences. Notice that the exact match search is important when a specific object is to be eliminated: the location of the deleted object forms the main cost. In this respect, the D-Index is able to manage deletions more efficiently than the M-tree. We did not verify this fact by explicit experiments, because the available version of the M-tree does not support deletions.

Concerning the block reads, the D-Index appears to be significantly better than the M-tree, while the comparison on the level of distance computations is not so clear. The D-Index certainly performed much better for the sentences and also for the range search over vectors. However, the nearest neighbor search on vectors needed practically the same number of distance computations. For the URL sets, the M-tree was on average only slightly better than the D-Index.

4.2.2. Search, space, and insertion scalability. To measure the scalability of the D-Index, M-tree, and SEQ, we have used the 45-dimensional color feature histograms. In particular, we have considered collections of vectors ranging from 100,000 to 600,000 elements. For these experiments, the structure of the D-Index was defined by 37 reference objects and 74 buckets, where only the number of blocks in buckets was increasing to deal with the growing files. The results for several typical nearest neighbor queries are reported in figure 15. The execution costs for range queries are reported in figure 16. Except for the SEQ organization, the individual curves are labelled with a number, representing either the number of nearest neighbors or the search radius, and a letter, where D stands for the D-Index and M for the M-tree.


Figure 15. Nearest neighbor search scalability.

Figure 16. Range search scalability.

We can observe that on the level of distance computations the D-Index is usually slightly better than the M-tree, but the differences are not significant; both the D-Index and the M-tree can save a considerable number of distance computations compared to the SEQ. To solve a query, the M-tree needs significantly more block reads than the D-Index, and for some queries, see 2000M in figure 16, this number is even higher than for the SEQ. We have also investigated the performance of range queries with a zero radius, i.e. the exact match. Independently of the data set size, the D-Index required one block access and 18 distance comparisons. This was in sharp contrast with the M-tree, which needed about 6,000 block reads and 20,000 distance computations to find the exact match in the set of 600,000 vectors.

The search scale-up of the D-Index was strictly linear. In this respect, the M-tree was even slightly better, because the execution costs for processing a two times larger file were not two times higher. This sublinear behavior should be attributed to the fact that the M-tree is incrementally reorganizing its structure by splitting blocks and, in this way, improving the clustering of data. On the other hand, the D-Index was using a constant bucket structure, where only the number of blocks was changing. Such an observation is posing another


research challenge, specifically, to build a D-Index structure with a dynamic number of buckets.

The disk space needed to store the data was also increasing linearly in all our organizations. The D-Index needed about 20% more space than the SEQ. In contrast, the M-tree occupied two times more space than the SEQ. In order to insert one object, the D-Index evaluated the distance function 18 times on average. That means that distances to (practically) one half of the 37 reference objects were computed. For this operation, the D-Index performed 1 block read and, due to the dynamic bucket implementation using splits, slightly more than one block write. The insertion into the M-tree was more expensive and required on average 60 distance computations, 12 block reads, and 2 block writes.

5. Conclusions

Recently, metric spaces have become an important paradigm for similarity search, and many index structures supporting the execution of similarity queries have been proposed. However, most of the existing structures are limited to operating in main memory, so they do not scale up to high volumes of data. We have concentrated on the case where the indexed data are stored on disks, and we have proposed a new index structure for similarity range and nearest neighbor queries. Contrary to other index structures, such as the M-tree, the D-Index stores and deletes any object with a one block access cost, so it is particularly suitable for dynamic data environments. Compared to the M-tree, it typically needs fewer distance computations and far fewer disk reads to execute a query. The D-Index is also economical in its space requirements. It needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree. As the experiments confirm, it scales up well to large data volumes.

Our future research will concentrate on developing methodologies that would support the optimal design of D-Index structures for specific applications. The dynamic bucket management is another research challenge. We will also study parallel implementations, which the multilevel hashing structure of the D-Index inherently offers. Finally, we will carefully study two very recent proposals of index structures based on the selection of reference objects [6, 13] and compare the performance of the D-Index with these approaches.

Acknowledgments

We would like to thank the anonymous referees for their comments on the earlier version of the paper.

Note

1. The software is available at http://www-db.deis.unibo.it/research/Mtree/


References

1. T. Bozkaya and Ozsoyoglu, “Indexing large metric spaces for similarity search queries,” ACM TODS, Vol.24, No. 3, pp. 361–404, 1999.

2. B. Bustos, G. Navarro, and E. Chavez, “Pivot selection techniques for proximity searching in metric spaces,”in Proceedings of the XXI Conference of the Chielan Computer Science Society (SCCC01), IEEE CS Press,2001, pp. 33–40.

3. E. Chavez, J. Marroquin, and G. Navarro, “Fixed queries array: A fast and economical data struc-ture for proximity searching,” Multimedia Tools and Applications, Vol. 14, No. 2, pp. 113–135,2001.

4. E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin, “Proximity searching in metric spaces,” ACMComputing Surveys. Vol. 33, No. 3, pp. 273–321, 2001.

5. P. Ciaccia, M. Patella, and P. Zezula, “M-tree: An efficient access method for similarity search in metricspaces,” in Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997, pp. 426–435.

6. R.F.S. Filho, A. Traina, C. Traina Jr., and C. Faloutsos, “Similarity search without tears: The OMNI-familyof all-purpose access methods,” in Proceedings of the 17th ICDE Conference, Heidelberg, Germany, 2001,pp. 623–630.

7. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula, "Separable splits in metric data sets," in Proceedings of the 9th Italian Symposium on Advanced Database Systems, Venice, Italy, June 2001, pp. 45–62, LCM Selecta Group, Milano.

8. C. Gennaro, P. Savino, and P. Zezula, "Similarity search in metric databases through hashing," in Proceedings of ACM Multimedia 2001 Workshops, Ottawa, Canada, Oct. 2001, pp. 1–5.

9. J.M. Hellerstein, J.F. Naughton, and A. Pfeffer, "Generalized search trees for database systems," in Proceedings of the 21st VLDB Conference, 1995, pp. 562–573.

10. B. Seeger, P. Larson, and R. McFayden, "Reading a set of disk pages," in Proceedings of the 19th VLDB Conference, 1993, pp. 592–603.

11. P.N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric spaces," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993, pp. 311–321.

12. P.N. Yianilos, "Excluded middle vantage point forests for nearest neighbor search," Tech. rep., NEC Research Institute, 1999. Presented at the Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, Jan. 15, 1999.

13. C. Yu, B.C. Ooi, K.L. Tan, and H.V. Jagadish, "Indexing the distance: An efficient method to KNN processing," in Proceedings of the 27th VLDB Conference, Roma, Italy, 2001, pp. 421–430.

Vlastislav Dohnal received the B.Sc. and M.Sc. degrees in Computer Science from Masaryk University, Czech Republic, in 1999 and 2000, respectively. Currently, he is a Ph.D. student at the Faculty of Informatics, Masaryk University. His interests include multimedia data indexing, information retrieval, metric spaces, and related areas.

Claudio Gennaro received the Laurea degree in Electronic Engineering from the University of Pisa in 1994 and the Ph.D. degree in Computer and Automation Engineering from Politecnico di Milano in 1999. He is now a researcher at IEI, an institute of the National Research Council (CNR), situated in Pisa. His Ph.D. studies were in the field of performance evaluation of computer systems and parallel applications. His current main research interests are performance evaluation, similarity retrieval, storage structures for multimedia information retrieval, and multimedia document modelling.

Pasquale Savino graduated in Physics at the University of Pisa, Italy, in 1980. From 1983 to 1995 he worked at the Olivetti Research Labs in Pisa; since 1996 he has been a member of the research staff at CNR-IEI in Pisa, working in the area of multimedia information systems. He has participated in and coordinated several EU-funded research projects in the multimedia area, among which MULTOS (Multimedia Office Systems), OSMOSE (Open Standard for Multimedia Optical Storage Environments), HYTEA (HYperText Authoring), M-CUBE (Multiple Media Multiple Communication Workstation), MIMICS (Multiparty Interactive Multimedia Conferencing Services), and HERMES (Foundations of High Performance Multimedia Information Management Systems). Currently, he is coordinating the project ECHO (European Chronicles on Line). He has published scientific papers in many international journals and conferences in the areas of multimedia document retrieval and information retrieval. His current research interests are multimedia information retrieval, multimedia content addressability, and indexing.

Pavel Zezula is Full Professor at the Faculty of Informatics, Masaryk University of Brno. For many years, he has been cooperating with the IEI-CNR Pisa on numerous research projects in the areas of multimedia systems and digital libraries. His research interests concern storage and index structures for non-traditional data types, similarity search, and performance evaluation. Recently, he has focused on index structures and query evaluation of large collections of XML documents.