
[IEEE Computer Society 14th International Conference on Data Engineering, Orlando, FL, USA, 23-27 Feb. 1998] Proceedings of the 14th International Conference on Data Engineering

High Dimensional Similarity Joins:

Algorithms and Performance Evaluation

Nick Koudas and K. C. Sevcik

Department of Computer Science
University of Toronto
Toronto, Ontario, CANADA

Abstract

Current data repositories include a variety of data types, including audio, images and time series. State-of-the-art techniques for indexing such data and doing query processing rely on a transformation of data elements into points in a multidimensional feature space. Indexing and query processing then take place in the feature space.

In this paper, we study algorithms for finding relationships among points in multidimensional feature spaces, specifically algorithms for multidimensional joins. Like joins of conventional relations, correlations between multidimensional feature spaces can offer valuable information about the data sets involved. We present several algorithmic paradigms for solving the multidimensional join problem, and we discuss their features and limitations.

We propose a generalization of the Size Separation Spatial Join algorithm, named Multidimensional Spatial Join (MSJ), to solve the multidimensional join problem. We evaluate MSJ along with several other specific algorithms, comparing their performance for various dimensionalities on both real and synthetic multidimensional data sets. Our experimental results indicate that MSJ, which is based on space filling curves, consistently yields good performance across a wide range of dimensionalities.

1 Introduction

Analysis of large bodies of data has become a critical activity in many different contexts. The data types include audio, images and time series, as well as mixtures of these. A useful and increasingly common way of carrying out this analysis is by using characteristics of data items to associate them with points in a multidimensional feature space, so that indexing and query processing can be carried out in the feature space.

Each feature vector consists of d values, which can be interpreted as coordinates in a d-dimensional space, plus some associated content data. Application-dependent methods are provided by domain experts to extract feature vectors from data elements and map them into d-dimensional space. Moreover, domain experts supply the measure of "similarity" of two entities based on their feature vectors. An important query in this context is the "similarity" query, which seeks all points "close" to a specified point in the multidimensional feature space. An additional query of interest is a generalization of the relational join, specifically the multidimensional (or similarity) join query, which reports all pairs of multidimensional points that are "close" (similar) to each other, as measured by some function of their feature value sets.

In a multimedia database that stores collections of images, a multidimensional join query can report pairs of images with similar content, color, texture, etc. The multidimensional join query is useful for visual exploration as well as for data mining. In a database of stock price information, a multidimensional join query will report all pairs of stocks that are similar to each other with respect to their price movement over a period of time. We evaluate several algorithms for performing joins on high dimensionality data sets.

In section 2, we formalize the problem at hand. In section 3, we survey and discuss several algorithms for computing multidimensional joins. Section 4 introduces a new algorithm to compute multidimensional joins, named Multidimensional Spatial Join (MSJ). Section 5 compares the performance of some of the algorithms experimentally, using both synthetic and real data sets. Finally, section 6 contains conclusions and a discussion of future work.

2 Problem Statement

We are given two data sets, A and B, containing d-dimensional points of the form (x_1, x_2, ..., x_d) and (y_1, y_2, ..., y_d) respectively. We assume that the ranges of all attributes are normalized to the unit interval, [0,1], so that 0 <= x_i <= 1 and 0 <= y_i <= 1 for i = 1, ..., d. Given a distance ε, a d-dimensional join of A and B contains all pairs of entries (x, y), x in A, y in B, such that the distance between them, D_d, satisfies

    D_d = ( \sum_{i=1}^{d} |x_i - y_i|^p )^{1/p} <= ε    (1)

D_d is referred to as the Manhattan distance for p = 1, and the Euclidean distance for p = 2.

Assuming that "good" mapping functions are chosen, objects that are "similar" in the view of the


domain experts will map to points that are close in the multidimensional space. The d-dimensional join will report (exactly once) each pair of d-dimensional points, x in A and y in B, that are within distance ε of each other according to the chosen distance function. Our goal is to identify efficient algorithms for computing d-dimensional joins for data sets that are much too large to fit in main memory.

The multidimensional join problem as we define it has a worst case complexity of O(n^2). Consider two d-dimensional point sets A and B, of cardinality n each, such that every point of B is within distance ε of every point of A for some value of ε. In this case, the number of output tuples produced is n^2. Under our definition of the problem, this could happen for any ε, provided that both data sets are clustered in the same portion of the multidimensional space. In addition, even when no clustering exists and the points are uniformly distributed in the multidimensional space, for large values of ε, the computational work and output size are O(n^2).
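The join predicate of equation (1) can be written down directly. The following is a minimal illustrative sketch (the function names and the brute-force pairing are ours, not from the paper):

```python
import itertools

def lp_dist(x, y, p=2):
    """The distance D_d of equation (1): the L_p distance between two points."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def naive_join(A, B, eps, p=2):
    """Report every pair (x, y), x in A, y in B, with D_d(x, y) <= eps.
    This is the O(n^2) worst case discussed above."""
    return [(x, y) for x, y in itertools.product(A, B) if lp_dist(x, y, p) <= eps]
```

For example, `naive_join([(0.0, 0.0)], [(0.3, 0.4)], eps=0.51)` reports the single pair, since the Euclidean distance between the two points is 0.5.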

3 Survey of Various Algorithmic Approaches

In this section, we discuss and analyze several algorithms for computing multidimensional joins. We seek algorithms whose efficiency remains high as dimensionality increases. Moreover, since the worst case complexity of the multidimensional join problem is O(n^2), we are interested in identifying instances of the problem that can be solved faster, and in comparing the algorithms discussed based on their ability to exploit those instances.

Algorithms for the multidimensional join problem can be separated into two categories. The first category includes algorithms that treat data sets for which no indices are available. The second category includes algorithms that utilize preconstructed indices to solve the multidimensional join problem.

We describe and analyze four algorithmic approaches from the first category that can provide a solution to the multidimensional join problem: Brute Force, Divide and Conquer, Replication, and Space Filling Curves. All four approaches use the following technique to identify points of interest within distance ε/2 of a given point x. Each point x can be viewed as the center of a hypercube of side ε. We refer to the hypercube as the approximation of the multidimensional point. The distance of x from all the points in the hypercube is computed (based on some distance metric), and only the points within distance ε of one another are reported. For the rest of this paper, we assume Euclidean distances, but any other distance metric can be applied instead without affecting the operation of the algorithms.
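The hypercube approximation acts as a filter before the exact distance test: two side-ε hypercubes centered at x and y intersect exactly when the centers differ by at most ε in every coordinate, which is necessary but not sufficient for the Euclidean distance to be at most ε. A minimal sketch of this filter-and-refine pattern (our own helper names):

```python
def cubes_overlap(x, y, eps):
    """Side-eps hypercubes centered at x and y intersect iff the centers
    are within eps in every coordinate (the filter step)."""
    return all(abs(xi - yi) <= eps for xi, yi in zip(x, y))

def euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def filter_and_refine(A, B, eps):
    """Filter with the cheap cube test, then refine with the exact distance."""
    return [(x, y) for x in A for y in B
            if cubes_overlap(x, y, eps) and euclidean(x, y) <= eps]
```

Note that the pair ((0, 0), (0.5, 0.5)) passes the cube test for ε = 0.5 but fails refinement, since its Euclidean distance is about 0.707.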

Although the problem of searching and indexing in more than one dimension has been studied extensively, no indexing structure is known that retains its indexing efficiency as dimensionality increases. A wide range of indexing structures have been proposed for the two dimensional indexing problem [Sam90]. Although conceptually most of these structures generalize to multiple dimensions, in practice their indexing efficiency degenerates rapidly as dimensionality increases. A recent experimental study by Berchtold et al. [BKK96] showed that, using the X-tree, a multidimensional indexing structure based on R-trees [Gut84], several multidimensional queries degenerate to linear search as dimensionality increases.

Figure 1: R*-tree construction cost as dimensionality increases (construction time in seconds versus dimension)

An algorithm was proposed by Brinkhoff et al. [BKS93] for the two dimensional spatial join problem using R*-trees [BKSS90]. Since the R-tree family is a popular family of indexing structures, we extended the algorithm of Brinkhoff et al. to multiple dimensions, and we report on its performance in subsequent sections. We believe that the performance trends we report for the join of multidimensional R*-trees are representative of the join performance of other structures based on the R-tree concepts.

Figure 1 presents the time to construct an R*-tree index for 100,000 multidimensional points at various dimensionalities, using multiple insertions. The construction time is measured on a 133MHz processor (including the time to write filled pages to disk). The cost increases approximately linearly with dimensionality, since the work the algorithm performs per point increases as more dimensions are added. Notice that bulk loading of the index requires application of a multidimensional clustering technique, which has high cost as well. Figure 1 suggests that, for an on-line solution to the multidimensional join problem, building indices on the fly for non-indexed data sets and using algorithms from the second category to perform the join might not be viable at high dimensionalities, due to the prohibitive index construction times.

3.1 Algorithms That Do Not Use Indices

3.1.1 Brute Force Approach

Main Memory Case: If the data sets are small enough to fit in main memory together, both can be read into memory and the distance predicate can be evaluated on all pairs of data elements. Assuming A and B are two multidimensional data sets containing n_A and n_B points respectively, the total cost of this process


will be n_A × n_B predicate evaluations. The cost of each predicate evaluation increases linearly with the dimensionality of the data points. A faster approach to the predicate evaluation step is to use a generalization of the Plane Sweep technique to multiple dimensions [PS85]. This makes it possible to reduce the number of distance computations by evaluating the predicate only between pairs of multidimensional points whose corresponding hypercubes intersect. The complexity of a d-dimensional sweep involving O(n) points, reporting k pairs of overlapping objects, is O(n log^{d-1} n + k) [Mel91]. Note that if two hypercubes of side ε overlap, the points at their centers are not necessarily within distance ε of each other. Although the algorithm works well on average, in the worst case all pairs of distance computations have to be evaluated, at a total cost of n_A × n_B predicate evaluations plus the overhead of the multidimensional sweep.

Nested Loops (NL): When the data sets cannot both fit in main memory, nested loops is the simplest algorithm to apply. Assuming a buffer space of M pages, the total IO cost in page accesses of the join using nested loops will be approximately:

    |A| + ( |A| / (M - 1) ) × |B|    (2)

Each multidimensional point is approximated with a hypercube, and point pairs with intersecting hypercubes are tested for proximity in main memory using a multidimensional sweep. Nested loops can always be applied between two data sets containing O(n) points, but it is an O(n^2) algorithm. The performance of the nested loops algorithm is independent of data distribution, being equally costly for all data distributions. In the relational domain, merge sort joins and hash joins have been shown to lead to less costly solutions than nested loops under reasonable statistical assumptions. We investigate analogous alternatives in the multidimensional case.
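The cost formula (2) can be evaluated directly. The sketch below (our own function, with the fraction rounded up to whole scans of B) estimates the page accesses of block nested loops: A is read once, and B is rescanned once per memory-load of M-1 pages of A.

```python
import math

def nested_loops_io(pages_A, pages_B, M):
    """Approximate IO cost in page accesses from equation (2):
    |A| + ceil(|A| / (M - 1)) * |B|, with M buffer pages available."""
    return pages_A + math.ceil(pages_A / (M - 1)) * pages_B
```

For example, joining a 100-page file A with a 200-page file B using 11 buffer pages costs roughly 100 + 10 × 200 = 2100 page accesses.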

3.1.2 Divide and Conquer

In this section we examine two algorithms that are based on the "divide and conquer" algorithmic paradigm. The first is an application of divide and conquer in multiple dimensions, and the second is a recently proposed indexing structure for the multidimensional join problem.

Multidimensional Divide and Conquer Approach (MDC): Multidimensional Divide and Conquer (MDC) is an algorithmic paradigm introduced by Bentley [Ben80] that can be directly applied to the problem at hand. To solve a problem in a multidimensional space, the underlying idea behind the MDC paradigm is to recursively divide the space one dimension at a time, and solve the problem in each resulting subspace. Once the problem is solved in all sibling subspaces, the solutions of the subspaces are combined in a way specific to the problem under consideration.

Figure 2: The MDC algorithm for (a) one and (b) two dimensions

Consider the one dimensional case (d = 1). Given two sets of n points on a line, we are to report all pairs of points, one from each set, within distance ε of each other. We can do this by sorting both sets (an O(n log n) operation) and performing a scan of both files, treating portions of each file corresponding to a range of attribute values of width 2ε. As illustrated in figure 2a, both data sets are sorted in increasing order of the coordinate. By keeping in memory all elements with values in the range 0 to ε or ε to 2ε from both files, we have all the points necessary to correctly report the points in the 0 to ε range that are part of some joining pair. No more points are necessary, since any point that joins with a point in the range 0 to ε must be within distance 2ε of the left side of the 0 to ε range. Once we are done with the 0 to ε range, we can discard the corresponding partitions from the buffer pool and read the next range, 2ε to 3ε, to finish the processing of the ε to 2ε range, and so on. Corresponding ranges in both files can be processed via the plane sweep algorithm. Figure 2b illustrates the two dimensional version of the algorithm. Generalizing this approach to d-dimensional spaces for data sets involving O(n) multidimensional points gives an O(n log^d n) algorithm [Ben80]. Although it is conceptually appealing, the application of multidimensional divide and conquer to the multidimensional join problem leads to several difficulties in practice. In the general case, the statistical characteristics of the two multidimensional data sets will differ. As a result, partitioning according to the median of a dimension in the first data set might create highly unbalanced partitions for the second. Balanced partitions are necessary in order to attain the complexity of O(n log^d n) for a problem involving n d-dimensional points. An additional problem is that the constant in the complexity expression is large: for a d-dimensional space, after partitioning according to d-1 dimensions we create 2^{d-1} partitions. Each of these partitions has to be compared against all 2^{d-1} partitions of the joining space. A further complication is that, in the worst case, the memory space needed for output buffering while partitioning is exponential in the number of dimensions.


Multidimensional divide and conquer creates 2^d partitions of the space and thus needs 2^d output buffers during the partitioning phase. In summary, we expect that such an approach might be suitable for low dimensionalities and for data sets with similar statistical characteristics, but it is not promising as a general solution to the multidimensional join problem.
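The one dimensional base case described above can be sketched compactly. The version below is a simplification of the windowed file scan in the text: instead of buffering 2ε-wide ranges of both files, it binary-searches the sorted second set for each point of the first (our own simplification, for in-memory data):

```python
from bisect import bisect_left, bisect_right

def one_dim_join(A, B, eps):
    """1-D epsilon-join: sort both sets, then for each a in A report every
    b in B inside [a - eps, a + eps], found by binary search on sorted B."""
    B = sorted(B)
    out = []
    for a in sorted(A):
        lo = bisect_left(B, a - eps)
        hi = bisect_right(B, a + eps)
        out.extend((a, b) for b in B[lo:hi])
    return out
```

The sort-then-scan structure is the same as in the MDC base case; only the buffering discipline differs.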

The ε-KDB tree: A new indexing structure for the multidimensional join problem was proposed recently by Shim et al. [SSA97]. The ε-KDB tree is intended to speed up the computation of hypercube intersections in main memory.

Given two multidimensional data sets and a distance ε, the algorithm proceeds by choosing a dimension and sorting the data sets on this dimension. If both data sets are already sorted on a common dimension, no sorting is necessary. The algorithm then reads into main memory the partitions corresponding to intervals of size 2ε along the sort dimension of both files, building the ε-KDB structure on them. The structure is a variant of the KDB tree [Rob81]. It offers a space decomposition scheme that facilitates tree matching, since the boundaries of space partitions are canonical. That way, assuming both files have been sorted on a particular dimension, the algorithm can compute the join in time linear in the size of the input data sets by scanning the sorted data sets. In order for the time to be linear, however, the sum of the portions of both A and B in each 2ε range along the chosen dimension must fit in main memory. If this is not the case, several problems arise. As new dimensions are introduced on which to perform partitioning, the algorithm must issue a complex schedule of non-sequential page reads from disk. At each step, the algorithm has to keep neighboring partitions in main memory at the same time. The number of neighboring partitions is exponential in the number of dimensions used for partitioning. Assuming that k dimensions are used for partitioning, in the worst case we have to keep 2^k partitions from each file being joined in memory at all times. Since the pages holding the partitions are stored sequentially on disk, only two neighboring partitions can be stored adjacent to a given partition. The rest of the partitions relevant to each step have to be retrieved by scheduling non-sequential IOs.

3.1.3 Replication Approach (REPL)

The replication approach to the multidimensional join problem involves replicating entities, thus causing file sizes to grow. Algorithms based on the replication approach for the two dimensional problem have been proposed by Patel and DeWitt [PD96] and by Lo and Ravishankar [LR96]. Here, we explore possible generalizations of these algorithms to higher dimensions. The underlying idea of these algorithms is to divide the two dimensional space into a number of partitions and then join corresponding partition pairs. It is possible that the size of a partition pair exceeds the main memory size and, as a result, the pair

Given two d-dimensional data sets, A and B, and a distance ε:

- Select the number of partitions.

- For each data set:

  1. Scan the data set, associating each multidimensional point with the d-dimensional hypercube of side ε for which the point is the center.

  2. For each hypercube, determine all the partitions to which the hypercube belongs and record the d-dimensional point in each such partition.

- Join all pairs of corresponding partitions using multidimensional sweep, repartitioning where necessary.

- Sort the matching pairs and eliminate duplicates.

Figure 3: The REPL Algorithm

must be partitioned more finely. During a partitioning phase, any entity that crosses partition boundaries is replicated in all partitions with which it intersects: each hypercube that crosses a boundary is replicated in each partition it intersects. This process is repeated for the second data set as well. Once the partitions for both files have been populated, and repartitioning has been applied where necessary to make partition pairs fit in main memory, we proceed to join corresponding partition pairs. This strategy correctly computes the join, because all possible joining pairs occur in corresponding partitions. Points located in corresponding partitions form output tuples if they are found to be within distance ε of one another. The algorithm as described generalizes directly to higher dimensions, as shown in figure 3. We form a d-dimensional hypercube of side ε around each point in both multidimensional spaces and proceed in the same way. An analysis of the replication induced by REPL as dimensionality increases is available elsewhere [KS97a].
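The replication step of REPL amounts to finding every partition a side-ε hypercube intersects. A minimal sketch, assuming a uniform grid of k cells per dimension over the unit space (the grid shape and function name are our own illustration, not the paper's partitioning scheme):

```python
import itertools

def cells_for_point(x, eps, k):
    """Grid cells (k per dimension over [0,1]) intersected by the side-eps
    hypercube centered at x; REPL records the point in each such cell."""
    ranges = []
    for xi in x:
        lo = max(0, int((xi - eps / 2) * k))
        hi = min(k - 1, int((xi + eps / 2) * k))
        ranges.append(range(lo, hi + 1))
    return list(itertools.product(*ranges))
```

A point at the center of a 2×2 grid is replicated into all four cells for ε = 0.2, illustrating how replication grows with density of boundary crossings and with dimensionality.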

There are two major drawbacks to approaches that introduce replication. First, the appropriate degree of partitioning of the data space is very difficult to choose unless precise statistical knowledge of the multidimensional data sets is available. Although such knowledge might be obtainable for static multidimensional data sets, it is difficult and costly to obtain for dynamic data sets. Second, when points are relatively dense or ε is large, the amount of replication that takes place can be very large, and it grows further as dimensionality increases.

3.1.4 Space Filling Curves Approach

In this subsection, we explore an algorithm that uses space filling curves to solve the multidimensional join


problem.

Orenstein's Algorithm (ZC): Orenstein proposed an algorithm, which we call ZC, to perform joins of multidimensional objects [Ore91]. Starting with multidimensional objects that are approximated by their minimum bounding hypercubes, the hypercubes are tested for intersections. For each pair of intersecting hypercubes, an object intersection test is performed. The algorithm is based on z-curves and their properties. Z-curves reflect a disjoint decomposition of the space, and ZC relies on the following property of z-curves to detect intersections: two approximations of multidimensional objects intersect if and only if the z-value of one is a prefix of the z-value of the other.

The algorithm imposes a recursive binary splitting of the space up to a specific granularity. Each approximated entity is placed in a space division that fully encloses it. Orenstein [Ore89] presents an analysis of the implications of this decomposition scheme on range query performance and, in subsequent work [Ore91], presents the performance of the multidimensional join algorithm.

This algorithm can be applied to the multidimensional join problem that we address in this paper. In our context, each multidimensional point is approximated with a d-dimensional hypercube of side ε. For each multidimensional point, the z-curve value (ZV) at dimensionality d is computed. As dimensionality increases, the processor time to compute the ZV, as well as the number of bits required to store it, increases. We assume that, in a preprocessing step, the ZV associated with each point is computed to some specified precision.

Each data set is scanned, and the ZV of each hypercube is transformed to the ZV of the space partition that contains it. The transformation involves setting to zero a number of least significant bits of the ZV, depending on the space partition that contains the hypercube. The ZV is a variable length bit string; shorter bit strings correspond to larger space partitions in the recursive binary decomposition of the space. Both data sets are then sorted into non-decreasing order of ZV values. The algorithm then merges the two data sets using a stack per data set. At each step, the smaller ZV is selected and processed by comparing it to the ZV at the top of the stack. A detailed description of Orenstein's algorithm is available elsewhere [KS97a].
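The two ingredients of ZC can be sketched briefly: computing a z-value by interleaving coordinate bits, and testing containment of one space partition in another via the prefix property. This is an illustrative sketch under our own simplifying assumptions (coordinates in [0, 1), fixed precision, most significant bit first), not Orenstein's implementation:

```python
def z_value(coords, bits):
    """Interleave the top `bits` bits of each unit-interval coordinate to
    form a z-curve value, most significant bit first."""
    ints = [int(c * (1 << bits)) for c in coords]
    z = 0
    for b in range(bits - 1, -1, -1):
        for v in ints:
            z = (z << 1) | ((v >> b) & 1)
    return z

def region_contains(z_region, region_bits, z_point, point_bits):
    """Prefix test: the region's bit string is a prefix of the point's,
    i.e. the point's partition lies inside the region's partition."""
    return (z_point >> (point_bits - region_bits)) == z_region
```

In two dimensions with one bit per coordinate, the upper-right quadrant has z-value 0b11; any finer partition inside it carries 0b11 as a prefix of its z-value.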

3.2 Algorithms That Use Preconstructed Indices

The best known spatial join algorithm for R-trees is the one proposed by Brinkhoff et al. [BKS93]. We have extended it to apply to multidimensional point sets indexed with R*-trees [BKSS90]. The R*-tree join algorithm is based on an index sweeping process. When the indices have the same height, the algorithm proceeds top-down, sweeping index blocks at the same level. At a specific level, the pairs of overlapping descriptors are identified and, at the same time, the hyperrectangles of their intersections are computed. This information is used to guide the search in the lower levels, since descriptors not overlapping the hyperrectangle of intersection of their parents need not be considered for the join. The algorithm uses a greedy buffer pinning technique to keep relevant blocks in the buffer in order to minimize block re-reads. When the indices do not have the same height, the algorithm proceeds as described above up to a certain point and then degenerates into a series of range queries.

The multidimensional R*-tree join algorithm as described can perform the multidimensional similarity join, given a distance ε, as follows: all MBRs of index pages and data pages, as created by the insertion of the multidimensional points, are extended by ε/2 in each dimension. The extension is necessary to ensure that we do not miss possible joining pairs. The extended MBRs of index pages, as well as the data points, are joined using multidimensional sweep.

3.3 Discussion

We have presented two categories of algorithms that can be used to solve the multidimensional join problem. In this paper, we do not include Divide and Conquer algorithms in our experiments, due to their known worst case memory and IO requirements. Although MDC yields an efficient solution for low dimensionalities, it is inapplicable at higher dimensionalities since, in the worst case, it requires a buffer pool size that is exponential in the dimensionality. Similarly, the ε-KDB approach yields very efficient solutions for certain data distributions, but the algorithm's worst case memory requirement and IO complexity are prohibitive for data sets on which partitioning on more than one dimension must be imposed.

In the next section, we introduce an algorithm, called Multidimensional Spatial Join (MSJ), for the multidimensional join problem. MSJ can use any number of dimensions to decompose the space without affecting its IO cost.

4 Multidimensional Spatial Join (MSJ)

To perform the join of two multidimensional data sets, A and B, we may also use a generalization of the Size Separation Spatial Join algorithm (S3J) [KS97b]. The S3J algorithm makes use of space filling curves to order the points in a multidimensional space. We assume that the Hilbert value of each multidimensional point is computed at dimensionality d to dL bits of precision, where L is the maximum number of levels of size separation. We consider the Hilbert value computation a preprocessing step of this algorithm. For two d-dimensional data sets, A and B, and a given distance ε, we impose a dynamic hierarchical decomposition of the space into level files. We scan each data set and place each multidimensional point (x_1, x_2, ..., x_d) in a level file l, determined by

    l = min_{1 <= i <= d} ncb( x_i - ε/2, x_i + ε/2 )    (3)

where ncb(b1, b2) denotes the number of initial common bits in the bit sequences b1 and b2. This corresponds

to the placement of the approximated multidimensional point in the smallest subpartition of the multidimensional space that fully encloses it. The Hilbert value, H, of each multidimensional point is transformed to the maximum Hilbert value of the space partition that encloses it at level l. This transformation can be achieved by setting to one the (L - l)d least significant bits of H. Each level file is then sorted into non-decreasing order of Hilbert values.

Figure 4: The multilevel merge phase of MSJ

The decomposition of the multidimensional space achieved this way provides a flexible way to perform the multidimensional join [KS97b]. Each subpartition of a level file has to be matched against the corresponding subpartitions at the same level and at each higher level file of the other data set. That way, in the worst case, we need to keep in memory as many subpartitions for each data set as there are level files. Figure 5 presents the algorithm. Both data sets are scanned and partitioned into level files; at the same time, the Hilbert value transformation takes place. All level files are sorted on the Hilbert value. Finally, a multi-way merge of the level files takes place.
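The Hilbert-value transformation described above is a single bit-masking step. A minimal sketch (our own function name; the Hilbert value H itself is assumed to have been computed to dL bits in preprocessing, as stated in the text):

```python
def level_transform(h, level, L, d):
    """Map a Hilbert value h, computed to d*L bits, onto the maximum Hilbert
    value of its level-`level` partition by setting the (L - level)*d least
    significant bits to one."""
    low_bits = (L - level) * d
    return h | ((1 << low_bits) - 1)
```

For example, with L = 3 levels and d = 2 dimensions, a point at level 1 has its 4 least significant bits forced to one, so every point in the same level-1 partition maps to the same (maximum) value and sorts together.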

Figure 4 illustrates the merge phase of the algorithm. Two files (F1 and F2) have been partitioned into two level files each. At first, the algorithm issues a read of one partition from each level file into main memory; thus partitions P1, P2, P3, P4 will be read. The minimum starting Hilbert value over all partitions (Hms), as well as the minimum ending value (Hme), is computed. Corresponding entries between [Hms, Hme] can be processed in main memory. Partitions that are entirely processed (P3 in figure 4) are dropped from the buffer pool, and Hme is updated. Processing continues by replacing processed partitions from the corresponding level files (reading in P7 in figure 4) and advancing Hme as needed, until all level files are processed.

In separating the points in each data set into level files, we may use any subset of the dimensions. The number of dimensions used to separate the input data sets into level files affects the occupancy of each level file. In a d-dimensional space, for level file l, there are 2^{ld} space partitions. Each non-empty partition will have to be memory resident at some point in the algorithm's execution. Using k dimensions (k <= d) to

Given two d-dimensional data sets, A and B, and a distance predicate ε:

- For each data set:

  1. Scan the data set and partition it into level files, transforming the Hilbert value of each hypercube based on the level file to which it belongs.

  2. Sort the level files into non-decreasing order of Hilbert values.

- Perform a multi-way merge of all level files.

Figure 5: The MSJ Algorithm

perform the separation is expected to yield lower space partition occupancy per level file than using k-1. This is because adding one more dimension adds 2^l partitioning planes at level l, which can force some objects to higher level files (smaller l). Balanced occupancy of space partitions across levels is desirable. Although in theory an artificial data set can be constructed such that, for a specific value of ε, the entire data space falls inside one space partition, the more dimensions we use for partitioning, the less likely this becomes. As equation 3 indicates, the computation of the level to which a point belongs involves a number of bitwise operations linear in the number of dimensions. All the dimensions can be used for the level computation without significant processor cost.
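The level computation of equation (3) can be sketched as follows. This is an illustrative reading of the equation under our own assumptions (coordinates quantized to a fixed number of bits; ncb compares bit strings most significant bit first), not the paper's implementation:

```python
def ncb(b1, b2, bits):
    """Number of initial common bits of two `bits`-bit values, most
    significant bit first (the ncb of equation (3))."""
    for i in range(bits - 1, -1, -1):
        if (b1 >> i) & 1 != (b2 >> i) & 1:
            return bits - 1 - i
    return bits

def level_of(point, eps, bits):
    """Level file for a point, per equation (3): the minimum over dimensions
    of ncb applied to the quantized endpoints x_i - eps/2 and x_i + eps/2."""
    q = lambda v: min((1 << bits) - 1, max(0, int(v * (1 << bits))))
    return min(ncb(q(xi - eps / 2), q(xi + eps / 2), bits) for xi in point)
```

A point whose side-ε interval straddles the midpoint of the space in some dimension lands in level 0, the coarsest partition, matching the size-separation idea.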

5 Experimental Evaluation

In this section, we present an experimental evaluation of the performance of MSJ relative to some of the algorithms described in the previous sections for joining multidimensional data sets.

5.1 Description of Data Sets

For our assessment of the performance of the multidimensional join algorithms, we used both synthetic and real data sets of various dimensionalities. Since the size of each record grows with the number of attributes (dimensions), the overall file size for a fixed number of points increases with the number of dimensions. We chose to keep the number of multidimensional points constant across spaces of different dimensions. An alternative would be to keep the total file size constant by reducing the total number of points as dimensionality increases. However, this would create very sparsely populated multidimensional spaces, and the performance of multidimensional joins for increasing values of ε would be difficult to assess unless very large file sizes were used.

Table 1 presents the data set sizes, in terms of total number of points and total file sizes in bytes, at the dimensionalities we experiment with. We keep the buffer pool size constant (2MB) for all experiments.

We perform two series of experiments, involving synthetic and real data sets. Additional experimental results are given elsewhere [KS97a]. For each series of experiments, we report two sets of results. In one, we keep ε constant and increase the dimensionality of the data set. In the other, we keep the dimensionality of the data set constant and increase the value of ε.

Dimension   D1 (50,000 points)   D2 (84,640 points)
3/4         1.6 MB               3.24 MB
8           3.5 MB               5.9 MB
12          5 MB                 8.48 MB
20          8.14 MB              13.78 MB

Table 1: Characteristics of Data Sets and Sizes as Dimensionality Increases

It is highly likely that real multidimensional data sets will contain clusters of multidimensional points, corresponding to groups of entities with similar characteristics. For this reason, in the first series of experiments, we generated multidimensional data sets containing clusters of multidimensional points, and we evaluated the performance of the algorithms using the resulting data sets, which have characteristics D1. The clusters were generated by initializing kernel points in the data space and distributing points in uniformly random directions around the kernel points, at distances selected from an exponential distribution with mean 0.5. Points outside the unit hypercube were clipped.
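The generator just described can be sketched as follows. This is a minimal illustration under our own conventions (function name, parameters, and seeding are assumptions); it draws a uniform direction by normalizing a Gaussian vector, an exponential distance with mean 0.5, and clamps coordinates to the unit hypercube.

```python
import math
import random

def clustered_points(n, dim, n_clusters, mean_dist=0.5, seed=42):
    """Generate n points grouped around randomly placed kernel points
    (a sketch of the generator described in the text)."""
    rng = random.Random(seed)
    kernels = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    points = []
    for _ in range(n):
        k = rng.choice(kernels)
        # uniform direction on the sphere: normalize standard normals
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        r = rng.expovariate(1.0 / mean_dist)  # exponential, mean 0.5
        # clip ("clamp") each coordinate to the unit hypercube
        p = [min(1.0, max(0.0, kc + r * vc / norm)) for kc, vc in zip(k, v)]
        points.append(p)
    return points
```

We read "clipped" as clamping coordinates to [0, 1]; discarding out-of-range points would be an equally plausible reading.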

The second series of experiments involved actual stock market price information collected for 501 companies. We applied a Discrete Fourier Transform (as suggested by Faloutsos et al. [FRM94]) to transform the time series information into points in a multidimensional space. Using a period of ten days, we extracted several time series from the sequence of prices for each specific stock, obtaining 84,640 multidimensional points. The resulting data set had characteristics D2.
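The transform step can be sketched as follows, in the spirit of [FRM94]: each ten-day window of a price series is mapped to a point whose coordinates are the real and imaginary parts of its first DFT coefficients. The function name, normalization, and truncation convention are our assumptions; the paper does not spell these out.

```python
import numpy as np

def series_to_points(prices, window=10, n_coeffs=None):
    """Map each length-`window` subsequence of a price series to a
    multidimensional feature point via the DFT (a sketch)."""
    points = []
    for i in range(len(prices) - window + 1):
        coeffs = np.fft.rfft(prices[i:i + window])
        # interleave real and imaginary parts to get real-valued features
        feats = np.empty(2 * len(coeffs))
        feats[0::2], feats[1::2] = coeffs.real, coeffs.imag
        points.append(feats if n_coeffs is None else feats[:n_coeffs])
    return np.array(points)
```

Because most of a smooth series' energy concentrates in the low frequencies, truncating to the first few coefficients gives the low-dimensional points the experiments need while preserving distances approximately.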

5.2 Experimental Results

5.2.1 Experiments with Algorithms Not Based on Preconstructed Indices

Our experiments are summarized in Table 2. The results of these experiments are presented in Figures 7 and 8, respectively.

Although the IO behavior of MSJ and the ZC algorithm is the same, there are additional processor costs for the ZC algorithm. Figure 6(a) presents the portions of time spent in the various phases of the algorithms. The main difference between MSJ and the ZC algorithm is that the sweep process in main memory is data driven for MSJ but partition driven for ZC. ZC relies on the prefix property of the z-curve to perform the join: candidates have to be generated from the stack each time the prefix property of the curve is violated. Violation of the prefix property takes place each time the curve crosses boundaries between different space partitions. Since partitions are seldom full, and thus are collapsed together in physical pages, this leads to a large amount of data movement into and out of the stacks, as well as plane sweep operations, which constitute an additional processing cost for the algorithm, as is evident from Figure 6(a). Moreover, ZC requires data structure manipulations on the stacks and prefix evaluations for each multidimensional point of the data sets being joined.
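The prefix property the ZC algorithm relies on can be made concrete with a small sketch (ours, not the paper's code): a point's z-value interleaves its coordinate bits, and the key of a level-l space partition is a (dim * l)-bit prefix of the z-values of every point inside it, so a prefix test decides containment.

```python
def z_value(coords, bits=8):
    """Interleave the bits of integer coordinates into a z-curve key."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for x in coords:
            z = (z << 1) | ((x >> b) & 1)
    return z

def contains(region, point_z, bits, dim):
    """Prefix test: does the space partition `region` = (prefix, level)
    contain the point with z-key `point_z`? (argument conventions are
    our own illustration)"""
    prefix, level = region
    return (point_z >> (dim * (bits - level))) == prefix
```

When the curve leaves a region, the prefix test fails and the region's entries can be popped from the stack, which is exactly the step that triggers the candidate generation discussed above.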

For REPL, the amount of replication during the partitioning phase increases with dimensionality, and this increases both processor and IO cost. Processor cost is higher since, by introducing replication, more points are swept. In addition, a duplicate elimination phase has to take place at the end of the algorithm, and this involves a sort of the result pairs. Finally, the response time of nested loops increases with dimensionality, since relatively less buffer space is available to the operation. Figure 7 presents the response time of the algorithms for experiment 1, which involves two data sets containing points that are clustered. As dimensionality increases, the response time of MSJ increases due to increased sorting cost, since the buffer space available to the sort holds smaller and smaller fractions of the data sets. The processor cost increases only slightly with dimensionality, since the size of the join result does not change much. At low dimensionality, the size of the join result is a little larger than the size of the input data sets, and it decreases to become equal to the size of the input data sets as dimensionality increases. At higher dimensions, a hypersphere of fixed radius inscribes a lower percentage of the total space, and the probability for a point to match with anything more than itself drops rapidly. Figure 7(b) presents the response time of the algorithms for increasing values of ε at dimensionality d = 8. For all the algorithms, response time increases rapidly with ε. Due to clustering, the increase in the size of the join result is large, and, as a result, the processor time needed to compute the candidate and actual pairs is high.
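The shrinking-hypersphere effect mentioned above can be checked directly from the standard volume formula for a d-dimensional ball of radius ε, V_d(ε) = π^(d/2) ε^d / Γ(d/2 + 1). This is a back-of-the-envelope illustration of the geometry, not a computation from the paper; it ignores boundary effects near the faces of the unit cube.

```python
import math

def ball_fraction(eps, d):
    """Fraction of the unit hypercube's volume covered by a d-ball of
    radius eps (ignoring boundary effects)."""
    return math.pi ** (d / 2) * eps ** d / math.gamma(d / 2 + 1)

# the neighborhood of a point shrinks rapidly with dimensionality:
# ball_fraction(0.1, 2) is about 3.1e-2, ball_fraction(0.1, 8) about 4.1e-8
```

This collapse of neighborhood volume is why, at higher dimensionality, a fixed ε matches almost nothing beyond the point itself.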

Figure 8 presents the performance of the algorithms for experiment 2, which involves real stock market data. We employ a multidimensional join operation which reports only the total number of actual joining pairs. (We do not materialize the full join results, due to their size.) In Figure 8(a), we present the response time of the algorithms for ε = 0.003 as dimensionality increases. For nested loops and REPL, the basic observations are consistent with those from previous experiments. Both algorithms that use space filling curves have increased response times due to their sorting phase as dimensionality increases. However, processor time drops due to the smaller join result size with increasing dimensionality. Both algorithms are processor bound for this experiment, and this explains the smoother increase in response time as dimensionality increases.

Figure 8(b) presents the response time of the algorithms at dimensionality d = 4, for increasing values of ε. All algorithms appear to be processor bound, and the increase in the join result size accounts for the increase in response times for all algorithms.

Table 3 presents a summary of approximate response time ratios between other algorithms and MSJ, as observed in our experiments.


Figure 6: Portion of time spent at different phases of the algorithms, for a self join of 100,000 uniformly distributed points. (a) Increasing dimension for epsilon=0.011; (b) increasing epsilon for d=12.

Figure 7: Performance of multidimensional joins between two distinct clustered data sets; response time (sec, log scale) for MSJ, ZC, REPL, and Nested. (a) Increasing dimension for epsilon=0.006; (b) increasing epsilon for d=8.

Figure 8: Performance of joins between stock market data; response time (sec, log scale) for MSJ, ZC, REPL, and Nested. (a) Increasing dimension for epsilon=0.003; (b) increasing epsilon for d=4.


Experiment   Kind of      Characteristics          % buffer
id           Operation    of Data         d=3/4    d=8      d=12     d=20
3            Self Join    D1-clustered    66%      30%      20%      12%
4            Self Join    D2-actual       30.8%    16.9%    11.8%    7.76%

Table 2: Experiments performed and characteristics of the data sets involved in each experiment. The % buffer columns report the buffer space available to each experiment, as a percentage of the total size of the data sets joined, for various dimensionalities.

Ratio              Exp    d varies     ε varies
R_Nested/R_MSJ     E1     4 – 20       12 – 10
                   E2     8 – 30       8 – 6
R_REPL/R_MSJ       E1     2 – 6        3 – 2.5
                   E2     4 – 3.5      4 – 2.5
R_ZC/R_MSJ         E1     1.3 – 1.3    1.5 – 1.3
                   E2     1.5 – 1.5    1.5 – 1.5

Table 3: Summary of approximate response time ratios of other algorithms to MSJ

The results are reasonably consistent over the ranges of d and ε that we explored. The ZC algorithm had response times between 1.3 and 1.5 times those of MSJ over the range of experiments. The REPL algorithm showed more variability in its relative performance, with ratios ranging from 2 to 6 in various cases. Finally, the response times of nested loops were 4 to 30 times larger than MSJ's over the range of cases tested.

5.2.2 Experiments with Algorithms Based on Preconstructed Indices

The experimental results presented for algorithms that do not require preconstructed indices suggest that approaches based on space filling curves, and specifically MSJ, are effective in solving the multidimensional join problem. We also investigate the performance of MSJ in comparison to algorithms that utilize preconstructed indices.

Since MSJ's approach requires that the Hilbert values of the multidimensional points be precomputed, in this section we compare the performance of MSJ to that of the R-tree spatial join algorithm (RTJ), assuming that multidimensional R-trees already exist on the data sets involved in the join operation. That is, the cost of constructing the multidimensional R-tree indices of the joined data sets is omitted from the performance numbers.

Figure 9(a) presents the performance of both MSJ and RTJ for a self join of data sets containing 100,000 uniformly distributed points as dimensionality increases. For MSJ, the observations remain exactly the same as those pertaining to Figure 7. The performance of RTJ deteriorates as dimensionality increases: as dimensionality gets larger, the overlap between R-tree index and leaf entries increases. As a result, the number of pages that have to be pinned in the buffer pool is likely to increase as well. Since the size of the buffer pool is kept constant across dimensionalities for both algorithms, the number of page re-reads that RTJ has to schedule is expected to increase with dimensionality, and this explains the deterioration in performance. The performance of RTJ is also very sensitive to the amount of buffering available to the operation.

Figure 9: MSJ vs R*-tree Join; response time (sec). (a) Increasing dimension for epsilon=0.011; (b) increasing epsilon for d=12.

Figure 9(b) presents the performance of both MSJ and RTJ for increasing epsilon at dimensionality d = 12. Both algorithms incur increased processor time due to the increasing number of join tests for increasing values of epsilon. However, the performance of RTJ is worse than that of MSJ, since it requires a larger number of IOs.

6 Conclusions

In this paper, we have investigated the problem of computing multidimensional joins between pairs of multidimensional point data sets. There are two main contributions in this work. First, we presented the MSJ algorithm and showed experimentally that it is a promising solution to the multidimensional join problem. Second, we described several algorithmic approaches that can be applied to the computation of multidimensional joins, and discussed their strengths and weaknesses.

Several directions for future work on multidimensional joins are possible. The join result size of a multidimensional join operation is very sensitive to data distributions and to the value of ε. For some data distributions, even very small values of ε can yield very large result sizes. We feel that multidimensional join queries will be useful in practice only if they can be performed interactively: a user issues a query supplying an ε value and, after examining the results, may refine the choice of ε. To facilitate this type of interaction, it would be beneficial to restrict the join result size, saving a substantial amount of the computation needed to generate the complete list of actual joining pairs. One possible and useful restriction would be to report, for each multidimensional point, its k nearest neighbors located at most distance ε from the point. Adapting the multidimensional join query to perform this type of computation would be useful.
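The restricted join proposed above could look like the following brute-force sketch. It is ours, for illustration only (names and the dictionary result shape are assumptions, and a real implementation would use MSJ's partitioning rather than scanning all pairs): each query point reports at most k neighbors within distance ε.

```python
import heapq
import math

def knn_within_eps(queries, data, k, eps):
    """For each query point, report its k nearest neighbors among `data`
    that lie within distance eps (the restricted join result proposed in
    the text; a brute-force sketch)."""
    result = {}
    for i, q in enumerate(queries):
        # keep only candidates inside the eps-ball around q
        cands = [(math.dist(q, p), p) for p in data if math.dist(q, p) <= eps]
        # then keep at most the k closest of those
        result[i] = [p for _, p in heapq.nsmallest(k, cands)]
    return result
```

Capping the output at k pairs per point bounds the result size regardless of the data distribution, which is exactly what makes the interactive refine-ε loop described above feasible.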

References

[Ben80] Jon Louis Bentley. Multidimensional Divide-and-Conquer. CACM, Vol. 23, No. 4, pages 214-229, April 1980.

[BKK96] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The X-tree: An Index Structure for High Dimensional Data. Proceedings of VLDB, pages 28-39, September 1996.

[BKS93] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. Efficient Processing of Spatial Joins using R-trees. Proceedings of ACM SIGMOD, pages 237-246, May 1993.

[BKSS90] N. Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proceedings of ACM SIGMOD, pages 322-331, June 1990.

[FRM94] Christos Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast Subsequence Matching in Time Series Databases. Proceedings of ACM SIGMOD, pages 419-429, May 1994.

[Gut84] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. Proceedings of ACM SIGMOD, pages 47-57, June 1984.

[KS97a] Nick Koudas and K. C. Sevcik. High Dimensional Similarity Joins: Algorithms and Performance Evaluation. Technical Report TR-369, University of Toronto, December 1997.

[KS97b] Nick Koudas and K. C. Sevcik. Size Separation Spatial Join. Proceedings of ACM SIGMOD, pages 324-335, May 1997.

[LR96] Ming-Ling Lo and Chinya V. Ravishankar. Spatial Hash-Joins. Proceedings of ACM SIGMOD, pages 247-258, June 1996.

[Mel91] K. Mehlhorn. Data Structures and Algorithms III: Multidimensional Searching and Computational Geometry. Springer-Verlag, New York-Heidelberg-Berlin, June 1991.

[Ore89] J. Orenstein. Redundancy in Spatial Databases. Proceedings of ACM SIGMOD, pages 294-305, June 1989.

[Ore91] Jack Orenstein. An Algorithm for Computing the Overlay of k-Dimensional Spaces. Symposium on Large Spatial Databases, pages 381-400, August 1991.

[PD96] Jignesh M. Patel and David J. DeWitt. Partition Based Spatial-Merge Join. Proceedings of ACM SIGMOD, pages 259-270, June 1996.

[PS85] F. P. Preparata and M. I. Shamos. Computational Geometry. Springer-Verlag, New York-Heidelberg-Berlin, October 1985.

[Rob81] J. T. Robinson. The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. Proceedings of ACM SIGMOD, pages 10-18, 1981.

[Sam90] Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, June 1990.

[SSA97] K. Shim, R. Srikant, and R. Agrawal. High-dimensional Similarity Joins. Proceedings of the International Conference on Data Engineering, pages 301-311, April 1997.