
Ranking Spatial Data by Quality Preferences

Man Lung Yiu, Hua Lu, Nikos Mamoulis, and Michail Vaitis

Abstract—A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood. For example, using a real estate agency database of flats for lease, a customer may want to rank the flats with respect to the appropriateness of their location, defined after aggregating the qualities of other features (e.g., restaurants, cafes, hospital, market, etc.) within their spatial neighborhood. Such a neighborhood concept can be specified by the user via different functions. It can be an explicit circular region within a given distance from the flat. Another intuitive definition is to consider the whole spatial domain and assign higher weights to the features based on their proximity to the flat. In this paper, we formally define spatial preference queries and propose appropriate indexing techniques and search algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized branch-and-bound solution is efficient and robust with respect to different parameters.

Index Terms—H.2.4.h Query processing, H.2.4.k Spatial databases


1 INTRODUCTION

Spatial database systems manage large collections of geographic entities, which apart from spatial attributes contain non-spatial information (e.g., name, size, type, price, etc.). In this paper, we study an interesting type of preference queries, which select the best spatial location with respect to the quality of facilities in its spatial neighborhood.

Given a set D of interesting objects (e.g., candidate locations), a top-k spatial preference query retrieves the k objects in D with the highest scores. The score of an object is defined by the quality of features (e.g., facilities or services) in its spatial neighborhood. As a motivating example, consider a real estate agency office that holds a database with available flats for lease. Here "feature" refers to a class of objects in a spatial map such as specific facilities or services. A customer may want to rank the contents of this database with respect to the quality of their locations, quantified by aggregating non-spatial characteristics of other features (e.g., restaurants, cafes, hospital, market, etc.) in the spatial neighborhood of the flat (defined by a spatial range around it). Quality may be subjective and query-parametric. For example, a user may define quality with respect to non-spatial attributes of restaurants around it (e.g., whether they serve seafood, price range, etc.).

As another example, the user (e.g., a tourist) wishes to find a hotel p that is close to a high-quality restaurant and a high-quality cafe.

• M. L. Yiu is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: [email protected]

• H. Lu is with the Department of Computer Science, Aalborg University, DK-9220 Aalborg, Denmark. E-mail: [email protected]

• N. Mamoulis is with the Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong. E-mail: [email protected]

• M. Vaitis is with the Department of Geography, University of the Aegean, Mytilene, Greece. E-mail: [email protected]

• This work was supported by grant HKU 715509E from Hong Kong RGC.

[Figure 1 omitted: panel (a) range score, ε = 0.2 km; panel (b) influence score, ε = 0.2 km.]

Fig. 1. Examples of top-k spatial preference queries

Figure 1a illustrates the locations of an object dataset D (hotels) in white, and two feature datasets: the set F1 (restaurants) in gray, and the set F2 (cafes) in black. Feature points are labeled by quality values that can be obtained from rating providers (e.g., http://www.zagat.com/). For ease of discussion, the qualities are normalized to values in [0, 1]. The score τ(p) of a hotel p is defined in terms of: (i) the maximum quality for each feature in the neighborhood region of p, and (ii) the aggregation of those qualities.

A simple score instance, called the range score, binds the neighborhood region to a circular region at p with radius ε (shown as a circle), and the aggregate function to SUM. For instance, the maximum qualities of the gray and black points within the circle of p1 are 0.9 and 0.6 respectively, so the score of p1 is τ(p1) = 0.9 + 0.6 = 1.5. Similarly, we obtain τ(p2) = 1.0 + 0.1 = 1.1 and τ(p3) = 0.7 + 0.7 = 1.4. Hence, the hotel p1 is returned as the top result.

In fact, the semantics of the aggregate function is relevant to the user's query. The SUM function attempts to balance the overall qualities of all features. For the MIN function, the top result becomes p3, with the score τ(p3) = min{0.7, 0.7} = 0.7. It ensures that the top result has reasonably high qualities in all features. For the MAX function, the top result is p2, with τ(p2) = max{1.0, 0.1} = 1.0. It is used to optimize the quality in a particular feature, but not necessarily all of them.
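To make the aggregate semantics concrete, the following Python sketch recomputes the Figure 1a example. It is a minimal illustration, not our actual implementation: the coordinates are hypothetical stand-ins chosen so that the maximum feature qualities around each hotel match the figure, and only the quality values and the resulting ranks come from the text.

import math

EPS = 0.2  # neighborhood radius (the epsilon of the range score)

# Hypothetical coordinates reproducing the Figure 1a qualities.
hotels = {"p1": (0.20, 0.80), "p2": (0.80, 0.80), "p3": (0.50, 0.30)}
restaurants = [((0.25, 0.85), 0.9), ((0.85, 0.85), 1.0), ((0.55, 0.35), 0.7)]  # F1
cafes       = [((0.30, 0.75), 0.6), ((0.75, 0.75), 0.1), ((0.45, 0.25), 0.7)]  # F2

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def range_score(p, features, eps=EPS):
    # Maximum quality of the features within eps of p, or 0 if none.
    return max([w for s, w in features if dist(p, s) <= eps] + [0.0])

for agg in (sum, min, max):
    scores = {h: agg([range_score(p, restaurants), range_score(p, cafes)])
              for h, p in hotels.items()}
    print(agg.__name__, scores, "top-1:", max(scores, key=scores.get))

Running the sketch prints the three rankings discussed above: p1 wins under SUM (1.5), p3 under MIN (0.7), and p2 under MAX (1.0).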


The neighborhood region in the above spatial preference query can also be defined by other score functions. A meaningful score function is the influence score (see Section 4). As opposed to the crisp radius ε constraint in the range score, the influence score smoothens the effect of ε and assigns higher weights to cafes that are closer to the hotel. Figure 1b shows a hotel p5 and three cafes s1, s2, s3 (with their quality values). The circles have radii that are multiples of ε. Now, the score of a cafe si is computed by multiplying its quality with the weight 2^(−j), where j is the order of the smallest circle containing si. For example, the scores of s1, s2, and s3 are 0.3/2^1 = 0.15, 0.9/2^2 = 0.225, and 1.0/2^3 = 0.125 respectively. The influence score of p5 is taken as the highest value (0.225).
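The ring-based weighting of this introductory example can be sketched in a few lines of Python; the distances of s1, s2, s3 from p5 below are assumed values consistent with the circle orders in Figure 1b, and the ring order j is recovered as ceil(dist/ε).

import math

EPS = 0.2

def ring_weight(d, eps=EPS):
    # 2**(-j), where j is the order of the smallest circle (radius j*eps) containing the point.
    return 2.0 ** (-math.ceil(d / eps))

# (assumed distance to p5, quality) for cafes s1, s2, s3:
cafes = [(0.15, 0.3), (0.35, 0.9), (0.55, 1.0)]
print(max(w * ring_weight(d) for d, w in cafes))  # 0.225, contributed by s2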

Traditionally, there are two basic ways for ranking objects: (i) spatial ranking, which orders the objects according to their distance from a reference point, and (ii) non-spatial ranking, which orders the objects by an aggregate function on their non-spatial values. Our top-k spatial preference query integrates these two types of ranking in an intuitive way. As indicated by our examples, this new query has a wide range of applications in service recommendation and decision support systems.

To our knowledge, there is no existing efficient solution for processing the top-k spatial preference query. A brute-force approach (to be elaborated in Section 3.2) for evaluating it is to compute the scores of all objects in D and select the top-k ones. This method, however, is expected to be very expensive for large input datasets. In this paper, we propose alternative techniques that aim at minimizing the I/O accesses to the object and feature datasets, while also being computationally efficient. Our techniques apply on spatial-partitioning access methods and compute upper score bounds for the objects indexed by them, which are used to effectively prune the search space. Specifically, we contribute the branch-and-bound algorithm (BB) and the feature join algorithm (FJ) for efficiently processing the top-k spatial preference query.

Furthermore, this paper studies three relevant extensions that have not been investigated in our preliminary work [1]. The first extension (Section 3.4) is an optimized version of BB that exploits a more efficient technique for computing the scores of the objects. The second extension (Section 3.6) studies adaptations of the proposed algorithms for aggregate functions other than SUM, e.g., the functions MIN and MAX. The third extension (Section 4) develops solutions for the top-k spatial preference query based on the influence score.

The rest of this paper is structured as follows. Section 2 provides background on basic and advanced queries on spatial databases, as well as top-k query evaluation in relational databases. Section 3 defines the top-k spatial preference query and presents our solutions. Section 4 studies the query extension for the influence score. In Section 5, our query algorithms are experimentally evaluated with real and synthetic data. Finally, Section 6 concludes the paper with future research directions.

2 BACKGROUND AND RELATED WORK

Object ranking is a popular retrieval task in various applications. In relational databases, we rank tuples using an aggregate score function on their attribute values [2]. For example, a real estate agency maintains a database that contains information on flats available for rent. A potential customer wishes to view the top-10 flats with the largest sizes and lowest prices. In this case, the score of each flat is expressed by the sum of two qualities: size and price, after normalization to the domain [0, 1] (e.g., 1 means the largest size and the lowest price). In spatial databases, ranking is often associated with nearest neighbor (NN) retrieval. Given a query location, we are interested in retrieving the set of nearest objects to it that qualify a condition (e.g., restaurants). Assuming that the set of interesting objects is indexed by an R-tree [3], we can apply distance bounds and traverse the index in a branch-and-bound fashion to obtain the answer [4].
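For illustration, a minimal sketch of the relational ranking described above, with invented flat data and min-max normalization of both attributes:

flats = [("f1", 90, 1200), ("f2", 60, 800), ("f3", 75, 950)]  # (id, size, price)

max_size = max(size for _, size, _ in flats)
min_price = min(price for _, _, price in flats)
max_price = max(price for _, _, price in flats)

def score(flat):
    _, size, price = flat
    size_q = size / max_size                                 # 1 = largest size
    price_q = (max_price - price) / (max_price - min_price)  # 1 = lowest price
    return size_q + price_q

print(sorted(flats, key=score, reverse=True)[:10])  # top-10 flats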

Nevertheless, it is not always possible to use multi-dimensional indexes for top-k retrieval. First, such indexes break down in high dimensional spaces [5], [6]. Second, top-k queries may involve an arbitrary set of user-specified attributes (e.g., size and price) from the possible ones (e.g., size, price, distance to the beach, number of bedrooms, floor, etc.) and indexes may not be available for all possible attribute combinations (i.e., they are too expensive to create and maintain). Third, information for different rankings to be combined (i.e., for different attributes) could appear in different databases (in a distributed database scenario) and unified indexes may not exist for them. Solutions for top-k queries [7], [2], [8], [9] focus on the efficient merging of object rankings that may arrive from different (distributed) sources. Their motivation is to minimize the number of accesses to the input rankings until the objects with the top-k aggregate scores have been identified. To achieve this, upper and lower bounds for the objects seen so far are maintained while scanning the sorted lists.

In the following subsections, we first review the R-tree, which is the most popular spatial access method, and the NN search algorithm of [4]. Then, we survey recent research on feature-based spatial queries.

2.1 Spatial Query Evaluation on R-trees

The most popular spatial access method is the R-tree [3], which indexes minimum bounding rectangles (MBRs) of objects. Figure 2 shows a set D = {p1, ..., p8} of spatial objects (e.g., points) and an R-tree that indexes them. R-trees can efficiently process the main spatial query types, including spatial range queries, nearest neighbor queries, and spatial joins. Given a spatial region W, a spatial range query retrieves from D the objects that intersect W. For instance, consider a range query that asks for all objects within the shaded area in Figure 2. Starting from the root of the tree, the query is processed by recursively following entries having MBRs that intersect the query region. For instance, e1 does not intersect the query


region, thus the subtree pointed by e1 cannot contain any query result. In contrast, e2 is followed by the algorithm and the points in the corresponding node are examined recursively to find the query result p7.

A nearest neighbor (NN) query takes as input a query object q and returns the closest object in D to q. For instance, the nearest neighbor of q in Figure 2 is p7. Its generalization is the k-NN query, which returns the k closest objects to q, given a positive integer k. NN (and k-NN) queries can be efficiently processed using the best-first (BF) algorithm of [4], provided that D is indexed by an R-tree. A min-heap H, which organizes R-tree entries based on the (minimum) distance of their MBRs to q, is initialized with the root entries. In order to find the NN of q in Figure 2, BF first inserts to H the entries e1, e2, e3 and their distances to q. Then the nearest entry e2 is retrieved from H and the objects p1, p7, p8 are inserted to H. The next nearest entry in H is p7, which is the nearest neighbor of q. In terms of I/O, the BF algorithm is shown to be no worse than any NN algorithm on the same R-tree [4].
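A minimal in-memory sketch of this best-first traversal follows; it models an R-tree node as a plain Python list of (MBR, child) entries rather than a disk-based index, an assumption made only to keep the example self-contained.

import heapq, itertools, math

def mindist(q, mbr):
    # Minimum distance from point q to an MBR ((x1, y1), (x2, y2)).
    (x1, y1), (x2, y2) = mbr
    dx = max(x1 - q[0], 0.0, q[0] - x2)
    dy = max(y1 - q[1], 0.0, q[1] - y2)
    return math.hypot(dx, dy)

def best_first_nn(q, root):
    # root: list of (mbr, child); a child is a nested entry list (a node)
    # or a point (a data object, stored with a degenerate MBR (p, p)).
    tie = itertools.count()  # tie-breaker so the heap never compares children
    heap = [(mindist(q, mbr), next(tie), child) for mbr, child in root]
    heapq.heapify(heap)
    while heap:
        d, _, child = heapq.heappop(heap)
        if isinstance(child, tuple):   # a point: the first one popped is the NN
            return child, d
        for mbr, grandchild in child:  # a node: enheap its entries
            heapq.heappush(heap, (mindist(q, mbr), next(tie), grandchild))
    return None, math.inf

leaf = [(((4, 1), (4, 1)), (4, 1)), (((9, 9), (9, 9)), (9, 9))]
root = [(((4, 1), (9, 9)), leaf)]
print(best_first_nn((8, 8), root))  # ((9, 9), 1.414...)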

The aggregate R-tree (aR-tree) [10] is a variant of the R-tree, where each non-leaf entry augments an aggregate measure for some attribute value (measure) of all points in its subtree. As an example, the tree shown in Figure 2 can be upgraded to a MAX aR-tree over the point set, if entries e1, e2, e3 contain the maximum measure values of the sets {p2, p3}, {p1, p8, p7}, {p4, p5, p6}, respectively. Assume that the measure values of p4, p5, p6 are 0.2, 0.1, 0.4, respectively. In this case, the aggregate measure augmented in e3 would be max{0.2, 0.1, 0.4} = 0.4. In this paper, we employ MAX aR-trees for indexing the feature datasets (e.g., restaurants), in order to accelerate the processing of top-k spatial preference queries.

Given a feature dataset F and a multi-dimensional region R, the range top-k query selects the tuples (from F) within the region R and returns only those with the k highest qualities. Hong et al. [11] indexed the dataset by a MAX aR-tree and developed an efficient tree traversal algorithm to answer the query. Instead of finding the best k qualities from F in a specified region, our (range-score) query considers multiple spatial regions based on the points from the object dataset D, and attempts to find the best k regions (based on scores derived from multiple feature datasets Fc).

[Figure 2 omitted: points p1–p8 indexed by an R-tree with entries e1, e2, e3, a query point q, and a shaded range query region.]

Fig. 2. Spatial queries on R-trees

2.2 Feature-based Spatial Queries

Xia et al. [12] solved the problem of finding top-k sites (e.g., restaurants) based on their influence on feature points (e.g., residential buildings). As an example, Figure 3a shows a set of sites (white points) and a set of features (black points with weights), such that each line links a feature point to its nearest site. The influence of a site pi is defined by the sum of weights of feature points having pi as their closest site. For instance, the score of p1 is 0.9 + 0.5 = 1.4. Similarly, the scores of p2 and p3 are 1.5 and 1.2 respectively. Hence, p2 is returned as the top-1 influential site.
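This influence computation is easy to restate in Python; the coordinates below are hypothetical, chosen so that each site collects the weights attributed to it in the text (1.4, 1.5, and 1.2).

import math
from collections import defaultdict

sites = {"p1": (0.2, 0.8), "p2": (0.7, 0.7), "p3": (0.5, 0.2)}
features = [((0.25, 0.85), 0.9), ((0.15, 0.75), 0.5),   # nearest to p1
            ((0.75, 0.75), 0.8), ((0.65, 0.65), 0.7),   # nearest to p2
            ((0.55, 0.25), 0.5), ((0.45, 0.15), 0.7)]   # nearest to p3

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

influence = defaultdict(float)
for f, w in features:  # each feature adds its weight to its nearest site
    influence[min(sites, key=lambda s: dist(sites[s], f))] += w

print(max(influence.items(), key=lambda kv: kv[1]))  # ('p2', 1.5)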

Related to the top-k influential sites query are the optimal location queries studied in [13], [14]. The goal is to find the location in space (not chosen from a specific set of sites) that minimizes an objective function. In Figures 3b and 3c, feature points and existing sites are shown as black and gray points respectively. Assume that all feature points have the same quality. The maximum influence optimal location query [13] finds the location (to insert to the existing set of sites) with the maximum influence (as defined in [12]), whereas the minimum distance optimal location query [14] searches for the location that minimizes the average distance from each feature point to its nearest site. The optimal locations for both queries are marked as white points in Figures 3b and 3c respectively.

[Figure 3 omitted: panel (a) top-k influential; panel (b) max-influence; panel (c) min-distance.]

Fig. 3. Influential sites and optimal location queries

The techniques proposed in [12], [13], [14] are specific to the particular query types described above and cannot be extended for our top-k spatial preference queries. Also, they deal with a single feature dataset whereas our queries consider multiple feature datasets.

Recently, novel spatial queries and joins [15], [16], [17], [18] have been proposed for various spatial decision support problems. However, they do not utilize non-spatial qualities of facilities to define the score of a location. Finally, [19], [20] studied the evaluation of textual location-based queries on spatial objects.

3 SPATIAL PREFERENCE QUERIES

Section 3.1 formally defines the top-k spatial preference query problem and describes the index structures for the datasets. Section 3.2 studies two baseline algorithms for processing the query. Section 3.3 presents an efficient branch-and-bound algorithm for the query, and its further optimization is proposed in Section 3.4. Section 3.5 develops a specialized spatial join algorithm for evaluating the query. Finally, Section 3.6 extends the above algorithms for answering top-k spatial preference queries involving other aggregate functions.


3.1 Definitions and Index Structures

Let Fc be a feature dataset, in which each feature object s ∈ Fc is associated with a quality ω(s) and a spatial point. We assume that the domain of ω(s) is the interval [0, 1]. As an example, the quality ω(s) of a restaurant s may be obtained from a ratings provider.

Let D be an object dataset, where each object p ∈ D is a spatial point. In other words, D is the set of interesting points (e.g., hotel locations) considered by the user.

Given an object dataset D and m feature datasets F1, F2, ..., Fm, the top-k spatial preference query retrieves the k points in D with the highest scores. Here, the score of an object point p ∈ D is defined as:

τ^θ(p) = AGG { τc^θ(p) | c ∈ [1, m] }   (1)

where AGG is an aggregate function and τc^θ(p) is the c-th component score of p with respect to the neighborhood condition θ and the c-th feature dataset Fc.

We proceed to elaborate on the aggregate function and the component score function. Typical examples of the aggregate function AGG are SUM, MIN, and MAX. We first focus on the case where AGG is SUM. In Section 3.6, we will discuss the generic scenario where AGG is an arbitrary monotone aggregate function.

An intuitive choice for the component score function τc^θ(p) is the range score τc^rng(p), taken as the maximum quality ω(s) of points s ∈ Fc that are within a given parameter distance ε from p, or 0 if no such point exists:

τc^rng(p) = max({ ω(s) | s ∈ Fc ∧ dist(p, s) ≤ ε } ∪ {0})   (2)

In our problem setting, the user requires that an object p ∈ D must not be considered as a result if there exists some Fc such that the neighborhood region of p does not contain any feature point of Fc.
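A direct transcription of Equation 2 and the SUM aggregate might look as follows; treating a zero component as "no feature in range" is a simplification that matches the requirement above whenever all qualities are strictly positive.

import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def range_score(p, features, eps):
    # Equation 2: max quality within eps of p, or 0 when no such feature exists.
    return max([w for s, w in features if dist(p, s) <= eps] + [0.0])

def overall_score(p, feature_sets, eps):
    # SUM aggregate; None marks an object disqualified because some feature
    # set has no point inside the neighborhood region of p.
    parts = [range_score(p, fs, eps) for fs in feature_sets]
    return None if 0.0 in parts else sum(parts)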

There are other choices for the component score function τc^θ(p). One example is the influence score function τc^inf(p), which will be considered in Section 4. Another example is the nearest neighbor (NN) score τc^nn(p) that has been studied in our previous work [1], so it will not be examined again in this paper. The condition θ is dropped whenever the context is clear.

In this paper, we assume that the object dataset D is indexed by an R-tree and each feature dataset Fc is indexed by a MAX aR-tree, where each non-leaf entry augments the maximum quality (of features) in its subtree. Nevertheless, our solutions are directly applicable to datasets that are indexed by other hierarchical spatial indexes (e.g., point quad-trees). The rationale for indexing different feature datasets by separate aR-trees is that: (i) a user queries for only a few features (e.g., restaurants and cafes) out of all possible features (e.g., restaurants, cafes, hospital, market, etc.), and (ii) different users may consider different subsets of features.

Based on the above indexing scheme, we develop various algorithms for processing top-k spatial preference queries. Table 1 lists the notations used throughout the paper.

TABLE 1
List of Notations

Notation         Meaning
e                an entry in an R-tree
D                the object dataset
m                the number of feature datasets
Fc               the c-th feature dataset
ω(s)             the quality of a point s in Fc
ω(e)             the augmented quality of an aR-tree entry e of Fc
p                an object point of D
τc^θ(p)          the c-th component score of p
τ^θ(p)           the overall score of p
dist(p, s)       the Euclidean distance between two points p and s
mindist(p, e)    the minimum distance between p and e
maxdist(p, e)    the maximum distance between p and e
T(e)             the upper bound score of an R-tree entry e of D

3.2 Probing Algorithms

We first introduce a brute-force solution that computes the score of every point p ∈ D in order to obtain the query results. Then, we propose a group evaluation technique that computes the scores of multiple points concurrently.

Simple Probing Algorithm.
According to Section 3.1, the quality ω(s) of any feature point s falls into the interval [0, 1]. Thus, for a point p ∈ D whose component scores are not all known, its upper bound score τ+(p) is defined as:

τ+(p) = Σ_{c=1}^{m} { τc(p)   if τc(p) is known
                      1       otherwise }            (3)

It is guaranteed that the bound τ+(p) is greater than or equal to the actual score τ(p).

Algorithm 1 is a pseudo-code of the simple probing algorithm (SP), which retrieves the query results by computing the score of every object point. The algorithm uses two global variables: Wk is a min-heap for managing the top-k results and γ represents the top-k score so far (i.e., the lowest score in Wk). Initially, the algorithm is invoked at the root node of the object tree (i.e., N = D.root). The procedure is recursively applied (at Line 4) on tree nodes until a leaf node is accessed. When a leaf node is reached, the component score τc(e) (at Line 8) is computed by executing a range search on the feature tree Fc. Lines 6-8 describe an incremental computation technique for reducing unnecessary component score computations. In particular, the point e is ignored as soon as its upper bound score τ+(e) (see Equation 3) cannot be greater than the best-k score γ. The variables Wk and γ are updated when the actual score τ(e) is greater than γ.
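The incremental technique of Lines 6-8 can be sketched independently of the R-tree traversal; component_fns below is a hypothetical stand-in for the per-feature range searches.

def upper_bound(known, m):
    # Equation 3: known component scores plus 1 for each unknown one.
    return sum(known) + (m - len(known))

def score_with_pruning(p, component_fns, gamma):
    # Compute tau(p) one component at a time, abandoning p as soon as
    # its upper bound tau+(p) cannot exceed gamma.
    known = []
    for fn in component_fns:
        if upper_bound(known, len(component_fns)) <= gamma:
            return None  # pruned
        known.append(fn(p))
    return sum(known)

fns = [lambda p: 0.2, lambda p: 0.3]
print(score_with_pruning(None, fns, gamma=1.3))  # None: 0.2 + 1.0 <= 1.3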

Group Probing Algorithm.
Due to separate score computations for different objects, SP is inefficient for large object datasets. In view of this, we propose the group probing algorithm (GP), a variant of SP that reduces I/O cost by computing the scores of objects in the same leaf node of the R-tree concurrently. In GP, when a leaf node is visited, its points are first stored in a set V and then their component scores are computed concurrently in a single traversal of the Fc tree.


Algorithm 1 Simple Probing Algorithm (SP)
algorithm SP(Node N)
1:  for each entry e ∈ N do
2:    if N is non-leaf then
3:      read the child node N′ pointed by e;
4:      SP(N′);
5:    else
6:      for c := 1 to m do
7:        if τ+(e) > γ then            ▷ upper bound score
8:          compute τc(e) using tree Fc; update τ+(e);
9:      if τ(e) > γ then
10:       update Wk (and γ) by e;

We now introduce some distance notations for MBRs. Given a point p and an MBR e, the value mindist(p, e) (maxdist(p, e)) [4] denotes the minimum (maximum) possible distance between p and any point in e. Similarly, given two MBRs ea and eb, the value mindist(ea, eb) (maxdist(ea, eb)) denotes the minimum (maximum) possible distance between any point in ea and any point in eb.

Algorithm 2 shows the procedure for computing the c-th component score for a group of points. Consider a subset V of D for which we want to compute the τc^rng(p) scores at feature tree Fc. Initially, the procedure is called with N being the root node of Fc. If e is a non-leaf entry and its mindist from some point p ∈ V is within the range ε, then the procedure is applied recursively on the child node of e, since the subtree of Fc rooted at e may contribute to the component score of p. In case e is a leaf entry (i.e., a feature point), the scores of points in V are updated if they are within distance ε from e.

Algorithm 2 Group Range Score Algorithm
algorithm Group Range(Node N, Set V, Value c, Value ε)
1:  for each entry e ∈ N do
2:    if N is non-leaf then
3:      if ∃p ∈ V, mindist(p, e) ≤ ε then
4:        read the child node N′ pointed by e;
5:        Group Range(N′, V, c, ε);
6:    else
7:      for each p ∈ V such that dist(p, e) ≤ ε do
8:        τc(p) := max{τc(p), ω(e)};

3.3 Branch and Bound Algorithm

GP is still expensive as it examines all objects in D and computes their component scores. We now propose an algorithm that can significantly reduce the number of objects to be examined. The key idea is to compute, for non-leaf entries e in the object tree D, an upper bound T(e) of the score τ(p) for any point p in the subtree of e. If T(e) ≤ γ, then we need not access the subtree of e, thus we can save numerous score computations.

Algorithm 3 is a pseudo-code of our branch and bound algorithm (BB), based on this idea. BB is called with N being the root node of D. If N is a non-leaf node,

Lines 3-5 compute the scores T(e) for non-leaf entries e concurrently. Recall that T(e) is an upper bound score for any point in the subtree of e. The techniques for computing T(e) will be discussed shortly. Like Equation 3, with the component scores Tc(e) known so far, we can derive T+(e), an upper bound of T(e). If T+(e) ≤ γ, then the subtree of e cannot contain better results than those in Wk and it is removed from V. In order to obtain points with high scores early, we sort the entries in descending order of T(e) before invoking the above procedure recursively on the child nodes pointed by the entries in V. If N is a leaf node, we compute the scores for all points of N concurrently and then update the set Wk of the top-k results. Since both Wk and γ are global variables, the value of γ is updated during the recursive calls of BB.

Algorithm 3 Branch and Bound Algorithm (BB)
Wk := new min-heap of size k (initially empty);
γ := 0;                                ▷ k-th score in Wk
algorithm BB(Node N)
1:  V := {e | e ∈ N};
2:  if N is non-leaf then
3:    for c := 1 to m do
4:      compute Tc(e) for all e ∈ V concurrently;
5:      remove entries e in V such that T+(e) ≤ γ;
6:    sort entries e ∈ V in descending order of T(e);
7:    for each entry e ∈ V such that T(e) > γ do
8:      read the child node N′ pointed by e;
9:      BB(N′);
10: else
11:   for c := 1 to m do
12:     compute τc(e) for all e ∈ V concurrently;
13:     remove entries e in V such that τ+(e) ≤ γ;
14:   update Wk (and γ) by entries in V;

Upper Bound Score Computation.
It remains to clarify how the (upper bound) scores Tc(e) of non-leaf entries (within the same node N) can be computed concurrently (at Line 4). Our goal is to compute these upper bound scores such that:
• the bounds are computed with low I/O cost, and
• the bounds are reasonably tight, in order to facilitate effective pruning.
To achieve this, we utilize only level-1 entries (i.e., lowest level non-leaf entries) in Fc for deriving upper bound scores because: (i) there are much fewer level-1 entries than leaf entries (i.e., points), and (ii) high level entries in Fc cannot provide tight bounds. In our experimental study, we will also verify the effectiveness and the cost of using level-1 entries for upper bound score computation.

Algorithm 2 can be modified for the above upper bound computation task (where the input V corresponds to a set of non-leaf entries), after changing Line 2 to check whether the child nodes of N are above the leaf level.

The following example illustrates how upper bound range scores are derived. In Figure 4a, v1 and v2 are non-leaf entries in the object tree D and the others are level-1 entries in the feature tree Fc. For the entry v1, we first


define its Minkowski region [21] (i.e., the gray region around v1), the area whose mindist from v1 is within ε. Observe that only entries ei intersecting the Minkowski region of v1 can contribute to the score of some point in v1. Thus, the upper bound score Tc(v1) is simply the maximum quality of entries e1, e5, e6, e7, i.e., 0.9. Similarly, Tc(v2) is computed as the maximum quality of entries e2, e3, e4, e8, i.e., 0.7. Assuming that v1 and v2 are entries in the same tree node of D, their upper bounds are computed concurrently to reduce I/O cost.
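A sketch of this derivation, assuming the level-1 entries are available in memory as (MBR, augmented quality) pairs:

import math

def mindist_mbr(a, b):
    # Minimum distance between two MBRs of the form ((x1, y1), (x2, y2)).
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    dx = max(bx1 - ax2, 0.0, ax1 - bx2)
    dy = max(by1 - ay2, 0.0, ay1 - by2)
    return math.hypot(dx, dy)

def upper_bound_component(v, level1_entries, eps):
    # T_c(v): max augmented quality over level-1 entries whose mindist from v
    # is within eps, i.e., entries intersecting v's Minkowski region.
    quals = [w for mbr, w in level1_entries if mindist_mbr(v, mbr) <= eps]
    return max(quals) if quals else 0.0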

[Figure 4 omitted: panel (a) upper bound scores; panel (b) optimized computation.]

Fig. 4. Examples of deriving scores

3.4 Optimized Branch and Bound Algorithm

This section develops a more efficient score computation technique to reduce the cost of the BB algorithm.

Motivation.
Recall that Lines 11-13 of the BB algorithm are used to compute the scores of object points (i.e., leaf entries of the R-tree on D). A leaf entry e is pruned if its upper bound score τ+(e) is not greater than the best score found so far γ. However, the upper bound score τ+(e) (see Equation 3) is not tight because any unknown component score is replaced by a loose bound (i.e., the value 1).

Let us examine the computation of τ+(p1) for the point p1 in Figure 4b. The entry e1^F1 is a non-leaf entry from the feature tree F1. Its augmented quality value is ω(e1^F1) = 0.8. The entry points to a leaf node containing two feature points, whose quality values are 0.6 and 0.8 respectively. Similarly, e2^F2 is a non-leaf entry from the tree F2 and it points to a leaf node of feature points.

Suppose that the best score found so far in BB is γ = 1.4 (not shown in the figure). We need to check whether the score of p1 can be higher than γ. For this, we compute the first component score τ1(p1) = 0.6 by accessing the child node of e1^F1. Now, we have the upper bound score of p1 as τ+(p1) = 0.6 + 1.0 = 1.6. Such a bound is above γ = 1.4, so we need to compute the second component score τ2(p1) = 0.5 by accessing the child node of e2^F2. The exact score of p1 is τ(p1) = 0.6 + 0.5 = 1.1; the point p1 is then pruned because τ(p1) ≤ γ. In summary, two leaf nodes are accessed during the computation of τ(p1).

Our observation here is that the point p1 can be pruned earlier, without accessing the child node of e2^F2. By taking the maximum quality of level-1 entries (from F2) that intersect the ε-range of p1, we derive: τ2(p1) ≤ ω(e2^F2) = 0.7. With the first component score τ1(p1) = 0.6, we infer that τ(p1) ≤ 0.6 + 0.7 = 1.3. Such a value is below γ, so p1 can be pruned.

Optimized computation of scores.
Based on our observation, we propose a tighter derivation for the upper bound score of p than the one shown in Equation 3.

Let p be an object point in D. Suppose that we have traversed some paths of the feature trees on F1, F2, ..., Fm. Let μc be an upper bound of the quality value for any unvisited entry (leaf or non-leaf) of the feature tree Fc. We then define the function τ*(p) as:

τ*(p) = Σ_{c=1}^{m} max({ ω(s) | s ∈ Fc, dist(p, s) ≤ ε, ω(s) ≥ μc } ∪ { μc })   (4)

In the max function, the first set denotes the upper bound quality of any visited feature point within distance ε from p. The following lemma shows that the value τ*(p) is always greater than or equal to the actual score τ(p).

Lemma 1: It holds that τ*(p) ≥ τ(p), for any p ∈ D.

Proof: If the actual component score τc(p) is above μc, then τc(p) = max{ ω(s) | s ∈ Fc, dist(p, s) ≤ ε, ω(s) ≥ μc }. Otherwise, we derive τc(p) ≤ μc. In both cases, we have τc(p) ≤ max({ ω(s) | s ∈ Fc, dist(p, s) ≤ ε, ω(s) ≥ μc } ∪ { μc }). Therefore, we have τ*(p) ≥ τ(p).

According to Equation 4, the value τ*(p) is tight only when every μc value is low. In order to achieve this, we access the feature trees in a round-robin fashion, and traverse the entries in each feature tree in descending order of quality values. Round-robin is a popular and effective strategy used for efficient merging of rankings [7], [9]. Alternative strategies include the selectivity-based strategy and the fractal-dimension strategy [22]. These strategies are designed specifically for coping with high dimensional data; however, in our problem setting they offer insignificant performance gain over round-robin.

Algorithm 4 is the pseudo-code for computing the scores of objects efficiently from the feature trees F1, F2, ..., Fm. The set V contains objects whose scores need to be computed. ε refers to the distance threshold of the range score, and γ represents the best score found so far. For each feature tree Fc, we employ a max-heap Hc to traverse the entries of Fc in descending order of their quality values. The root of Fc is first inserted into Hc. The variable μc maintains the upper bound quality of entries in the tree that are yet to be visited. We then initialize each component score τc(p) of every object p ∈ V to 0.

At Line 7, the variable α keeps track of the ID of the current feature tree being processed. The loop at Line 8 is used to compute the scores for the points in the set V. We then deheap an entry e from the current heap Hα. The property of the max-heap guarantees that the quality value of any future entry deheaped from Hα is at most ω(e). Thus, the bound μα is updated to ω(e). At Lines 11-12, we prune the entry e if its distance from each object point p ∈ V is larger than ε. In case e is not pruned, we compute the tight upper bound score τ*(p)


for each p ∈ V (by Equation 4); the object p is removed from V if τ*(p) ≤ γ (Lines 13-15).

Next, we access the child node pointed to by e, and examine each entry e′ in the node (Lines 16-17). A non-leaf entry e′ is inserted into the heap Hα if its minimum distance from some p ∈ V is within ε (Lines 18-20); whereas a leaf entry e′ is used to update the component score τα(p) for any p ∈ V within distance ε from e′ (Lines 22-23). At Line 24, we apply the round-robin strategy to find the next α value such that the heap Hα is not empty. The loop at Line 8 repeats while V is not empty and there exists a non-empty heap Hc. At the end, the algorithm derives the exact scores for the remaining points of V.

Algorithm 4 Optimized Group Range Score Algorithm
algorithm Optimized Group Range(Trees F1, F2, ..., Fm, Set V, Value ε, Value γ)
1:  for c := 1 to m do
2:    Hc := new max-heap (with quality score as key);
3:    insert Fc.root into Hc;
4:    μc := 1;
5:    for each entry p ∈ V do
6:      τc(p) := 0;
7:  α := 1;                             ▷ ID of the current feature tree
8:  while |V| > 0 and there exists a non-empty heap Hc do
9:    deheap an entry e from Hα;
10:   μα := ω(e);                       ▷ update threshold
11:   if ∀p ∈ V, mindist(p, e) > ε then
12:     continue at Line 8;
13:   for each p ∈ V do                 ▷ prune unqualified points
14:     if (Σ_{c=1}^{m} max{μc, τc(p)}) ≤ γ then
15:       remove p from V;
16:   read the child node CN pointed to by e;
17:   for each entry e′ of CN do
18:     if CN is a non-leaf node then
19:       if ∃p ∈ V, mindist(p, e′) ≤ ε then
20:         insert e′ into Hα;
21:     else                            ▷ update component scores
22:       for each p ∈ V such that dist(p, e′) ≤ ε do
23:         τα(p) := max{τα(p), ω(e′)};
24:   α := next (round-robin) value where Hα is not empty;
25: for each entry p ∈ V do
26:   τ(p) := Σ_{c=1}^{m} τc(p);

The BB* Algorithm.
Based on the above, we extend BB (Algorithm 3) to an optimized BB* algorithm as follows. First, Lines 11-13 of BB are replaced by a call to Algorithm 4, for computing the exact scores for object points in the set V. Second, Lines 3-5 of BB are replaced by a call to a modified Algorithm 4, for deriving the upper bound scores for non-leaf entries (in V). Such a modified Algorithm 4 is obtained after replacing Line 18 by a check of whether the node CN is a non-leaf node above level 1.

3.5 Feature Join Algorithm

An alternative method for evaluating a top-k spatial preference query is to perform a multi-way spatial join [23] on the feature trees F1, F2, ..., Fm to obtain combinations of feature points which can be in the neighborhood of some object from D. Spatial regions which correspond to combinations of high scores are then examined, in order to find data objects in D having the corresponding feature combination in their neighborhood. In this section, we first introduce the concept of a combination, then discuss the conditions for a combination to be pruned, and finally elaborate on the algorithm used to progressively identify the combinations that correspond to query results.

[Figure 5 omitted: panel (a) non-leaf combination; panel (b) leaf combination.]

Fig. 5. Qualified combinations for the join

A tuple 〈f1, f2, ..., fm〉 is a combination if, for any c ∈ [1, m], fc is an entry (either leaf or non-leaf) in the feature tree Fc. The score of the combination is defined by:

τ(〈f1, f2, ..., fm〉) = Σ_{c=1}^{m} ω(fc)   (5)

For a non-leaf entry fc, ω(fc) is the MAX of all feature qualities in its subtree (stored with fc, since Fc is an aR-tree). A combination disqualifies the query if:

∃ (i ≠ j ∧ i, j ∈ [1, m]), mindist(fi, fj) > 2ε   (6)

When such a condition holds, it is impossible to have a point in D whose distances from fi and fj are both within ε. The above validity check acts as a multiway join condition that significantly reduces the number of combinations to be examined.

Figures 5a and 5b illustrate the condition for a non-leaf combination 〈A1, B2〉 and a leaf combination 〈a3, b4〉, respectively, to be a candidate combination for the query.
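The qualification test of Equation 6 is a pairwise check over the MBRs of the combination, as sketched below; mindist_mbr is the standard MBR-to-MBR minimum distance.

import math

def mindist_mbr(a, b):
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    dx = max(bx1 - ax2, 0.0, ax1 - bx2)
    dy = max(by1 - ay2, 0.0, ay1 - by2)
    return math.hypot(dx, dy)

def qualifies(combination_mbrs, eps):
    # Equation 6: prune when two entries are more than 2*eps apart,
    # since no object point can then be within eps of both.
    n = len(combination_mbrs)
    return all(mindist_mbr(combination_mbrs[i], combination_mbrs[j]) <= 2 * eps
               for i in range(n) for j in range(i + 1, n))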

Algorithm 5 is a pseudo-code of our feature join (FJ) algorithm. It employs a max-heap H for managing combinations of feature entries in descending order of their combination scores. The score of a combination 〈f1, f2, ..., fm〉 as defined in Equation 5 is an upper bound of the scores of all combinations 〈s1, s2, ..., sm〉 of feature points, such that sc is located in the subtree of fc for each c ∈ [1, m]. Initially, the combination with the root pointers of all feature trees is enheaped. We progressively deheap the combination with the largest score. If all its entries point to leaf nodes, then we load these nodes L1, ..., Lm and call Find Result to traverse the object R-tree D and find potential results. Find Result is a variant of the BB algorithm, with the following difference: L1, ..., Lm are viewed as m tiny feature trees (each with one node) and accesses to them incur no extra I/O cost.

In case not all entries of the deheaped combination point to leaf nodes (Line 9 of FJ), we select the one at the highest level, access its child node Nc, and then form new combinations with the entries in Nc. A new combination is inserted into H for further processing if its score is higher than γ and it qualifies the query. The loop (at Line 3) continues until H becomes empty.


Algorithm 5 Feature Join Algorithm (FJ)
Wk := new min-heap of size k (initially empty);
γ := 0;                                 ▷ k-th score in Wk
algorithm FJ(Tree D, Trees F1, F2, ..., Fm)
1:  H := new max-heap (combination score as the key);
2:  insert 〈F1.root, F2.root, ..., Fm.root〉 into H;
3:  while H is not empty do
4:    deheap 〈f1, f2, ..., fm〉 from H;
5:    if ∀c ∈ [1, m], fc points to a leaf node then
6:      for c := 1 to m do
7:        read the child node Lc pointed by fc;
8:      Find Result(D.root, L1, ..., Lm);
9:    else
10:     fc := highest level entry among f1, f2, ..., fm;
11:     read the child node Nc pointed by fc;
12:     for each entry ec ∈ Nc do
13:       insert 〈f1, f2, ..., ec, ..., fm〉 into H if its score is greater than γ and it qualifies the query;

algorithm Find Result(Node N, Nodes L1, ..., Lm)
1:  for each entry e ∈ N do
2:    if N is non-leaf then
3:      compute T(e) by entries in L1, ..., Lm;
4:      if T(e) > γ then
5:        read the child node N′ pointed by e;
6:        Find Result(N′, L1, ..., Lm);
7:    else
8:      compute τ(e) by entries in L1, ..., Lm;
9:      update Wk (and γ) by e (when necessary);


3.6 Extension to Monotonic Aggregate Functions

We now extend our proposed solutions for processing the top-k spatial preference query defined by any monotonic aggregate function AGG. Examples of AGG include (but are not limited to) the MIN and MAX functions.

Adaptation of incremental computation.
Recall that the incremental computation technique is applied by algorithms SP, GP, and BB for reducing I/O cost. Specifically, even if some component score τc(p) of a point p has not been computed yet, the upper bound score τ+(p) of p can be derived by Equation 3. Whenever τ+(p) drops below the best score found so far γ, the point p can be discarded immediately without needing to compute the unknown component scores of p.

In fact, the algorithms SP, GP, and BB are directly applicable to any monotonic aggregate function AGG because Equation 3 can be generalized for AGG. Now, the upper bound score τ+(p) of p is defined as:

τ+(p) = AGG_{c=1}^{m} { τc(p)   if τc(p) is known
                        1       otherwise }          (7)

Due to the monotonicity property of AGG, the bound τ+(p) is guaranteed to be greater than or equal to the actual score τ(p).
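A sketch of Equation 7 with the aggregate as a parameter; because a monotone AGG never decreases when an argument grows, substituting the maximum quality 1 for every unknown component always yields a valid upper bound.

def upper_bound(known, m, agg):
    # Equation 7: known is a dict {component index: score}; 1 replaces unknowns.
    return agg(known[c] if c in known else 1.0 for c in range(m))

known = {0: 0.6}  # only the first of two components has been computed
for agg in (sum, min, max):
    print(agg.__name__, upper_bound(known, 2, agg))
# sum -> 1.6, min -> 0.6, max -> 1.0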

Adaptation of upper bound computation.

The BB* and FJ algorithms compute the upper bound score of a non-leaf entry of the object tree D or a combination of entries from feature trees by summing its upper bound component scores. Both BB* and FJ are applicable to any monotonic aggregate function AGG, with only the slight modifications discussed below. For BB*, we replace the summation operator by AGG in Equation 4 and at Lines 14 and 26 of Algorithm 4. For FJ, we replace the summation by AGG in Equation 5.

4 INFLUENCE SCORE

This section first studies the influence score function that combines both the qualities and relative locations of feature points. It then presents the adaptations of our solutions in Section 3 for the influence score function. Finally, we discuss how our solutions can be used for other types of influence score functions.

4.1 Score Definition

The range score has the drawback that the parameter ε is not easy to set. Consider for instance the example of the range score τ^rng() in Figure 6a, where the white points are object points in D, and the gray and black points are feature points in the feature sets F1 and F2 respectively. If ε is set to 0.2 (shown by circles), then the object p2 has the score τ^rng(p2) = 0.9 + 0.1 = 1.0 and it cannot be the best object (as τ^rng(p1) = 1.2). This happens because a high-quality black feature is barely outside the ε-range of p2. Had ε been slightly larger, that black feature would contribute to the score of p2, making it the best object.

In the field of statistics, the Gaussian density function [24] has been used to estimate the density in the space from a set F of points. The density at location p is estimated as G(p) = Σ_{f∈F} exp(−dist²(p, f) / (2σ²)), where σ is a parameter. Its advantage is that the value G(p) is not sensitive to a slight change in σ. G(p) is mainly contributed by the points (of F) close to p and weakly affected by the points far away.

Inspired by the above function, we devise a score function such that it is not too sensitive to the range parameter ε. In addition, the users in our application usually prefer a high-quality restaurant (i.e., a feature point) rather than a large number of low-quality restaurants. Therefore, we use the maximum operator rather than the summation in G(p). Specifically, we define the influence score of an object point p with respect to the feature set Fc as:

τc^inf(p) = max{ ω(s) · 2^(−dist(p,s)/ε) | s ∈ Fc }   (8)

where ω(s) is the quality of s, ε is a user-specified range, and dist(p, s) is the distance between p and s.

The overall score τ^inf(p) of p is then defined as:

τ^inf(p) = AGG { τc^inf(p) | c ∈ [1, m] }   (9)


where AGG is a monotone aggregate operator and m is the number of feature datasets. Again, we focus on the case where AGG is the SUM function.

Let us compute the influence score τ^inf() for the points in Figure 6a, assuming ε = 0.2. From Figure 6a, we obtain τ^inf(p1) = max{0.7 · 2^(−0.18/0.20), 0.9 · 2^(−0.50/0.20)} + max{0.5 · 2^(−0.18/0.20), 0.1 · 2^(−0.60/0.20), 0.6 · 2^(−0.80/0.20)} = 0.643 and τ^inf(p2) = max{0.9 · 2^(−0.18/0.20), 0.7 · 2^(−0.65/0.20)} + max{0.1 · 2^(−0.19/0.20), 0.6 · 2^(−0.22/0.20), 0.5 · 2^(−0.70/0.20)} = 0.762. The top-1 point is p2, implying that the influence score can capture feature points outside the range ε = 0.2. In fact, the influence score function possesses two nice properties. Firstly, a feature point s that is barely outside the range ε (from the object point p) still has the potential to contribute to the score, provided that its quality ω(s) is sufficiently high. Secondly, the distance dist(p, s) has an exponentially decaying effect on the score, meaning that feature points nearer to p contribute higher scores.
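The derivation can be checked with a few lines of Python; the distance/quality pairs below are exactly the ones read off Figure 6a above.

eps = 0.2

def influence_component(pairs, eps):
    # Equation 8: max over features of quality * 2**(-dist/eps).
    return max(w * 2.0 ** (-d / eps) for d, w in pairs)

f1 = [(0.18, 0.9), (0.65, 0.7)]               # distances from p2 to F1 features
f2 = [(0.19, 0.1), (0.22, 0.6), (0.70, 0.5)]  # distances from p2 to F2 features
print(round(influence_component(f1, eps) + influence_component(f2, eps), 3))  # 0.762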

[Figure 6 omitted: panel (a) exact score; panel (b) upper bound property.]

Fig. 6. Example of influence score (ε = 0.2)

4.2 Query Processing for SP, GP, BB, and BB*

We now examine the extensions of the SP, GP, BB, and BB* algorithms for top-k spatial preference queries defined by the influence score in Equation 8.

Incremental computation technique.
Observe that the upper bound of τc^inf(p) is 1. Therefore, Equation 3 still holds for the influence score, and the incremental computation technique (see Section 3.2) can still be applied in SP, GP, and BB.

Exact score computation for a single object.
For the SP algorithm, we elaborate on how to compute the score τc^inf(p) (see Equation 8) of an object p ∈ D. This is challenging because some feature s ∈ Fc outside the ε-range of p may contribute to the score. Unlike the computation of the range score, we can no longer use the ε-range to restrict the search space.

Given an object point p and an entry e from the feature tree of Fc, we define the upper bound function:

ω^inf(e, p) = ω(e) · 2^(−mindist(p,e)/ε)   (10)

In case e is a leaf entry (i.e., a feature point s), we have ω^inf(s, p) = ω(s) · 2^(−dist(p,s)/ε). The following lemma shows that the value ω^inf(e, p) is an upper bound of ω^inf(e′, p) for any entry e′ in the subtree of e.

Lemma 2: Let e and e′ be entries from the feature tree Fc such that e′ is in the subtree of e. It holds that ω^inf(e, p) ≥ ω^inf(e′, p), for any object point p ∈ D.

Proof: Let p be any object point in D. Since e′ falls into the subtree of e, we have mindist(p, e) ≤ mindist(p, e′). As Fc is a MAX aR-tree, we have ω(e) ≥ ω(e′). Thus, we have ω^inf(e, p) ≥ ω^inf(e′, p).

As an example, Figure 6b shows an object point p and three entries e1, e2, e3 from the same feature tree. Note that e2 and e3 are in the subtree of e1. The dotted lines indicate the minimum distances from p to e1 and e3 respectively. Thus, we have ω^inf(e1, p) = 0.8 · 2^(−0.2/0.2) = 0.4 and ω^inf(e3, p) = 0.7 · 2^(−0.3/0.2) = 0.247. Clearly, ω^inf(e1, p) is larger than ω^inf(e3, p).

By using Lemma 2, we apply the best-first approach to compute the exact component score τc(p) for the point p. Algorithm 6 employs a max-heap H in order to visit the entries of the tree in descending order of their ω^inf values. We first insert the root of the tree Fc into H, and initialize τc(p) to 0. The loop at Line 4 continues as long as H is not empty. At Line 5, we deheap an entry e from H. If the value ω^inf(e, p) is above the current τc(p), then there is potential to update τc(p) by using some point in the subtree of e. In that case, we read the child node pointed to by e, and examine each entry e′ in that node (Lines 7-8). If e′ is a non-leaf entry, it is inserted into H provided that its ω^inf(e′, p) value is above τc(p). Otherwise, it is used to update τc(p).

Algorithm 6 Object Influence Score Algorithm
algorithm Object Influence(Point p, Value c, Value ε)
1:  H := new max-heap (with ω^inf value as key);
2:  insert 〈Fc.root, 1.0〉 into H;
3:  τc(p) := 0;
4:  while H is not empty do
5:    deheap an entry e from H;
6:    if ω^inf(e, p) > τc(p) then
7:      read the child node CN pointed to by e;
8:      for each entry e′ of CN do
9:        if CN is a non-leaf node then
10:         if ω^inf(e′, p) > τc(p) then
11:           insert 〈e′, ω^inf(e′, p)〉 into H;
12:       else                          ▷ update component score
13:         τc(p) := max{τc(p), ω^inf(e′, p)};

Group computation and upper bound computation.
Recall that, for the case of range scores, both the GP and BB algorithms apply the group computation technique (Algorithm 2) for concurrently computing the component score τc(p) for every object point p in a given set V. Algorithm 6 can be modified as follows to support concurrent computation of influence scores. Firstly, the parameter p is replaced by a set V of objects. Secondly, we initialize the value τc(p) for each object p ∈ V at Line 3 and perform the score update for each p ∈ V at Line 13. Thirdly, the conditions at Lines 6 and 10 are checked for whether they are satisfied by some object p ∈ V.

In addition, the BB algorithm (see Algorithm 3) needs to compute the upper bound component score Tc(e) for


all non-leaf entries in the current node simultaneously. Again, Algorithm 6 can be modified for this purpose.

Optimized computation of scores in BB*.
Given an entry e (from a feature tree), we define the upper bound score of e using a set V of points as:

ω^inf(e, V) = max_{p∈V} ω^inf(e, p)   (11)

The BB* algorithm applies Algorithm 4 to compute the range scores for a set V of object points. With Equation 11, we can modify Algorithm 4 to compute the influence score, with the following changes. Firstly, the heap Hc (at Line 2) is used to organize its entries e in descending order of the key ω^inf(e, V), and the value ω(e) (at Line 10) is replaced by ω^inf(e, V). Secondly, the restrictions based on the ε-range (at Lines 11-12, 19, 22) are removed. Thirdly, the value ω(e′) (at Line 23) needs to be replaced by ω^inf(e′, p).

4.3 Query Processing for FJ

The FJ algorithm can be adapted for the influence score, but with two changes. Recall that the tuple 〈f1, f2, ..., fm〉 is said to be a combination if fc is an entry in the feature tree Fc, for any c ∈ [1, m].

First, Equation 6 can no longer be used to prune a combination based on distances among the entries in the combination. Any possible combination must be considered if its upper bound score is above the best score found so far γ.

Second, Equation 5 is now a loose upper bound value for the influence score because it ignores the distances among the entries fc. Therefore, we need to develop a tighter upper bound for the influence score.

The following lemma shows that, given a set Φ of rectangles that partitions the spatial domain DOM, the value max_{r∈Φ} Σ_{c=1}^{m} ω^inf(fc, r) is an upper bound of the value Σ_{c=1}^{m} ω^inf(fc, p) for any point p (in DOM).

Lemma 3: Let Φ be a set of rectangles which partition the spatial domain DOM. Given the tree entries f1, f2, ..., fm (from the respective feature trees), it holds that max_{r∈Φ} Σ_{c=1}^{m} ω^inf(fc, r) ≥ Σ_{c=1}^{m} ω^inf(fc, p), for any point p ∈ DOM.

Proof: Let p be a point in DOM. There exists a rectangle r′ ∈ Φ such that p falls into r′. Thus, for any c ∈ [1, m], we have mindist(fc, r′) ≤ mindist(fc, p), and derive ω^inf(fc, r′) ≥ ω^inf(fc, p). By summing all components, we obtain Σ_{c=1}^{m} ω^inf(fc, r′) ≥ Σ_{c=1}^{m} ω^inf(fc, p). As r′ ∈ Φ, we also have max_{r∈Φ} Σ_{c=1}^{m} ω^inf(fc, r) ≥ Σ_{c=1}^{m} ω^inf(fc, r′). Therefore the lemma is proved.

In fact, the above upper bound value max_{r∈Φ} Σ_{c=1}^{m} ω^inf(fc, r) can be tightened by dividing the rectangles of Φ into smaller rectangles.

Figure 7a shows the combination 〈f1, f2〉, whose entries belong to the feature trees F1 and F2, respectively. We first partition the domain space into four rectangles r1, r2, r3, r4, and then compute their upper bound values (shown in the figure). Thus, the current upper bound score of 〈f1, f2〉 is taken as max{1.5, 1.0, 0.1, 0.8} = 1.5. To tighten the upper bound, we pick the rectangle (r1) with the highest value and partition it into four rectangles r11, r12, r13, r14 (see Figure 7b). Now the upper bound score of 〈f1, f2〉 becomes max{0.5, 1.1, 0.8, 0.7, 1.0, 0.1, 0.8} = 1.1. By applying this method iteratively, the upper bound score can be gradually tightened.

Fig. 7. Deriving an upper bound of the influence score for FJ: (a) first iteration, (b) second iteration.

Algorithm 7 shows the pseudocode for computing the upper bound score for the combination 〈f1, f2, · · · , fm〉 of feature entries. The parameter γ represents the best score found so far (in FJ). The value Imax is used to control the number of iterations of the algorithm; its typical value is 20. At Line 1, we employ a max-heap H to organize its rectangles in descending order of their upper bound scores. Then, we insert the spatial domain rectangle into H. The loop at Line 4 continues while H is not empty and Imax > 0. After deheaping a rectangle r from H (Line 5), we partition it into four child rectangles. Each child rectangle r′ is inserted into H if its upper bound score is above γ. We then decrement Imax (at Line 10). At the end (Lines 11–14), if the heap H is not empty, the algorithm returns the key value of H's top entry as the upper bound score; such a value is guaranteed to be the maximum upper bound value in the heap. Otherwise (i.e., H is empty), the algorithm returns γ as the upper bound score, because all rectangles with scores below γ have been pruned.

Algorithm 7 FJ Upper Bound Computation
algorithm FJ Influence(Value ε, Value γ, Value Imax, Entries f1, f2, · · · , fm)
1: H := new max-heap (with upper bound score as key);
2: let DOM be the rectangle of the spatial domain;
3: insert 〈DOM, ∑_{c=1}^{m} ω(fc)〉 into H;
4: while Imax > 0 and H is not empty do
5:   deheap a rectangle r from H;
6:   partition r into four child rectangles;
7:   for each child rectangle r′ of r do
8:     if ∑_{c=1}^{m} ωinf(fc, r′) > γ then
9:       insert 〈r′, ∑_{c=1}^{m} ωinf(fc, r′)〉 into H;
10: Imax := Imax − 1;
11: if H is not empty then
12:   return the key value of the top entry of H;
13: else
14:   return γ;
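A compact Python sketch of this refinement loop follows, under the same assumptions as the earlier sketches (entries exposing an MBR and a max quality w, and a 2^(−d/ε) influence bound); combination_upper_bound, mindist_rect, and split_into_quadrants are hypothetical helper names.

import heapq
import math

def mindist_rect(a, b):
    # Minimum distance between two axis-aligned rectangles.
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, 0.0, ax1 - bx2)
    dy = max(by1 - ay2, 0.0, ay1 - by2)
    return math.hypot(dx, dy)

def split_into_quadrants(r):
    # Partition rectangle r into its four quadrants.
    x1, y1, x2, y2 = r
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return [(x1, y1, mx, my), (mx, y1, x2, my),
            (x1, my, mx, y2), (mx, my, x2, y2)]

def combination_upper_bound(entries, gamma, i_max, dom, eps):
    # Algorithm 7 in miniature: refine a partition of the domain DOM,
    # keeping rectangles in a max-heap keyed by sum_c w_inf(f_c, r).
    def score(rect):
        return sum(e.w * 2.0 ** (-mindist_rect(e.mbr, rect) / eps)
                   for e in entries)
    heap = [(-score(dom), dom)]
    while i_max > 0 and heap:
        _, r = heapq.heappop(heap)
        for quad in split_into_quadrants(r):
            s = score(quad)
            if s > gamma:          # rectangles scoring below gamma are pruned
                heapq.heappush(heap, (-s, quad))
        i_max -= 1
    # Top of the heap (if any) is the tightest bound found; if everything
    # was pruned, gamma itself is a safe upper bound.
    return -heap[0][0] if heap else gamma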


4.4 Extension to Generic Influence Scores

Our algorithms can also be applied to other types of influence score functions. Given a function U : ℝ → ℝ, we model a score function as ωinf(s, p) = ω(s) · U(dist(p, s)), where p is an object point and s ∈ Fc is a feature point.

Let e be a feature tree entry. The crux of our solution is to re-define the upper bound function ωinf(e, p) (as in Equation 10) such that ωinf(e, p) ≥ ωinf(s, p) for any feature point s in the subtree of e.

In fact, the upper bound function can be expressed as ωinf(e, p) = ω(e) · U(d), where d is a distance value. Observe that d must fall in the interval [mindist(p, e), maxdist(p, e)]. Thus, we apply a numerical method (e.g., the bisection method) to find the value d ∈ [mindist(p, e), maxdist(p, e)] that maximizes the value of U.

For the special case where U is a monotonically decreasing function, we can simply set d = mindist(p, e), which definitely maximizes the value of U.
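As an illustration, the sketch below maximizes U over [mindist(p, e), maxdist(p, e)] by ternary search, one concrete choice of numerical method; it assumes U is unimodal on the interval (the paper only suggests a bisection-style method, so this substitution is ours). It reuses mindist and the toy entries from the earlier sketches; maxdist, argmax_unimodal, and entry_upper_bound are hypothetical helper names.

import math

def maxdist(p, mbr):
    # Maximum distance from point p to rectangle mbr (farthest corner).
    x1, y1, x2, y2 = mbr
    dx = max(abs(p[0] - x1), abs(p[0] - x2))
    dy = max(abs(p[1] - y1), abs(p[1] - y2))
    return math.hypot(dx, dy)

def argmax_unimodal(U, lo, hi, iters=60):
    # Ternary search: shrink [lo, hi] around the maximizer of a unimodal U.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if U(m1) < U(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

def entry_upper_bound(e, p, U):
    # Generic bound w_inf(e, p) = w(e) * max U(d) over feasible distances d.
    lo, hi = mindist(p, e.mbr), maxdist(p, e.mbr)
    d = argmax_unimodal(U, lo, hi)
    return e.w * U(d)

# For a monotonically decreasing U, no search is needed:
# the bound reduces to e.w * U(mindist(p, e.mbr)).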

5 EXPERIMENTAL EVALUATION

In this section, we compare the efficiency of the proposed algorithms using real and synthetic datasets. Each dataset is indexed by an aR-tree with a 4K-byte page size. We used an LRU memory buffer whose default size is set to 0.5% of the sum of the tree sizes (of the object and feature trees used). Our algorithms were implemented in C++ and the experiments were run on a Pentium D 2.8GHz PC with 1GB of RAM. In all experiments, we measure both the I/O cost (in number of page faults) and the total execution time (in seconds) of our algorithms. Section 5.1 describes the experimental settings. Sections 5.2 and 5.3 study the performance of the proposed algorithms for queries with range scores and influence scores, respectively. We then present our experimental findings on real data in Section 5.4.

5.1 Experimental Settings

We used both real and synthetic data for the experiments. The real datasets are described in Section 5.4. For each synthetic dataset, the coordinates of points are random values uniformly and independently generated for the different dimensions. By default, an object dataset contains 200K points and a feature dataset contains 100K points. The point coordinates of all datasets are normalized to the 2D space [0, 10000]².

For a feature dataset Fc, we generated qualities for its points such that they simulate a real-world scenario: facilities close to (far from) a town center often have high (low) quality. For this, a single anchor point s⋆ is selected such that its neighborhood region contains a high number of points. Let distmin (distmax) be the minimum (maximum) distance of a point in Fc from the anchor s⋆. Then, the quality of a feature point s is generated as:

ω(s) = ((distmax − dist(s, s⋆)) / (distmax − distmin))^ϑ    (12)

where dist(s, s⋆) stands for the distance between s and s⋆, and ϑ controls the skewness (default: 1.0) of the quality distribution. In this way, the qualities of points in Fc lie in [0, 1] and points closer to the anchor have higher qualities. Also, the quality distribution is highly skewed for large values of ϑ.
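A short Python sketch of this generator, directly implementing Equation 12; the anchor here is a fixed illustrative point rather than one chosen by the density criterion above, and generate_qualities is a hypothetical name.

import math
import random

def generate_qualities(points, anchor, theta=1.0):
    # Equation 12: points closer to the anchor s* receive qualities nearer
    # to 1; theta skews the distribution (larger theta, more skew).
    dists = [math.dist(s, anchor) for s in points]
    d_min, d_max = min(dists), max(dists)
    return [((d_max - d) / (d_max - d_min)) ** theta for d in dists]

# Example: 100K uniform points in [0, 10000]^2 with a central anchor.
pts = [(random.uniform(0.0, 10000.0), random.uniform(0.0, 10000.0))
       for _ in range(100_000)]
qualities = generate_qualities(pts, anchor=(5000.0, 5000.0), theta=1.0)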

We study the performance of our algorithms with respect to various parameters, which are displayed in Table 2. In each experiment, only one parameter varies while the others are fixed to their default values.

TABLE 2
Range of parameter values

Parameter                        Values
Aggregate function               SUM, MIN, MAX
Buffer size (%)                  0.1, 0.2, 0.5, 1, 2, 5, 10
Object data size, |D| (×1000)    100, 200, 400, 800, 1600
Feature data size, |F| (×1000)   50, 100, 200, 400, 800
Number of results, k             1, 2, 4, 8, 16, 32, 64
Number of features, m            1, 2, 3, 4, 5
Query range, ε                   10, 20, 50, 100, 200

5.2 Performance on Queries with Range Scores

This section studies the performance of our algorithms for top-k spatial preference queries with range scores.

Table 3 shows the I/O cost and execution time of the algorithms for the different aggregate functions (SUM, MIN, MAX). GP has a lower cost than SP because GP computes the scores of points within the same leaf node concurrently. The incremental computation technique (used by SP and GP) derives a tight upper bound score (of each point) for the MIN function, a partially tight bound for SUM, and a loose bound for MAX (see Section 3.6). This explains the performance of SP and GP across the different aggregate functions. The costs of the other methods, however, are mainly influenced by the effectiveness of pruning. BB employs an effective technique to prune unqualified non-leaf entries in the object tree, so it outperforms GP. The optimized score computation method enables BB* to save, on average, 20% of the I/O and 30% of the time of BB. FJ outperforms its competitors as it discovers qualified combinations of feature entries early.

We ignore SP in the subsequent experiments and compare the costs of the remaining algorithms on synthetic datasets with respect to different parameters.

TABLE 3
Effect of the aggregate function, range scores

SUM function    SP       GP      BB     BB*    FJ
I/O             350927   22594   2033   1535   489
Time (s)        635.0    32.7    3.0    2.0    1.3

MIN function    SP       GP      BB     BB*    FJ
I/O             235602   16254   611    615    47
Time (s)        426.8    22.7    0.9    0.8    0.2

MAX function    SP       GP      BB     BB*    FJ
I/O             402704   26128   228    186    8
Time (s)        742.8    38.2    0.3    0.2    0.1


Next, we empirically justify the choice of using level-1 entries of the feature trees Fc for the upper bound score computation routine in the BB algorithm (see Section 3.3). In this experiment, we use the default parameter setting and study how the number of node accesses of BB is affected by the level of Fc used. Table 4 shows the decomposition of node accesses over the tree D and the trees Fc, as well as statistics of the upper bound score computation. Each accessed non-leaf node of D invokes a call of the upper bound score computation routine.

When level-0 entries of Fc are used, each upper bound computation call incurs a high number (617.5) of node accesses (of Fc). On the other hand, using level-2 entries for upper bound computation leads to very loose bounds, making it difficult to prune the leaf nodes of D. Observe that the total cost is minimized when level-1 entries (of Fc) are used. In that case, the number of node accesses per upper bound computation call is low (15), and yet the obtained bounds are tight enough to prune most leaf nodes of D.

TABLE 4
Effect of the level of Fc used for upper bound score computation in the BB algorithm

        Node accesses (NA)          Upper bound score computation
Level   Total    of D    of Fc      # of calls   NA of Fc per call
0       3350     53      3297       4            617.5
1       2365     130     2235       4            15
2       13666    930     12736      14           2

Figure 8 plots the cost of the algorithms as a function of the buffer size. As the buffer size increases, the I/O cost of all algorithms drops. FJ remains the best method, BB* the second, and BB the third; all of them outperform GP by a wide margin. Since the buffer size does not affect the pruning effectiveness of the algorithms, it has only a small impact on the execution time.

Figure 9 compares the cost of the algorithms with respect to the object data size |D|. Since the cost of FJ is dominated by the cost of joining the feature datasets, it is insensitive to |D|. On the other hand, the cost of the other methods (GP, BB, BB*) increases with |D|, as score computations need to be performed for more objects in D.

Figure 10 plots the I/O cost of the algorithms with respect to the feature data size |F| (of each feature dataset). As |F| increases, the cost of GP, BB, and FJ increases. In contrast, BB* experiences a slight cost reduction, as its optimized score computation method (for objects and non-leaf entries) is able to prune early at large |F| values.

Figure 11 plots the cost of the algorithms with respect to the number m of feature datasets. The costs of GP, BB, and BB* increase linearly with m, because the number of component score computations is at most linear in m. On the other hand, the cost of FJ increases significantly with m, because the number of qualified combinations of entries is exponential in m.

Figure 12 shows the cost of the algorithms as a function of the number k of requested results. GP, BB, and BB* compute the scores of objects in D in batches, so their performance is insensitive to k. As k increases, FJ has weaker pruning power and its cost increases slightly.

Figure 13 shows the cost of the algorithms when varying the query range ε. As ε increases, all methods access more nodes in the feature trees to compute the scores of the points. The difference in execution time between BB* and FJ shrinks as ε increases. In summary, although FJ is the clear winner in most of the experimental instances, its performance is significantly affected by the number m of feature datasets. BB* is the most robust algorithm with respect to parameter changes, and it is recommended for problems with large m.

Fig. 8. Effect of buffer size, range scores: (a) I/O, (b) time.

Fig. 9. Effect of |D|, range scores: (a) I/O, (b) time.

Fig. 10. Effect of |F|, range scores: (a) I/O, (b) time.

5.3 Performance on Queries with Influence Scores

We proceed to examine the cost of our algorithms for top-k spatial preference queries with influence scores.

Figure 14 compares the cost of the algorithms with respect to the number m of feature datasets. The cost follows the trend in Figure 11. Again, the number of combinations examined by FJ increases exponentially with m, so its cost increases rapidly.

Figure 15 plots the cost of the algorithms when varying the number k of requested results.


Fig. 11. Effect of m, range scores: (a) I/O, (b) time.

Fig. 12. Effect of k, range scores: (a) I/O, (b) time.

Observe that FJ becomes more expensive than BB* (in both I/O and time) when the value of k goes beyond 8. This is attributed to two reasons. First, FJ incurs extra computational cost, as it needs to invoke Algorithm 7 to compute the upper bound score of each combination of feature entries. Second, FJ incurs a high I/O cost to identify the objects in D that produce high scores with the current combination of features.

Figure 16 shows the cost of the algorithms as a function of the parameter ε. Interestingly, the trend here is different from the one in Figure 13. According to Equation 8, when ε decreases, the influence score also decreases, making it more difficult to distinguish the scores of different objects. Thus, the costs of BB, BB*, and FJ become high at low ε values. Summing up, for the newly introduced influence score, FJ is more sensitive to parameter changes, and it loses to BB* not only when there are multiple feature datasets but also at large k.

5.4 Results on Real Data

In this section, we conduct experiments on real object and feature datasets in order to demonstrate the application of top-k spatial preference queries.

Fig. 13. Effect of ε, range scores: (a) I/O, (b) time.

Fig. 14. Effect of m, influence scores: (a) I/O, (b) time.

Fig. 15. Effect of k, influence scores: (a) I/O, (b) time.

We obtained three real spatial datasets from a travel portal website, http://www.allstays.com/. Locations in these datasets correspond to (longitude, latitude) coordinates in the US. We cleaned the datasets by discarding records without longitude and latitude. Each remaining location is normalized to a point in the 2D space [0, 10000]². One dataset is used as the object dataset and the other two are used as feature datasets. The object dataset D contains 11399 camping locations. The feature dataset F1 contains 30921 hotel records, each with a room price (quality) and a location. The feature dataset F2 has 3848 records of Wal-Mart stores, each with gasoline availability (quality) and a location. The domain of each quality attribute (e.g., room price, gasoline availability) is normalized to the unit interval [0, 1]. Intuitively, a camping location is considered good if it is close to a Wal-Mart store with high gasoline availability (i.e., convenient supply) and a hotel with a high room price (which indirectly reflects the quality of the nearby outdoor environment).

Figure 17 plots the cost of the algorithms with respect to ε, for queries with range scores.

Fig. 16. Effect of ε, influence scores: (a) I/O, (b) time.


At a very small ε value, most objects have a zero score, as they have no feature points within their neighborhood. This forces BB, BB*, and FJ to access a large number of objects (or feature combinations) before finding an object with a non-zero score, which can then be used for pruning other unqualified objects.

Figure 18 compares the cost of the algorithms with respect to ε, for queries with influence scores. In general, the cost follows the trend in Figure 16. BB* outperforms BB at low ε values, whereas BB incurs a slightly lower cost than BB* at high ε values. Observe that the costs of BB and BB* are close to that of FJ when ε is sufficiently high. In summary, the relative performance of the algorithms in all experiments is consistent with the results on synthetic data.

Fig. 17. Effect of ε, range scores, real data: (a) I/O, (b) time.

Fig. 18. Effect of ε, influence scores, real data: (a) I/O, (b) time.

6 CONCLUSION

In this paper, we studied top-k spatial preference queries, which provide a novel type of ranking for spatial objects based on the qualities of features in their neighborhood. The neighborhood of an object p is captured by the scoring function: (i) the range score restricts the neighborhood to a crisp region centered at p, whereas (ii) the influence score relaxes the neighborhood to the whole space and assigns higher weights to locations closer to p.

We presented five algorithms for processing top-k spatial preference queries. The baseline algorithm SP computes the score of every object by querying the feature datasets. The algorithm GP is a variant of SP that reduces the I/O cost by computing the scores of objects in the same leaf node concurrently. The algorithm BB derives upper bound scores for non-leaf entries in the object tree and prunes those that cannot lead to better results. The algorithm BB* is a variant of BB that utilizes an optimized method for computing the scores of objects (and the upper bound scores of non-leaf entries). The algorithm FJ performs a multi-way join on the feature trees to obtain qualified combinations of feature points and then searches for their relevant objects in the object tree.

Based on our experimental findings, BB* is scalable to large datasets and is the most robust algorithm with respect to the various parameters. However, FJ is the best algorithm in cases where the number m of feature datasets is low and each feature dataset is small.

In the future, we will study the top-k spatial preference query on road networks, in which the distance between two points is defined by their shortest path distance rather than their Euclidean distance. The challenge is to develop alternative methods for computing the upper bound scores for a group of points on a road network.


Man Lung Yiu received the bachelor's degree in computer engineering and the PhD degree in computer science from the University of Hong Kong in 2002 and 2006, respectively. Prior to his current post, he worked at Aalborg University for three years, starting in the fall of 2006. He is now an assistant professor in the Department of Computing, Hong Kong Polytechnic University. His research focuses on the management of complex data, in particular query processing topics on spatiotemporal and multidimensional data.

Hua Lu received the BSc and MSc degrees from Peking University, China, in 1998 and 2001, respectively, and the PhD degree in computer science from the National University of Singapore in 2007. He is currently an assistant professor in the Department of Computer Science, Aalborg University, Denmark. His research interests include skyline queries, spatio-temporal databases, geographic information systems, and mobile computing. He is a member of the IEEE.

Nikos Mamoulis received a diploma in computer engineering and informatics in 1995 from the University of Patras, Greece, and a PhD in computer science in 2000 from the Hong Kong University of Science and Technology. He is currently an associate professor at the Department of Computer Science, University of Hong Kong, which he joined in 2001. In the past, he has worked as a research and development engineer at the Computer Technology Institute, Patras, Greece, and as a post-doctoral researcher at the Centrum voor Wiskunde en Informatica (CWI), the Netherlands. During 2008-2009 he was on leave at the Max-Planck Institute for Informatics (MPII), Germany. His research focuses on the management and mining of complex data types, including spatial, spatio-temporal, object-relational, multimedia, text, and semi-structured data. He has served on the program committees of over 70 international conferences and workshops on data management and data mining. He was the general chair of SSDBM 2008 and the PC chair of SSTD 2009, and he organized the SSTDM 2006 and DBRank 2009 workshops. He has served as PC vice chair of ICDM 2007, ICDM 2008, and CIKM 2009, and was the publicity chair of ICDE 2009. He is an editorial board member of the Geoinformatica journal and was a field editor of the Encyclopedia of Geographic Information Systems.

Michail Vaitis holds a diploma (1992) and a PhD (2001) in computer engineering and informatics from the University of Patras, Greece. He collaborated for five years with the Computer Technology Institute (CTI), Greece, as a research and development engineer, working on hypertext and database systems. He is now an assistant professor at the Department of Geography, University of the Aegean, Greece, which he joined in 2003. His research interests include geographical databases, spatial data infrastructures, and hypermedia models and services.