

Time-Aware Boolean Spatial Keyword Queries

Gang Chen, Jingwen Zhao, Yunjun Gao, Member, IEEE, Lei Chen, Senior Member, IEEE, and Rui Chen

Abstract—With advances in geo-positioning technologies and the mobile internet, location-based services have attracted much attention, and spatial keyword queries are catching on fast. However, as far as we are aware, no prior work considers the temporal information of geo-tagged objects. Temporal information is important in the spatial keyword query because many objects are not always valid. For example, visitors may plan their trips according to the opening time of attractions. In this paper, we identify and solve a novel problem, i.e., the time-aware Boolean spatial keyword query (TABSKQ), which returns the k objects that satisfy users' spatio-temporal description and textual constraint. We first present pruning strategies and an algorithm based on the CIR+-tree (i.e., the CIR-tree with temporal information). Then, we propose an efficient index structure, called the TA-tree, and its corresponding algorithms, which can prune the search space using both spatio-temporal and textual information. Furthermore, we study an interesting TABSKQ variant, i.e., Joint TABSKQ (JTABSKQ), which aims to process a set of TABSKQs jointly, and extend our techniques to tackle it. Extensive experiments with real datasets offer insight into the performance of our proposed indices and algorithms.

Index Terms—Boolean spatial keyword query, index structure, query processing, algorithm


1 INTRODUCTION

Due to the popularity of geo-positioning technologies, the spatial keyword query has received much attention from both industry and research communities. Basically, it takes as inputs a query location and a set of keywords, and returns the matching objects. In general, there are two types of spatial keyword queries, namely, ranked spatial keyword queries [1], [2], [3], [4], [5] and Boolean spatial keyword queries [6], [7], [8], [9], [10]. A ranked spatial keyword query retrieves the objects according to a ranking function that considers both textual relevance and spatial proximity, while a Boolean spatial keyword query aims to find a set of spatio-textual objects each of which contains all the query keywords and that are the nearest to the query location.

Existing efforts mostly focus on retrieving the objects with matching spatial and textual information. However, to our knowledge, no prior work has taken into account temporal information, which plays an important role in spatial keyword queries. For instance, visitors may plan their trips according to the opening time of attractions. An example is illustrated in Fig. 1, where a tourist (at $q$ in Fig. 1) would like to visit a "museum" in "Phoenix" between 14:00 and 16:00, and to maximize his/her visiting time. In this case, although $o_2$ is the closest object to $q$ and contains the keywords "Museum" and "Phoenix", its valid time during the query time interval is shorter than that of $o_1$. Thus, $o_1$ is a better choice than $o_2$ for $q$. Note that, in this paper, we aim at the keyword containment (i.e., Boolean) case, because the number of query keywords is generally small [11] and users may prefer the objects to contain all the query keywords. For example, in Fig. 1, $o_3$ only contains the keyword "Phoenix" and hence cannot satisfy the requirement of $q$. The Boolean spatial keyword query also has its own application base [7], [9], [12] (e.g., local search engines). In addition, considering that the objects with sufficient valid time and containing all the query keywords could be too far away to reach, the returned objects should be within a search radius [13].

In this paper, we explore a new form of top-k Boolean spatial keyword queries, i.e., the time-aware Boolean spatial keyword query (TABSKQ), which retrieves a set of the k objects that meet the requirement of TABSKQ specified by a set of keywords and a spatio-temporal description. The answer objects for TABSKQ need to satisfy: (1) each object contains all the query keywords; (2) every object is located in the search radius; and (3) the objects are ranked the highest according to a ranking function that combines their distances to a query location and the length of the overlap between the query time interval and the object opening time.

In some real-life applications, users may also prefer to provide a time stamp instead of a time interval. As an example, a user would like to find the nearest beer-bar that is open at 1:00 am. In this case, the returned answer objects should be valid at the query time stamp, which is a variant of TABSKQ. It is worth mentioning that the proposed techniques in this paper can also efficiently process this query variation.

A straightforward approach to address TABSKQ is to first retrieve the nearest objects and then check their textual and temporal information in order to choose the k objects as the result. Unfortunately, this method might return many unqualified objects in the first step, which degrades performance. In this paper, we propose a pruning strategy which can shrink the search space using keywords and temporal constraint simultaneously. Based on the pruning strategy, we present an efficient index structure, termed the TA-tree, to further enhance pruning efficiency. Using the TA-tree, we propose pruning strategies and algorithms for TABSKQ.

• G. Chen, J. Zhao, Y. Gao, and R. Chen are with the College of Computer Science, Zhejiang University, 38 Zheda Road, Hangzhou 310027, P. R. China. E-mail: {cg, zhaojw, gaoyj, cr}@zju.edu.cn.

• L. Chen is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, P. R. China. E-mail: [email protected].

Manuscript received 7 May 2016; revised 28 June 2017; accepted 13 Aug. 2017. Date of publication 22 Aug. 2017; date of current version 4 Oct. 2017. (Corresponding author: Yunjun Gao.)
Recommended for acceptance by D. Barbosa.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2017.2742956

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 29, NO. 11, NOVEMBER 2017

1041-4347 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

To sum up, the key contributions of this paper are as follows:

• We formalize the problem of TABSKQ. To the best of our knowledge, there is no previous work on TABSKQ.

• We present pruning strategies and a baseline algorithm for supporting TABSKQ based on the CIR+-tree (i.e., the CIR-tree with temporal information).

• We design a novel and hybrid index structure, i.e., the TA-tree, and propose efficient pruning strategies and algorithms to tackle TABSKQ.

• We extend our techniques to handle an interesting variant of TABSKQ, i.e., joint TABSKQ (JTABSKQ), which aims at jointly processing a set of TABSKQs.

• We conduct extensive experiments with real datasets to offer insight into the performance of our proposed indices and algorithms.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 formulates the problem of TABSKQ. Section 4 presents the CIR+-tree with the algorithm for TABSKQ. Section 5 elaborates the TA-tree and its corresponding algorithms for answering TABSKQ. Section 6 extends our methods to tackle JTABSKQ. Considerable experimental results and our findings are reported in Section 7. Finally, Section 8 concludes the paper with some directions for future work.

2 RELATED WORK

In this section, we overview the existing work related to spatial keyword queries and temporal information retrieval, respectively.

2.1 Spatial Keyword Queries

Spatial keyword queries have received much attention in recent years. They can be roughly divided into Boolean spatial keyword queries and ranked spatial keyword queries. Existing efforts on Boolean spatial keyword queries include [6], [7], [12], [14], and those on ranked spatial keyword queries include [1], [3], [15]. A comprehensive experimental evaluation of different spatial keyword query indices and query processing techniques is presented in [16].

Recently, many spatial keyword query variants have also been studied (see [17] for a comprehensive survey). These include the joint spatial keyword query [9], the collective spatial keyword query [11], [18], [19], the continuously moving spatial keyword query [20], [21], reverse spatial and textual k nearest neighbor search [22], [23], the reverse top-k geo-social keyword query [24], the spatial keyword query over streaming data [25], [26], main-memory spatial keyword query processing [13], and why-not questions on spatial keyword queries [27], to name but a few.

However, all the above query models consider only the spatio-textual information but not the valid time of the objects, and hence, they cannot support TABSKQ.

2.2 Temporal Information Retrieval

Temporal information retrieval has been a topic of interest in the literature (cf. [28] for a good survey). To name just a few, Berberich et al. [29] propose a solution for time-travel text search by extending the inverted file index; Anand et al. [30] present an incremental sharding technique for indexing web archives; He and Suel [31] use index partition techniques that consider index compression to achieve high query throughput; Manica et al. [32] present a comprehensive study of search engines on the exploitation of temporal information. It is worth mentioning that these studies consider only the textual and temporal information but not the spatial proximity, and thus, they cannot handle TABSKQ.

3 PROBLEM FORMULATION

We first detail our settings and then define the TABSKQ. Table 1 summarizes the symbols used frequently throughout this paper.

Given a dataset $D = \{o_1, o_2, \ldots\}$, an object $o$ in $D$ is denoted as a tuple ($o.loc$, $o.key$, $o.t$), in which $o.loc$ is a spatial location, $o.key$ is a set of keywords, and $o.t$ is the valid time of $o$ in the form of ($o.st$, $o.et$), with $o.st$ and $o.et$ being the starting time stamp and the ending time stamp, respectively. In this paper, a short period of time is defined as a time unit, and any time stamp belongs to a time unit. Note that, in the examples of the paper, we use an hour as a time unit and one day as a time period $G$, so that there are 24 distinct time units in $G$. As an example, any time stamp within [12:00, 13:00) can be converted to time unit 12, and the valid time of $o_3$ in Fig. 1 can be converted to (7, 18).
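The conversion from a time stamp to a time unit can be sketched as follows. This is a minimal illustration that assumes time stamps are given as "HH:MM" strings and that a time unit is one hour, as in the examples of the paper; the function name `to_time_unit` and its parameters are hypothetical.

```python
def to_time_unit(timestamp, unit_minutes=60):
    """Map an 'HH:MM' time stamp to the index of the time unit containing it.

    Assumption: the day starts at 00:00 and is split into equal units of
    `unit_minutes` minutes (60 gives the 24 hourly units used in the examples).
    """
    hours, minutes = map(int, timestamp.split(":"))
    return (hours * 60 + minutes) // unit_minutes


# Every time stamp within [12:00, 13:00) falls into time unit 12.
print(to_time_unit("12:00"), to_time_unit("12:59"), to_time_unit("13:00"))
```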

TABLE 1
Symbols and Description

Notation                          Description
$w_l$                             the $l$th most frequent keyword in the dataset
$\Pr(w_l)$                        the occurrence probability of a keyword $w_l$
$key_F$ ($key_I$)                 the frequent keyword (the infrequent keyword)
$N_i$                             the TA-tree node pointed to by the No. $i$ pointer of its parent node
$o.t^i_{st}$ ($o.t^i_{et}$)       the keyword sensitive starting (end) time of $o$
$N.size$                          the number of the time units in a TA-tree node $N$
$N.fo$                            the fanout of a TA-tree node $N$
$N.mins$                          the minimal $o.t^i_{st}$ of the objects in a TA-tree node $N$
$N.maxe$                          the maximal $o.t^i_{et}$ of the objects in a TA-tree node $N$

Fig. 1. Example of a TABSKQ.

From the example depicted in Fig. 1, we observe that users may prefer the objects to have longer valid time. Assume that the query time interval is $q.t$. The temporal overlap ratio $\tau(q.t, o.t)$ is defined as $\tau(q.t, o.t) = \frac{|q.t \cap o.t|}{|q.t|}$. Specifically, $\tau(q.t, o.t)$ is measured by the overlap between $q.t$ and $o.t$ (i.e., $|q.t \cap o.t|$), and we use $|q.t|$ to normalize it. Next, we formally define TABSKQ.

Definition 1 (TABSKQ). Given a dataset $D$ and a time-aware Boolean spatial keyword query (TABSKQ) $q = (q.loc, q.key, q.t, q.r)$, where $q.loc$ is a query location, $q.key$ is a set of query keywords that represent the user's preferences [33], $q.t$ is a query time interval, and $q.r$ is a search radius. Let $D(q.key) = \{o \in D \mid q.key \subseteq o.key\}$ be the objects in $D$ containing all the query keywords. The result set $S_r(q)$ of $q$ is a subset of $D(q.key)$ including the $k$ objects such that $\forall o \in S_r(q), Dst(q, o) < q.r$ and $\forall o \in S_r(q) \wedge \forall o' \in (D(q.key) - S_r(q)), F(o) \geq F(o')$. Here

$$F(o) = \frac{\tau(q.t, o.t)}{1 + \alpha \times Dst(q, o)}, \quad (1)$$

in which $Dst(q, o)$ is the Euclidean distance from $q$ to $o$, and $\alpha$ is a parameter used to balance temporal overlap ratio and spatial distance.

Intuitively, if an object has a higher temporal overlap ratio or is closer to the query location, it has a higher score based on Eq. (1). Moreover, if the query time is a time stamp, we measure the scores of the objects according to their Euclidean distances to the query location. The value of $\alpha$ is assigned in the same way as in [34]. For instance, if $\alpha = 0$, only temporal relevance is considered; and if $\alpha > 0$, the importance of the spatial proximity increases. For ease of understanding, we fix $\alpha = 1$ in this paper, but the proposed algorithms remain applicable when $\alpha$ varies.
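The ranking function of Eq. (1) can be sketched in a few lines. This is only an illustration under stated assumptions: intervals are (start, end) pairs on a continuous time axis, locations are 2D points, and the names `overlap_ratio` and `score` are hypothetical rather than taken from the paper.

```python
import math


def overlap_ratio(q_t, o_t):
    """Temporal overlap ratio: |q.t ∩ o.t| / |q.t| for (start, end) intervals."""
    q_start, q_end = q_t
    o_start, o_end = o_t
    if q_end <= q_start:
        raise ValueError("the query interval must have positive length")
    overlap = max(0.0, min(q_end, o_end) - max(q_start, o_start))
    return overlap / (q_end - q_start)


def score(q_loc, q_t, o_loc, o_t, alpha=1.0):
    """Ratio-combination score of Eq. (1); alpha balances time against distance."""
    return overlap_ratio(q_t, o_t) / (1.0 + alpha * math.dist(q_loc, o_loc))


# Hypothetical object: valid for half of the query interval, at distance 5.
print(score((0.0, 0.0), (14, 16), (3.0, 4.0), (15, 20)))
```

With `alpha=0` the distance term vanishes and the score reduces to the overlap ratio, which matches the remark that only temporal relevance is then considered.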

It is worth mentioning that the linear combination measurement $F'(o) = \alpha' \times \tau(q.t, o.t) + (1 - \alpha') \times \left(1 - \frac{Dst(q, o)}{maxDst}\right)$ is also an operational metric function for TABSKQ, where $maxDst$ is the maximal distance between any two objects in the dataset. In general, Ratio Combination (i.e., $F(o)$) [21], [34] and Linear Combination [1] are two popular ranking functions utilized in spatial keyword queries. We employ $F(o)$ instead of $F'(o)$ to avoid normalizing the distances, which is the requirement for using $F'(o)$. Nonetheless, our presented approach can also support $F'(o)$.
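For comparison, the linear combination $F'(o)$ can be sketched as below. The function name `linear_score` and the default weight of 0.5 are assumptions made for illustration; the overlap ratio and the distances are taken as already computed inputs.

```python
def linear_score(tau, dist, max_dist, alpha_prime=0.5):
    """Linear combination F'(o) of temporal overlap ratio and normalized distance.

    tau:         temporal overlap ratio in [0, 1]
    dist:        distance from the query location to the object
    max_dist:    maximal distance between any two objects (normalizer)
    alpha_prime: weight of the temporal part (hypothetical default)
    """
    if max_dist <= 0:
        raise ValueError("max_dist must be positive")
    return alpha_prime * tau + (1.0 - alpha_prime) * (1.0 - dist / max_dist)


# A fully overlapping object at the query location gets the maximal score.
print(linear_score(tau=1.0, dist=0.0, max_dist=10.0))
```

Unlike Eq. (1), this form needs `max_dist`, which illustrates the normalization requirement mentioned above.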

4 TABSKQ USING CIR+-TREE

In this section, we first briefly introduce the CIR+-tree. Then, we present the baseline TABSKQ using CIR+-tree (TC) algorithm, and analyze its time complexity.

4.1 The CIR+-Tree

The CIR+-tree is built upon the CIR-tree [1] with temporal information, and the CIR-tree is an IR-tree [1] variant extended with object categories. The idea of the CIR-tree is to group objects into clusters according to the objects' textual similarity, so that the objects in the same cluster are more similar (w.r.t. textual description) than those in different clusters. Note that the proposed techniques in this section are also applicable to the IR-tree. However, as confirmed in [1], the CIR-tree is more efficient compared with the IR-tree because it can reduce the number of false hits [14]. Here, we say that it is a false hit if a non-leafnode $R$ passes the test of the query but does not contain any object covering the query keywords.

We now detail the leafnode and non-leafnode of the CIR+-tree. The leafnode is of the form ($ln$, $ln.doc$, $ln.c$, $ln.t$), in which $ln$ refers to an object, $ln.doc$ is the textual description of $ln$, $ln.c$ denotes the cluster that $ln$ belongs to, and $ln.t$ is the valid time of $ln$. A non-leafnode $R$ is of the form ($cp$, $rec$, $R.UT$, {$cp.cluster$}, {$cp.cluster.doc$}), where $cp$ represents the address of $R$, $rec$ is the minimum bounding rectangle (MBR) of $R$, $R.UT$ is a set of the time intervals that are the union of the valid time w.r.t. the objects in $R$, {$cp.cluster$} denotes a set of cluster identifiers, and {$cp.cluster.doc$} is a set of document identifiers describing the textual information of the clusters in {$cp.cluster$}. In general, there are two new elements added in the CIR+-tree compared with the CIR-tree, i.e., the $R.UT$ in the non-leafnode and the $ln.t$ in the leafnode.

4.2 TC Algorithm

Given a TABSKQ $q$, TC algorithm progressively retrieves the nearest candidate objects with the help of the CIR+-tree until no other objects are better than the candidate objects. During the search, it prunes away the objects using textual and temporal information respectively, and returns the objects having the highest $F(o)$ scores among the candidate objects as the result. Two pruning strategies are also developed to boost query performance. First, we present the upper bound of the score $F(R)$ of a CIR+-tree non-leafnode $R$ w.r.t. the query $q$, and judge if the objects in $R$ are answer objects. Second, we derive the upper bound $\epsilon$ of the distances between $q$ and the objects in the dataset. If TC algorithm encounters an object $o$ or a CIR+-tree node whose distance to $q$ is no less than $\epsilon$, it can stop the search since all the unfound objects cannot be the answer objects.

Lemma 1. Let a max-priority queue $Result$ maintain the candidate objects found so far, with the objects $o \in Result$ sorted in descending order of their $F(o)$ values. Given a TABSKQ $q$ and a CIR+-tree non-leafnode $R$, the upper bound of score $F(R)$ w.r.t. $q$ can be computed as $F(R) = \frac{\tau(q.t, R.UT)}{1 + \alpha \times Dst(q, R)}$, where $Dst(q, R)$ is the minimal Euclidean distance between $q$ and the MBR of $R$, and $\tau(q.t, R.UT)$ is the temporal overlap ratio between $q.t$ and $R.UT$. Then, if $F(R) \leq F(o_k)$ ($o_k$ is the $k$th object in $Result$), all the objects in $R$ can be pruned.

Proof. For any object $o$ in $R$, we have $Dst(q, R) \leq Dst(q, o)$. Also, since $R.UT$ is the union of the valid time for the objects in $R$, we have $\tau(q.t, o.t) \leq \tau(q.t, R.UT)$, and hence $F(o) \leq F(R)$ for $\forall o \in R$. If $F(R) \leq F(o_k)$ holds, none of the objects in $R$ can be real answer objects, and thus, they can be discarded. □

Using Lemma 1, we can prune away the whole subtree rooted by $R$ if $F(R) \leq F(o_k)$. Next, we propose an early stop condition.
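The bound of Lemma 1 can be sketched as follows. The sketch assumes that an MBR is given as (xmin, ymin, xmax, ymax) and that $R.UT$ is a list of (start, end) intervals that may overlap each other; the helper names are hypothetical.

```python
import math


def min_dist_to_mbr(q_loc, mbr):
    """Minimal Euclidean distance from a point to an axis-aligned rectangle."""
    x, y = q_loc
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


def union_overlap_ratio(q_t, intervals):
    """Fraction of q_t covered by the union of the given intervals."""
    q_start, q_end = q_t
    cursor, covered = q_start, 0.0
    for start, end in sorted(intervals):
        start, end = max(start, cursor), min(end, q_end)
        if end > start:              # count only the part not covered so far
            covered += end - start
            cursor = end
    return covered / (q_end - q_start)


def node_upper_bound(q_loc, q_t, mbr, ut, alpha=1.0):
    """Upper bound F(R) on the score of any object inside the node."""
    tau = union_overlap_ratio(q_t, ut)
    return tau / (1.0 + alpha * min_dist_to_mbr(q_loc, mbr))


# Hypothetical node: MBR at distance 5, valid times covering 2 of the 3 query hours.
bound = node_upper_bound((0.0, 0.0), (16, 19), (3.0, 4.0, 5.0, 6.0), [(8, 17), (16.5, 18)])
print(bound)
```

A node is then pruned whenever `bound` is no larger than the score of the current $k$th candidate.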

Lemma 2. Given a TABSKQ $q$ and a max-priority queue $Result$ storing the candidate objects. If TC algorithm encounters an object or a node whose distance to $q$ is no less than $\epsilon = (1/F(o_k) - 1)/\alpha$ ($o_k$ is the $k$th object in $Result$), the algorithm terminates the search.

Proof. Assume that an object $o_i$ or a non-leafnode $R$ (any object in $R$ is denoted as $o_j$) is to be visited, and its distance to $q$ is $d$ ($\geq \epsilon$). Since the temporal overlap ratio is at most 1, the maximal scores of the objects are $F(o_i) = 1/(1 + \alpha \times d) \leq F(o_k)$ or $F(o_j) \leq 1/(1 + \alpha \times d) \leq F(o_k)$ according to Definition 1. In other words, the scores of those objects are no larger than $F(o_k)$. In addition, since TC retrieves the nearest objects progressively using the CIR+-tree, the scores of all the unfound objects cannot be larger than $F(o_k)$, and the algorithm can stop the search accordingly. □

Note that $\epsilon$ is initially set as $q.r$ in TC algorithm, as the objects with their distances larger than $q.r$ cannot be answer objects, according to Definition 1. Now we are ready to present TC algorithm, with its pseudo-code shown in Algorithm 1. In lines 1–2, two priority queues are initialized. Specifically, $Queue$ is a min-priority queue to keep track of the nodes and objects in the CIR+-tree that have yet to be visited, and $Result$ is a max-priority queue used for storing the candidate objects. The key values of the two priority queues are $Dst(q, Element)$ and $F(o)$, respectively. In lines 4–17, TC incrementally retrieves the nearest objects containing $q.key$. When an element is dequeued from $Queue$, TC checks whether $Dst(q, Element)$ is smaller than $\epsilon$. If yes, TC proceeds to execute the following steps. If the element is a non-leafnode and the keywords of its child nodes in the same cluster contain all the query keywords, the child nodes are added to $Queue$ (lines 8–12). If the element is an object, the algorithm checks its keywords to determine whether the object should be inserted into $Result$ (lines 13–17). Finally, the top-k objects in $Result$ are returned as the query result.

Algorithm 1. TABSKQ Using CIR+-Tree (TC)

Input: a CIR+-tree $T_C$ on a dataset $D$, a TABSKQ $q = (q.loc, q.key, q.t, q.r)$, the number $k$ of the objects requested, and a balanced parameter $\alpha$
Output: the result set $S_r$ of $q$
1: Queue ← a new min-priority queue
2: Result ← a new max-priority queue
3: $S_r$ ← ∅ and $\epsilon$ = $q.r$
4: Queue.Enqueue($T_C$.RootNode, Dst($q$, $T_C$.RootNode))
5: while Queue is not empty do
6:     Element ← Queue.Dequeue( )
7:     if Dst($q$, Element) < $\epsilon$ then // Lemma 2
8:         if Element is a non-leafnode then
9:             for each entry $E$ in Element do
10:                if $E.cluster.doc$ covers $q.key$ then
11:                    if $F(E)$ > $F(o_k)$ then // Lemma 1
12:                        Queue.Enqueue($E$, Dst($q$, $E.rec$))
13:        else if Element is an object $o$ then
14:            if $o.key$ covers $q.key$ and $F(o)$ > $F(o_k)$ then
15:                delete $o_k$ from Result
16:                Result.Enqueue($o$, $F(o)$) and update $\epsilon$
17:    else break
18: return $S_r$ formed by the objects in Result
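The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the CIR+-tree itself: each node merely summarizes the objects below it (MBR, union of keywords, list of valid times), the cluster documents are replaced by a plain keyword-union test, and all class and function names as well as the sample objects are hypothetical.

```python
import heapq
import itertools
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    loc: tuple        # (x, y)
    keys: frozenset   # keywords of the object
    t: tuple          # valid time (start, end)


class Node:
    """Simplified index node that summarizes the objects below it."""

    def __init__(self, children):
        self.children = children
        self.objs = [o for c in children
                     for o in (c.objs if isinstance(c, Node) else [c])]
        xs = [o.loc[0] for o in self.objs]
        ys = [o.loc[1] for o in self.objs]
        self.mbr = (min(xs), min(ys), max(xs), max(ys))
        self.keys = frozenset().union(*(o.keys for o in self.objs))
        self.ut = [o.t for o in self.objs]        # valid times, unioned in tau()


def tau(q_t, intervals):
    """Fraction of q_t covered by the union of the given intervals."""
    cursor, covered = q_t[0], 0.0
    for start, end in sorted(intervals):
        start, end = max(start, cursor), min(end, q_t[1])
        if end > start:
            covered, cursor = covered + (end - start), end
    return covered / (q_t[1] - q_t[0])


def dist(q_loc, entry):
    """Exact distance to an object, minimal distance to a node's MBR."""
    if isinstance(entry, Obj):
        return math.dist(q_loc, entry.loc)
    x0, y0, x1, y1 = entry.mbr
    return math.hypot(max(x0 - q_loc[0], 0.0, q_loc[0] - x1),
                      max(y0 - q_loc[1], 0.0, q_loc[1] - y1))


def bound(q_loc, q_t, entry, alpha):
    """F(o) for an object, the Lemma 1 upper bound F(R) for a node."""
    times = [entry.t] if isinstance(entry, Obj) else entry.ut
    return tau(q_t, times) / (1.0 + alpha * dist(q_loc, entry))


def tc_search(root, q_loc, q_keys, q_t, q_r, k, alpha=1.0):
    tie = itertools.count()                       # keeps heap entries comparable
    queue = [(dist(q_loc, root), next(tie), root)]
    result = []                                   # min-heap; result[0] plays o_k
    eps = q_r
    while queue:
        d, _, elem = heapq.heappop(queue)
        if d >= eps:                              # Lemma 2: early stop
            break
        f_ok = result[0][0] if len(result) == k else 0.0
        if isinstance(elem, Node):
            for child in elem.children:           # keyword test, then Lemma 1
                if q_keys <= child.keys and bound(q_loc, q_t, child, alpha) > f_ok:
                    heapq.heappush(queue, (dist(q_loc, child), next(tie), child))
        else:
            score = bound(q_loc, q_t, elem, alpha)
            if score > f_ok:
                heapq.heappush(result, (score, next(tie), elem))
                if len(result) > k:
                    heapq.heappop(result)         # drop the former o_k
                if len(result) == k and alpha > 0:
                    eps = min(eps, (1.0 / result[0][0] - 1.0) / alpha)
    return [(o.name, s) for s, _, o in sorted(result, reverse=True)]


# Hypothetical data; the query asks for {key3, key5} during (16, 19).
o1 = Obj("o1", (2.3, 0.0), frozenset({"key2", "key5"}), (9, 12))
o3 = Obj("o3", (0.0, 0.8), frozenset({"key2", "key4"}), (7, 18))
o4 = Obj("o4", (0.9, 0.0), frozenset({"key3", "key5"}), (10, 17))
o5 = Obj("o5", (0.5, 0.0), frozenset({"key2", "key3", "key5"}), (8, 15))
o7 = Obj("o7", (0.0, 1.2), frozenset({"key3", "key5"}), (17, 22))
root = Node([Node([o5, o3, o4]), Node([o7, o1])])
query_keys = frozenset({"key3", "key5"})
print(tc_search(root, (0.0, 0.0), query_keys, (16, 19), math.inf, 1))
```

While fewer than $k$ candidates are known, the sketch treats $F(o_k)$ as 0, so objects and nodes with no temporal overlap are never enqueued. Because the queue is ordered by distance, the `break` is safe: every remaining entry is at least as far away as the one just dequeued.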

Example 1. A sample dataset is depicted in Fig. 2. Consider a top-1 TABSKQ $q$ = ($q.loc$, $q.key$ = {$key_3$, $key_5$}, $q.t$ = (16:00, 19:00), $q.r = \infty$). Here, $q.loc$ is the query location (i.e., the solid dot) in Fig. 2a. The Euclidean distances from $q$ to all objects and MBRs in the CIR+-tree are listed in Table 2. The root node $R_7$ is first dequeued from $Queue$, and its child nodes ($R_5$ and $R_6$) are inserted back into $Queue$. TC pops the non-leafnodes in ascending order of their distances to $q$ until an object is dequeued. $o_5$ and $o_3$ are the first two objects dequeued. However, they are not valid during $q.t$. Next, $o_4$ is dequeued, and it meets the requirement of $q$. Thus, we have $F(o_4)$ = 0.33/(1 + 0.9) = 0.174 and $\epsilon$ = 4.75. $o_7$ is dequeued later. As $F(o_7)$ = 0.67/(1 + 1.2) = 0.305 $\geq F(o_4)$, $o_7$ is updated as the current answer object and $\epsilon$ = 2.28. The next dequeued elements are $R_1$ and $R_4$. Since $F(o_7) = 0.305 > F(R_1) = 0$ and $F(o_7) > F(R_4) = 0$, $R_1$ and $R_4$ are pruned away according to Lemma 1. As $Queue$ is empty, the algorithm terminates, and {$o_7$} is returned as the final result of $q$.
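The threshold values in Example 1 can be checked numerically with the formula of Lemma 2. The function name `stop_distance` is hypothetical, and capping the threshold by the search radius is an assumption that mirrors the initialization of the threshold to $q.r$.

```python
import math


def stop_distance(f_ok, alpha=1.0, radius=math.inf):
    """Early-stop threshold (1 / F(o_k) - 1) / alpha of Lemma 2, capped by q.r."""
    if f_ok <= 0.0 or alpha <= 0.0:
        return radius            # no finite bound can be derived yet
    return min(radius, (1.0 / f_ok - 1.0) / alpha)


print(round(stop_distance(0.174), 2))   # threshold after o4 becomes the candidate
print(round(stop_distance(0.305), 2))   # threshold after o7 replaces o4
```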

4.3 Discussion

In this section, we analyze the time complexity of TC algorithm. Given an object $o$ and a TABSKQ $q$, $o$ is a candidate object for $q$ when (i) $o.key$ contains $q.key$, and (ii) $q.t$ overlaps with $o.t$. Let the probabilities of the two events be the keyword containment probability $\Pr(o.key \supseteq q.key)$ and the temporal overlap probability $\Pr(q.t \cap o.t \neq \emptyset)$. The probability of $o$ being a candidate object for $q$ is $\Pr_{can}(o) = \Pr(o.key \supseteq q.key) \times \Pr(q.t \cap o.t \neq \emptyset)$. Without loss of generality, we assume that $q.r = \infty$.

Then, we present the keyword containment probability and the temporal overlap probability as follows.

Keyword Containment Probability. As shown in [9], [35], the keyword frequency of geo-tagged objects follows a Zipf distribution, i.e., the occurrence probability of the $i$th most frequent keyword $w_i$ is computed as $\Pr(w_i) = i^{-s} / \sum_{j=1}^{|Voc|} j^{-s}$, in which $s$ characterizes the skewness of the distribution, and $|Voc|$ is the total number of distinct keywords. According to the estimation model of Wu et al. [9], we have

$$\Pr(o.key \supseteq q.key) = \sum_{\text{any list of } o.key} \prod_{j=1}^{z} \frac{\Pr(w_{q_j})}{1 - \sum_{e=1}^{j-1} \Pr(w_{q_e})} \times \prod_{k=z+1}^{m} \frac{\Pr(w_{o_k})}{1 - \sum_{e=1}^{z} \Pr(w_{q_e}) - \sum_{f=z+1}^{k-1} \Pr(w_{o_f})}.$$
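The Zipf occurrence probabilities can be computed directly from the ranks. The sketch below assumes the standard normalization over ranks 1 to $|Voc|$; the function name `zipf_probabilities` is hypothetical.

```python
def zipf_probabilities(vocab_size, s=1.0):
    """Occurrence probability of the i-th most frequent keyword, i = 1..vocab_size."""
    weights = [rank ** -s for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_probabilities(vocab_size=4, s=1.0)
print(probs)   # a few frequent keywords carry most of the probability mass
```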

Temporal Overlap Probability. Assume that the length of object valid time follows a Gaussian distribution: $x \sim N(\varepsilon, 1)$. The rationale behind this is that people's work time is mostly centered in the daytime. Let $x$ be the object starting valid time, and $y$ be the starting query time. If the query time interval overlaps with the object valid time, we have $x \leq y + |q.t|$ and $y \leq x + |o.t|$. Hence, we have

$$\Pr(q.t \cap o.t \neq \emptyset) = 1 - \int_{|q.t|}^{G} f(x)\,dx \int_{0}^{x - |q.t|} g(y)\,dy - \int_{0}^{G - |o.t|} f(x)\,dx \int_{x + |o.t|}^{G} g(y)\,dy,$$

in which $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x - \varepsilon)^2}{2}}$ and $g(y) = 1/G$. In the end, we have

$$\Pr(q.t \cap o.t \neq \emptyset) \approx \frac{|o.t|^2}{T^2} + \frac{1}{\sqrt{2\pi}\,T}\left(e^{-\frac{|q.t|^2}{2}} - 1\right).$$

Fig. 2. A running example for the TABSKQ.

TABLE 2
Distances from q to Objects and MBRs in Fig. 2

Obj    Dst    Obj    Dst    MBR    Dst    MBR    Dst
o1     2.3    o5     0.5    R1     2.0    R5     0.4
o2     2.2    o6     2.1    R2     0.4    R6     0.6
o3     0.8    o7     1.2    R3     0.6    R7     0
o4     0.9    o8     2.6    R4     2.1    –      –


If the query $q$ requests $k$ objects, the total number of the objects accessed (denoted as $h$) is $k / \Pr_{can}(o)$. Since every nearest neighbor query takes $O(\lg |D|)$ [36], the time complexity of TC is

$$O(h \times \lg |D|) = O(\lg |D| \times k / \Pr_{can}(o)). \quad (2)$$
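The estimate behind Eq. (2) amounts to simple arithmetic, sketched below. The probabilities used are hypothetical inputs, not values from the paper, and the logarithm is taken to base 2 as an assumption about the meaning of lg.

```python
import math


def expected_accesses(k, pr_keywords, pr_overlap):
    """Expected number h of objects accessed before k candidates are found."""
    pr_can = pr_keywords * pr_overlap
    if pr_can <= 0.0:
        raise ValueError("the candidate probability must be positive")
    return k / pr_can


def tc_cost(k, pr_keywords, pr_overlap, dataset_size):
    """Cost estimate h * lg|D| of Eq. (2), up to constant factors."""
    return expected_accesses(k, pr_keywords, pr_overlap) * math.log2(dataset_size)


# Hypothetical setting: 25% keyword containment, 50% temporal overlap, 1024 objects.
print(expected_accesses(10, 0.25, 0.5), tc_cost(10, 0.25, 0.5, 1024))
```

The smaller either probability is, the more nearest objects TC has to inspect before it collects $k$ candidates.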

5 TABSKQ USING TA-TREE

In this section, we first present a pruning strategy. Then, we propose the TA-tree and algorithms for processing the TABSKQ using the TA-tree. Finally, we analyze their time complexities and correctness.

5.1 Textual and Temporal Pruning

As described in Section 4, TC algorithm first progressively retrieves the nearest objects as the candidate objects. Then, it checks their textual and temporal information in order to choose the k objects as the answer objects for the TABSKQ. Unfortunately, it may return too many unqualified objects as the candidate objects, which degrades its performance. For example, when the number of the objects whose keywords cover the query keywords is small, or when the number of the objects that are valid during the query time interval is small, TC could evaluate many unqualified objects in the first step. This inspires us to develop a method to prune the search space with textual and temporal information simultaneously. To this end, we introduce the keyword sensitive valid time, which is a value capturing the textual and temporal information of the objects. We also integrate the query keywords and the query time interval into a one-dimensional value such that the algorithm can prune the objects using textual and temporal information constraints via one comparison.

As mentioned earlier, the keyword frequency of geo-tagged objects follows the Zipf distribution [9], [35], meaning that only a small number of keywords appear frequently while most keywords have low frequencies. Existing approaches ignore this property. In contrast, we distinguish the objects by their keyword frequencies in order to achieve higher query efficiency. Moreover, we utilize a one-dimensional value, i.e., the keyword sensitive valid time, to capture the temporal and textual information of the objects. This value distinguishes the objects by their keyword frequencies, and maps the valid time of the objects to different time periods. To compute it, we first rank the keywords of the objects by frequency. Then, we integrate the keyword frequency rankings with the valid time interval. We proceed to detail the computation of the keyword sensitive valid time.

Given a set $D$ of objects, we cluster the objects into $f + 1$ categories $Cat = \{Cat_1, Cat_2, Cat_3, \ldots, Cat_{f+1}\}$ based on their keyword frequencies. Specifically, we count the frequencies of all the keywords in the dataset and list them in descending order of frequency. The $f$ most frequent keywords are defined as the frequent keywords $key_F$, and the other keywords are defined as the infrequent keywords $key_I$. For the first $f$ categories in $Cat$, the $i$th category $Cat_i$ captures the objects containing the $i$th most frequent keyword. The $(f + 1)$th category captures the objects containing infrequent keywords. As an example, the keyword frequencies of the objects in Fig. 2 are $\{(key_5, 5), (key_2, 4), (key_3, 3), (key_4, 3), (key_1, 2), (key_6, 2)\}$. We take the first two keywords, i.e., $key_5$ and $key_2$, as the frequent keywords. Thus, the categories of the dataset $D$ are $Cat = \{Cat_1 = \{o_1, o_4, o_5, o_7, o_8\}, Cat_2 = \{o_1, o_2, o_3, o_5\}, Cat_3 = D - \{o_1\}\}$.
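The categorization step can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_categories`, the tie-breaking rule for equally frequent keywords, and the toy objects are assumptions, not the dataset of Fig. 2.

```python
from collections import Counter

def build_categories(objects, f):
    """Return (frequent keywords in rank order, list of f + 1 categories)."""
    freq = Counter(kw for kws in objects.values() for kw in kws)
    # Descending frequency; ties broken by keyword name (an assumption).
    ranked = sorted(freq, key=lambda kw: (-freq[kw], kw))
    key_f = ranked[:f]
    # Category i holds the objects containing the i-th most frequent keyword.
    cats = [{o for o, kws in objects.items() if kw in kws} for kw in key_f]
    frequent = set(key_f)
    # Category f + 1 holds the objects with at least one infrequent keyword.
    cats.append({o for o, kws in objects.items() if kws - frequent})
    return key_f, cats

# Hypothetical toy data, chosen only to exercise the function.
toy = {
    "a": {"cafe", "wifi"},
    "b": {"cafe", "bar"},
    "c": {"cafe"},
    "d": {"bar", "pool"},
}
key_f, cats = build_categories(toy, f=2)
```

As in the running example, an object may fall into several categories, and an object whose keywords are all frequent does not appear in the last category.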

Based on the categories, we define the keyword sensitive valid time of the objects.

Definition 2 (Keyword Sensitive Valid Time). Given an object $o \in Cat_i$, the keyword sensitive valid time of $o$, denoted by $o.t^i$, is defined as the valid time of $o$ in the $i$th category. The starting and ending times of $o.t^i$ are computed as $o.t^i_{st} = G \times (i - 1) + o.st$ and $o.t^i_{et} = G \times (i - 1) + o.et$, where $G$ is the time period.

In fact, the keyword sensitive valid time of the objects in the first category is in the range of $(0, G)$, the keyword sensitive valid time of the objects in the second category is in the range of $(G, 2G)$, and so forth. Take the objects $o_1$ and $o_4$ in Fig. 2 as an example. Since we take $key_5$ and $key_2$ as the frequent keywords, $o_1$ with keywords $key_2$ and $key_5$ is in $Cat_1$ and $Cat_2$, as depicted in Fig. 3. Thus, the valid time of $o_1$ is in the first and second time periods, and $o_1.t^i$ is computed as $o_1.t^1_{st} = 24 \times (1 - 1) + 6 = 6$, $o_1.t^1_{et} = 24 \times (1 - 1) + 10 = 10$, $o_1.t^2_{st} = 30$, and $o_1.t^2_{et} = 34$. Similarly, we have $o_4.t^1 = (13, 17)$ and $o_4.t^3 = (61, 65)$. It is observed that $o.t^i$, a one-dimensional structure, indicates that $o$ contains the $i$th most frequent keyword $w_i$, and its valid time is $(o.t^i \bmod G)$.
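The shift of Definition 2 can be checked with a short Python sketch. The function name `ks_valid_time` and the dictionary layout are hypothetical; the numbers are those of the $o_1$ and $o_4$ example, with $G = 24$.

```python
G = 24  # length of one time period in hours, as in the running example

def ks_valid_time(ranks, st, et, period=G):
    """Map a valid time (st, et) to one shifted interval per category rank (1-based)."""
    return {i: (period * (i - 1) + st, period * (i - 1) + et) for i in ranks}

# o1 is valid during (6, 10) and belongs to Cat1 and Cat2.
o1 = ks_valid_time([1, 2], 6, 10)
# o4 is valid during (13, 17) and belongs to Cat1 and Cat3.
o4 = ks_valid_time([1, 3], 13, 17)

# Taking the shifted interval modulo G recovers the original valid time.
recovered = tuple(t % G for t in o4[3])
```

Each object thus contributes one interval per category it belongs to, and the category rank can be read back from the period in which the interval lies.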

We now detail how to compute the keyword sensitive query interval. Given a TABSKQ $q$, the keyword sensitive query interval of $q$, denoted as $q.kt$, refers to the time interval corresponding to the $l$th most frequent keyword, where $l$ is defined as the ranking of the query keywords. If the query keywords $q.key$ do not contain any frequent keyword, $l$ is set to $f + 1$. Otherwise, $l$ is assigned the ranking of the least frequent query keyword among $key_F$, i.e., $l = \max\{i \mid w_i \in q.key \wedge w_i \in key_F\}$. The starting and ending times of $q.kt$ are computed as $q.kt_{st} = G \times (l - 1) + q.st$ and $q.kt_{et} = G \times (l - 1) + q.et$. Note that the categories are listed in descending order of the number of objects. Since we set $l$ to the ranking of the least frequent query keyword among $key_F$, we only need to evaluate the qualified category with the fewest objects. Here, we say that a category is qualified iff its corresponding keyword is in $q.key$.

Consider Example 1 again. We take $key_5$ and $key_2$ as $key_F$. Since $q.key = \{key_3, key_5\}$ contains the most frequent keyword $key_5$, $l$ is assigned as 1, and $q.kt$ is computed as $q.kt_{st} = 24 \times (1 - 1) + 16 = 16$ and $q.kt_{et} = 24 \times (1 - 1) + 19 = 19$.
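The rule for choosing $l$ and shifting the query interval can be sketched as follows. The function name `ks_query_interval` and its argument layout are assumptions made for illustration; `key_f` lists the frequent keywords from most to least frequent.

```python
def ks_query_interval(q_keys, key_f, q_st, q_et, period=24):
    """Return (l, (kt_st, kt_et)) for a query with keywords q_keys and time (q_st, q_et)."""
    # 1-based ranks of the frequent keywords that appear in the query.
    ranks = [i for i, kw in enumerate(key_f, start=1) if kw in q_keys]
    # Least frequent matching keyword, or f + 1 when no frequent keyword matches.
    l = max(ranks) if ranks else len(key_f) + 1
    shift = period * (l - 1)
    return l, (shift + q_st, shift + q_et)

key_f = ["key5", "key2"]
example_1 = ks_query_interval({"key3", "key5"}, key_f, 16, 19)
both_frequent = ks_query_interval({"key2", "key5"}, key_f, 16, 19)
none_frequent = ks_query_interval({"key3"}, key_f, 16, 19)
```

The first call mirrors Example 1; the other two show the cases in which the query contains several frequent keywords or none.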

Next, we present Lemma 3 to show how to prune the search space using $o.t^l$ and $q.kt$ by one comparison.

Lemma 3. Given a TABSKQ $q$ and an object $o$, if $q.kt \cap o.t^l = \emptyset$, the object $o$ can be safely pruned away.

Proof. If $q.kt \cap o.t^l = \emptyset$, there are two cases: (i) $o.t^l = \emptyset$, indicating that $o$ does not contain the $l$th most frequent keyword and hence cannot be an answer object; and (ii) $o.t^l \neq \emptyset$ but there is no overlap between $q.kt$ and $o.t^l$, meaning that $t(q.t, o.t) = 0$, and thus, $o$ cannot be an answer object for $q$. The proof completes. □

Lemma 3 indicates that the objects with $o.t^l_{st} > q.kt_{et}$ or $o.t^l_{et} < q.kt_{st}$ can be discarded safely. We proceed to propose the TA-tree to improve query performance. It manages to retrieve all the objects $o$ with $q.kt \cap o.t^l \neq \emptyset$ in $O(\lg(\lg p))$ time, where $p = G \times (f + 1)$ denotes the maximum number of time units indexed by the TA-tree.

Fig. 3. The computation of keyword sensitive valid time.
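The one-comparison test of Lemma 3 can be sketched as a small Python predicate. The function name `can_prune` and the dictionary of shifted intervals are hypothetical, and treating intervals that merely touch at an endpoint as overlapping is an assumption of this sketch.

```python
def can_prune(o_intervals, l, q_kt):
    """Lemma 3 test: True if object o cannot overlap q.kt in category l."""
    interval = o_intervals.get(l)
    if interval is None:
        # Case (i): o has no interval in the l-th category.
        return True
    st, et = interval
    kt_st, kt_et = q_kt
    # Case (ii): the two intervals are disjoint.
    return st > kt_et or et < kt_st

q_kt = (16, 19)                      # Example 1, l = 1
o1 = {1: (6, 10), 2: (30, 34)}       # ends before the query starts
o4 = {1: (13, 17), 3: (61, 65)}      # overlaps the query interval
only_infrequent = {3: (64, 67)}      # not in category 1 at all
```

With these inputs, `o1` and `only_infrequent` are pruned for $l = 1$, while `o4` survives the filter and must be examined further.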

5.2 The TA-Tree

According to Lemma 3, we can prune the objects having $q.kt \cap o.t^l = \emptyset$. However, it is inefficient to check the objects one by one. To address this, we design a new index structure, namely, the TA-tree. It uses the keyword sensitive starting time (i.e., $o.t^i_{st}$) as the search key, and each non-leaf node records the global keyword sensitive valid time of a set of objects. Based on it, we can prune a whole set of objects at once using Lemma 3.

The TA-tree is the combination of a vEB-tree [37] and signature files. A signature file is the superimposition of all the signatures of its entries, and is employed to prune the search space using the textual description of the objects. The vEB-tree is able to index integers efficiently. In our case, however, each point of interest is a complex object with spatio-textual and temporal information. Therefore, we have to extend the vEB-tree to store the objects.
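To illustrate superimposition, the following Python sketch builds a signature by OR-ing one hashed bit per keyword. The signature width, the use of CRC32 as the hash, and all function names are assumptions of this sketch rather than details given in the text.

```python
import zlib

SIG_BITS = 64  # signature width; a hypothetical choice

def keyword_signature(keyword):
    """Hash one keyword to a single bit of a SIG_BITS-wide signature."""
    return 1 << (zlib.crc32(keyword.encode("utf-8")) % SIG_BITS)

def superimpose(keyword_sets):
    """OR together the signatures of every keyword in every set."""
    sig = 0
    for kws in keyword_sets:
        for kw in kws:
            sig |= keyword_signature(kw)
    return sig

def may_cover(sig, query_keywords):
    """False means no entry can contain all query keywords; True needs verification."""
    q_sig = superimpose([query_keywords])
    return (sig & q_sig) == q_sig

sig_file = superimpose([{"key2", "key5"}, {"key3", "key5"}])
```

A signature test of this kind can produce false positives (different keywords may share a bit) but never false negatives, which is why a refinement step on the actual keywords is still required.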

Let the size of the TA-tree be the number of time units it indexes. We suppose the size of the TA-tree is $2^m$ with $2^{m-1} < p \le 2^m$. Given a root node $N_r$ of the TA-tree (its size $N_r.size$ is $2^m$), we define two parameters, $N_r.fo$ and $N.size$, which are employed when building the TA-tree: $N_r.fo = 2^{\lceil (\lg N_r.size)/2 \rceil}$ and $N.size = 2^{\lfloor (\lg N_r.size)/2 \rfloor}$. Specifically, $N_r.fo$ represents the fanout of $N_r$, and $N.size$ denotes how many time units each child node $N$ of $N_r$ indexes. As illustrated in Fig. 4, we build a TA-tree $T$ based on the dataset in Fig. 2. In fact, a TA-tree root node partitions $N_r.size$ time units into $N_r.fo$ child nodes, with each child node indexing $N.size$ time units. In particular, we take $key_5$ and $key_2$ as the frequent keywords, and $p = 24 \times (2 + 1) = 72$. Hence, we have $N_r.fo = 16$ and $N.size = 8$.
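The root parameters follow directly from $p$. A minimal Python sketch, with the hypothetical helper name `ta_tree_root_params`, reproduces the numbers of the running example:

```python
def ta_tree_root_params(G, f):
    """Return (p, m, root fanout, child size) for time period G and f frequent keywords."""
    p = G * (f + 1)
    # Smallest m such that 2^(m-1) < p <= 2^m.
    m = max(1, (p - 1).bit_length())
    fanout = 2 ** ((m + 1) // 2)   # 2^ceil(m / 2)
    child_size = 2 ** (m // 2)     # 2^floor(m / 2)
    return p, m, fanout, child_size

params = ta_tree_root_params(G=24, f=2)
```

For $G = 24$ and $f = 2$, this gives $p = 72$, $m = 7$, a root fanout of 16, and children that each index 8 time units; the product of fanout and child size is always $2^m$.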

We now detail the non-leaf nodes and leaf nodes of the TA-tree. The TA-tree is built in a top-down manner. The nodes in the same level share the same fanout, while nodes in different levels have different fanouts. A non-leaf node $N$, whose parent node is denoted by $N_p$, includes $N.size$, an array $cluster[0, \ldots, N.fo - 1]$, $min$, $max$, and $maxe$. $N.size$ indicates how many time units $N$ indexes. The array contains $N.fo$ pointers to the subtrees of $N$. Here, $N.fo = 2^{\lceil (\lg N.size)/2 \rceil}$ and $N.size = 2^{\lfloor (\lg N_p.size)/2 \rfloor}$. Compared with the vEB-tree, $min$, $max$, and $maxe$ are newly added elements. Specifically, $min$ and $max$ are the minimum and maximum $o.t^i_{st}$ values of the objects in $N$, while $maxe$ is the maximal $o.t^i_{et}$ value of the objects in $N$. These values help us to prune the objects with $q.kt \cap o.t^l = \emptyset$. Given a TABSKQ $q$ and a TA-tree non-leaf node $N$, if $q.kt_{st} > maxe$ or $q.kt_{et} < min$ holds, the whole subtree rooted at $N$ can be discarded according to Lemma 3.

A leaf node of the vEB-tree only contains two digits. In contrast, a leaf node $LN$ of the TA-tree stores two time units and four other elements, viz., $min$, $max$, $T\text{-}m$, and $T\text{-}M$. Both $min$ and $max$ are Boolean values. If $min$ (or $max$) is false (i.e., 0), there is no object $o$ whose $o.t^i_{st}$ is the time unit denoted by $min$ (or $max$), and $T\text{-}m$ (or $T\text{-}M$) is assigned NIL (N). If $min$ (or $max$) is true (i.e., 1), a pointer associated with $min$ (or $max$) is generated, and points to the set $O$ of the objects (at least one object) whose $o.t^i_{st}$ is the time unit represented by $min$ (or $max$). $T\text{-}m$ (or $T\text{-}M$) stores the largest $o.t^i_{et}$ of the objects in $O$. A signature file (SigFile) is created for $O$ to maintain the keywords of those objects.

In addition, we utilize a Grid technique to store the objects in the same leaf node $LN$. Specifically, the Grid technique [38] partitions the Euclidean space into a predefined number of equal-sized cells, and the objects in $LN$ are distributed among the cells accordingly. During the search, we incrementally access the next nearest cells, and compute the scores $F(o)$ of the objects in the cells. According to Lemma 2, if the Euclidean distance from a cell to the query location is larger than the threshold $\epsilon$, the algorithm terminates the search, and therefore, the query cost is reduced significantly.

In what follows, we present our approaches for updating a TA-tree, including the insertion and deletion procedures. The insertion procedure inserts an object $o$ into the TA-tree, and works in a top-down manner by considering three parameters, i.e., the spatial location, the keyword description, and the valid time of the object. It first computes $o.t^i$ of $o$, and sets a variable $x = o.t^i_{st}$. Then, the procedure traverses the TA-tree. Assume that $N$ is the $j$th level non-leaf node of the TA-tree, and its size is $N.size$. We define two parameters, $high_j(x)$ and $low_j(x)$, which are employed when inserting $o$ into the $j$th level non-leaf node of the TA-tree. $high_j(x) = \lfloor x / N.size \rfloor$ indicates that $o$ is inserted into the non-leaf node $N_{high_j(x)}$, and $low_j(x) = (x \bmod N.size)$ is used to insert $o$ into the $(j + 1)$th level of the TA-tree. Note that $N_{high_j(x)}$ is the child node of $N$ referenced by the pointer numbered $high_j(x)$. After $o$ is inserted into $N$, we set $x = low_j(x)$, and insert $o$ into the child node of $N$ in the same manner. The above procedure continues, and the value of $x$ keeps shrinking until the insertion procedure encounters a leaf node. If $x = 0$ ($x = 1$), we insert $o$ into the object set $O$ pointed to by $min$ ($max$), and the elements $min$ and $T\text{-}m$ ($max$ and $T\text{-}M$) are updated based on $o.t^i$. The keyword set of $o$ is added to the SigFile, and the procedure stops accordingly.

The deletion procedure works in a top-down fashion as well. It traverses the TA-tree in the same way as the insertion operation, and updates the elements $min$, $max$, and $maxe$ in the non-leaf nodes according to the value of $o.t^i$. When the procedure encounters a leaf node, if the set $O$ of the objects pointed to by $min$ ($max$) contains only one object, $min$ ($max$) is assigned false, and $T\text{-}m$ ($T\text{-}M$) is set to NIL after the deletion. If $O$ contains more than one object, $T\text{-}m$ ($T\text{-}M$) is updated based on $o.t^i$ of the remaining objects pointed to by $min$ ($max$).

Fig. 4. Example of the TA-tree for the dataset in Fig. 2.

5.3 TT Algorithm

In this section, we first present two procedures, namely, the searching time unit (ST) algorithm and the searching time unit predecessor (STP) algorithm. The two algorithms help us to find the objects with $o.t^l \cap q.kt \neq \emptyset$. Based on them, we propose the TABSKQ using TA-tree (TT) algorithm.

ST Algorithm. ST aims to find the objects whose keyword sensitive starting time is a specified value $x$, together with their corresponding SigFiles. It takes as inputs a TA-tree $T$ and a specified time unit $x$, and proceeds recursively, with its pseudo-code depicted in Algorithm 2. For a given time unit $x$, it recursively invokes ST($T_{high_j(x)}$, $low_j(x)$) until a leaf node (i.e., $T_i.size = 2$) is encountered (line 5). Here, $T_i$ denotes the subtree rooted at node $N_i$. $high_j(x)$ and $low_j(x)$ are computed in the same way as in the update procedure mentioned above. If a leaf node is encountered, ST returns the objects and their corresponding SigFile pointed to by $min$ or $max$ according to the value of $x$ (lines 1–4).

Algorithm 2. Searching Time Unit (ST)

Input: a TA-tree $T$ on a dataset $D$, a specified time unit $x$
Output: the set $O$ of the objects whose starting time is $x$ and its corresponding signature file $O$.SigFile
1: if $T.size = 2$ then
2:     if $x = 0$ then
3:         return $O$ and $O$.SigFile pointed to by $min$
4:     else return $O$ and $O$.SigFile pointed to by $max$
5: else ST($T_{high_j(x)}$, $low_j(x)$)

Example 2. Using the TA-tree shown in Fig. 4, we give an example of finding the objects whose $o.t^i_{st}$ is 21 (i.e., $x = 21$) by using the ST algorithm. (1) For the second level node $N$, we have $N.size = 8$, and thus, $high_2(21) = \lfloor 21/8 \rfloor = 2$ and $low_2(21) = 21 \bmod 8 = 5$, so ST($T_2$, 5) is invoked. (2) As $N.size$ is 8, the size of its child nodes is 2. We have $high_3(5) = \lfloor 5/2 \rfloor = 2$ and $low_3(5) = 5 \bmod 2 = 1$, and ST($T_{2,2}$, 1) is invoked. (3) Since $T_{2,2}$ is a leaf node and $x = 1$, $o_5$ with its SigFile, pointed to by the element $max$, is returned.
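The descent in Example 2 can be traced with a short Python sketch. It only computes the sequence of $(high, low)$ pairs and does not build the tree; the function name `st_path` and the way the next child size is derived are assumptions consistent with the node-size rule of Section 5.2.

```python
def st_path(x, child_size):
    """Return the (high, low) pairs visited while descending from the root to a leaf.

    child_size is the number of time units indexed by each child of the root.
    """
    path = []
    size = child_size
    while True:
        high, low = divmod(x, size)
        path.append((high, low))
        if size == 2:
            # Nodes of size 2 are leaves; low selects min (0) or max (1).
            return path
        x = low
        # Child size of the next level: 2^floor(lg(size) / 2).
        size = 2 ** ((size.bit_length() - 1) // 2)

path_21 = st_path(21, child_size=8)
path_19 = st_path(19, child_size=8)
```

For $x = 21$ the trace is $(2, 5)$ followed by $(2, 1)$, i.e., subtree $T_2$, then leaf $T_{2,2}$ and its $max$ element, as in Example 2.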

STP Algorithm. We now present the STP algorithm. It aims to find the set $O$ of the objects whose $o.t^i_{st}$ is the largest value smaller than $x$, together with its corresponding SigFile, and works in a top-down manner. Note that those objects share the same $o.t^i_{st}$ value. The pseudo-code of STP is presented in Algorithm 3, and it is also recursive. Lines 1–4 handle the case where $T$ is a leaf node. In this case, if $x = 1$, the set of the objects having $o.t^i_{st} = 0$ is returned. Lines 5–18 deal with the situation where $T$ is a non-leaf node. STP checks whether the smallest time unit indexed by $N_{high_j(x)}$ is smaller than $x$ (line 7). If so, STP recursively invokes itself to retrieve the predecessor in the subtree $T_{high_j(x)}$ (lines 8–9). If not, in lines 10–18, it tries to find the predecessor in the subtrees $T_i$ ($0 \le i < high_j(x)$). In each subtree $T_i$, STP retrieves the objects with $o.t^i_{st} = N.size - 1$ (lines 12–15). If no object is returned, it proceeds to retrieve the predecessor of $N.size - 1$ in this subtree (lines 16–18). Once a set of objects is found, those objects together with their SigFiles are returned, and STP stops.

Algorithm 3. Searching Time Unit Predecessor (STP)

Input: a TA-tree $T$ on a dataset $D$, a specified time unit $x$
Output: the set $O$ of objects whose starting time is just smaller than $x$ in $T$ and its corresponding signature file $O$.SigFile
1: if $T.size = 2$ then
2:     if $x = 1$ then
3:         return $O$ and $O$.SigFile pointed to by $min$
4:     else return NIL
5: else pre-c $= high_j(x)$
6:     pre-min $= N_{high_j(x)}.min$
7:     if pre-min $< x$ then
8:         ($O$, $O$.SigFile) ← STP($T_{high_j(x)}$, $low_j(x)$)
9:         return $O$ and $O$.SigFile
10:    else
11:        while pre-c $\ge 0$ do
12:            pre-c $=$ pre-c $- 1$
13:            ($O$, $O$.SigFile) ← ST($T_{\text{pre-c}}$, $N.size - 1$)
14:            if $O \neq \emptyset$ then
15:                return ($O$, $O$.SigFile) and break
16:            ($O$, $O$.SigFile) ← STP($T_{\text{pre-c}}$, $N.size - 1$)
17:            if $O \neq \emptyset$ then
18:                return ($O$, $O$.SigFile) and break

TT Algorithm. Based on the ST and STP algorithms, we are now ready to propose the TABSKQ using TA-tree (TT) algorithm. TT adopts a filter-and-refinement framework to address the TABSKQ. In the filter phase, it finds the objects with $o.t^l \cap q.kt \neq \emptyset$ using the ST and STP algorithms. TT first finds the objects satisfying $o.t^l_{st} = q.kt_{et}$ using ST, and then retrieves the objects whose $o.t^l_{st}$ values are smaller than $q.kt_{et}$ using STP. If TT encounters a TA-tree node with $maxe < q.kt_{st}$, all the objects in this node can be pruned according to Lemma 3. In the refinement phase, TT prunes the objects that do not contain all the query keywords, and chooses the $k$ objects having the highest scores as the final result.

Algorithm 4 depicts the pseudo-code of TT. It first initializes a max-heap $C$ and a set $R$. $C$ stores candidate objects, with the scores $F(o)$ of the candidate objects serving as the key values. $R$ stores the object sets with their SigFiles returned by the ST and STP algorithms. In the filter phase (lines 3–9), TT invokes ST to retrieve the objects with $o.t^l_{st} = q.kt_{et}$ (line 3), and then calls STP to retrieve the objects with $o.t^l_{st} < q.kt_{et}$ (lines 5–9). Note that the objects with $o.t^l_{st} > q.kt_{et}$ or $o.t^l_{et} < q.kt_{st}$ can be pruned by Lemma 3. Then, TT performs the refinement phase (lines 10–17). For each tuple ($O$, $O$.SigFile) in $R$, if $O$.SigFile covers $q.key$, the algorithm incrementally validates the next nearest objects $o \in O$ to $q$. If $Dst(o, q) < \epsilon$ and $F(o) > F(o_k)$ ($o_k$ is the $k$th object in $C$), $o$ is added to $C$ and TT updates $\epsilon$ accordingly. Finally, the top-$k$ objects in $C$ are inserted into the result set $S_r$ and are returned as the final result. Note that a time stamp is actually a special case of a time interval, and thus, our approach can be easily adjusted to answer time stamp queries.

Algorithm 4. TABSKQ Using TA-Tree (TT)

Input: a TA-tree $T$ on a dataset $D$, a query object $q$ in the form of ($q.loc$, $q.key$, $q.t$, $q.r$), the number $k$ of the objects requested, a balance parameter $\alpha$
Output: the result set $S_r$ for $q$
/* $\epsilon_k = (1/F(o_k) - 1)/\alpha$, and $o_k$ is the $k$th object in $C$. */
1: $S_r$ ← $\emptyset$, $\epsilon = q.r$, $C$ ← a new max-heap, $R$ ← $\emptyset$
2: compute $q.kt = (q.kt_{st}, q.kt_{et})$
3: $R = R \cup$ ST($T$, $q.kt_{et}$) // filter phase
4: $st$ ← $q.kt_{et}$
5: for ($i = high_j(st)$; $i \ge 0$; $i--$) do
6:     if $T_i.maxe \ge q.kt_{st}$ then // Lemma 3
7:         while $T_i.min \le st$ do
8:             $R = R \cup$ STP($T_i$, $low(st)$)
9:             $st$ ← the starting time of $O$
10: for each ($O$, $O$.SigFile) in $R$ do // refinement phase
11:     if $O$.SigFile covers $q.key$ then
12:         for the next nearest object $o \in O$ to $q$ do
13:             if $o.key$ covers $q.key$ then
14:                 if $Dst(o, q) < \epsilon$ and $F(o) > F(o_k)$ then
15:                     delete $o_k$ from $C$
16:                     $C$.Enqueue($o$, $F(o)$) and update $\epsilon$
17:                 else break
18: return $S_r$ formed by the top-$k$ objects in $C$

Example 3. Consider Example 1 again. We have $q.kt = (16, 19)$. TT first finds the objects whose $o.t^i_{st}$ is $q.kt_{et} = 19$ using the ST algorithm. For the second level node $N$ of the TA-tree, we have $N.size = 8$, and thus, $high_2(19) = 2$, $low_2(19) = 3$, and ST($T_2$, 3) is called. As the third level node has size 2, we have $high_3(3) = 1$, $low_3(3) = 1$, and ST($T_{2,1}$, 1) is invoked. However, since $T_{2,1}$ is a leaf node and it does not contain any object, no object is returned. Next, TT utilizes STP to find the objects having $o.t^i_{st} < 19$. As with the ST algorithm, $o_7$ is first retrieved with $o_7.t^1_{st} = 17$. Then, $o_4$ and $o_8$ are also returned as the predecessors of $o_7$. Since the keywords of $o_7$ cover the query keywords, it is inserted into $C$ with $F(o_7) = 0.67/(1 + 1.2) = 0.305$, and $\epsilon$ is updated to $1/0.305 - 1 = 2.28$. As $F(o_4) = 0.33/(1 + 0.9) = 0.174 \le F(o_7)$ and $Dst(q, o_8) = 2.6 \ge \epsilon$, objects $o_4$ and $o_8$ are discarded. Finally, $\{o_7\}$ is returned as the final result.

5.4 Discussion

In what follows, we analyze the time complexities of the ST, STP, and TT algorithms, respectively, and then prove the correctness of the TT algorithm.

Theorem 1. Assume that $p$ is the maximum number of distinct time stamps indexed by the TA-tree with $p = 2^m$. The time complexity of ST is $O(\lg(\lg p))$.

Proof. The recursive procedure on the TA-tree has the running time characterized by the recurrence

$$T(2^m) \le T(2^{\lfloor m/2 \rfloor}) + O(1), \tag{3}$$

where $T(n)$ represents the running time on a TA-tree of size $n$. Since $\lfloor m/2 \rfloor \le 2m/3$ (if $m \ge 2$), we have $T(2^m) \le T(2^{2m/3}) + O(1) \le T(2^{2m/3 \times 2/3}) + O(1) + O(1) \le \cdots \le T(2^0) + O(1) \times O(\lg m) = O(\lg m)$. As $m = \lg p$, $T(p) \le O(\lg(\lg p))$. Consequently, the time complexity of the ST algorithm is $O(\lg(\lg p))$. □
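The effect of recurrence (3) can be observed numerically: each recursive call halves the exponent $m$, so the number of calls grows with $\lg m$. The following Python sketch, with the hypothetical helper `st_recursion_depth`, simply counts the halvings.

```python
def st_recursion_depth(m):
    """Count how many times m can be replaced by floor(m / 2) before reaching 1."""
    depth = 0
    while m > 1:
        m //= 2
        depth += 1
    return depth

depth_running_example = st_recursion_depth(7)   # universe of size 2^7 = 128
depths = {m: st_recursion_depth(m) for m in (1, 2, 16, 1024)}
```

For the running example ($m = 7$) two halvings suffice, matching the two descent steps in Example 2, and in general the count equals $\lfloor \lg m \rfloor$.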

Theorem 2. Assume that $p$ is the maximum number of distinct time stamps indexed by the TA-tree with $p = 2^m$. The time complexity of STP is $O(\lg(\lg p))$.

Proof. Recall that the STP algorithm presented in Algorithm 3, depending on the test in line 7, calls itself recursively in either line 8 (traversing the subtree $T_{high_j(x)}$) or line 16 (retrieving the subtrees in front of $T_{high_j(x)}$). Thus, the recursive procedure has the following running time, characterized by the recurrence

$$T(2^m) \le k_1 \times T(2^{\lfloor m/2 \rfloor}) + O(1). \tag{4}$$

It can be seen that Eq. (4) is similar to Eq. (3). As $k_1 \le N.fo$ (the TA-tree has $N.fo$ subtrees), the time complexity of the STP algorithm is $O(\lg(\lg p))$. □

Theorem 3. Assume that $l$ is the ranking of $q.key$, and $q.r = \infty$. The time complexity of TT is $O(B \times Pr(q.t \cap o.t \neq \emptyset) \times \lg(\lg p))$, in which $B$ is the cardinality of $Cat_l$.

Proof. For TT, only the objects with $o.t^l_{st} \cap q.kt \neq \emptyset$ need to be evaluated. If $1 \le l \le f$, $B$ is $|D| \times Pr(w_l)$, where $w_l$ is the $l$th most frequent keyword in the dataset. If $l = f + 1$, $B$ is $|D| \times Pr(key_I)$, in which $Pr(key_I) = Pr(w_{f+1} \cup w_{f+2} \cup \cdots \cup w_{|voc|})$. Hence, $B = \max\{|D| \times Pr(w_l), |D| \times Pr(key_I)\}$. As mentioned in Section 4.3, the probability of an object $o$ being valid during the query time interval is $Pr(q.t \cap o.t \neq \emptyset)$. Thus, the number of the objects having $o.t^l_{st} \cap q.kt \neq \emptyset$ is $B \times Pr(q.t \cap o.t \neq \emptyset)$. From Theorems 1 and 2, we also know that every access of a TA-tree takes $O(\lg(\lg p))$ time. Consequently, the time complexity of TT is $O(B \times Pr(q.t \cap o.t \neq \emptyset) \times \lg(\lg p))$. □

Finally, we prove the correctness of the TT algorithm.

Theorem 4. Given a TABSKQ $q$, the TT algorithm can correctly return all the answer objects for $q$.

Proof. First, the TT algorithm only prunes away those unqualified objects with $o.t^l \cap q.kt = \emptyset$ according to Lemma 3, by using our proposed TA-tree. Therefore, no answer objects are missed (i.e., no false negatives). Second, each candidate object $o \in C$ is either verified in the refinement step or discarded by Lemma 2, which ensures no false positives. The proof completes. □

6 JOINT TABSKQ

In this section, we study an interesting variant of TABSKQ, i.e., Joint TABSKQ (JTABSKQ), which is the joint processing of a set of TABSKQs. It is mainly motivated by two aspects. First, efficient processing of multiple queries: with a high load of similar queries, joint processing is able to handle multiple queries more efficiently. Second, privacy protection: user query logs at a service provider may be accessed illegally. The fake query technique is often used to offer keyword privacy protection [9], where the true query is hidden among multiple fake queries. Using joint processing, the actual query can be processed together with the fake queries in order to protect user privacy. In view of these, JTABSKQ has significant practical value, and efficient algorithms need to be developed. Below, we formalize JTABSKQ.

Definition 3 (JTABSKQ). Given a set $Q = \{q_i\}$ in which each subquery $q_i$ is a TABSKQ, a joint time-aware Boolean spatial keyword query (JTABSKQ) $Q$ aims to return the result sets for all $q_i \in Q$ simultaneously.

A naive method to support JTABSKQ, namely, the ITT algorithm, is to tackle the subqueries individually, i.e., ITT iteratively invokes TT to retrieve the result set for each $q_i$. However, ITT has an obvious drawback: it may visit the TA-tree multiple times, incurring high I/O and CPU costs. We therefore propose an efficient algorithm, i.e., the Joint TABSKQ using TA-tree (JTT) algorithm, to address JTABSKQ. JTT processes the subqueries concurrently and thus reduces the query cost. It divides the subqueries into groups and prunes the search space for each group of queries as a whole. Unlike ITT, JTT significantly reduces the number of TA-tree accesses and hence improves query performance. In the sequel, we detail the grouping method and the JTT algorithm.

Grouping Method. In order to reduce the high computational cost of JTABSKQ, we present a grouping method, which clusters similar queries in $Q$ into groups $L = \{Q_1, Q_2, \ldots, Q_v\}$. The grouping approach takes into account both query locations and query keywords. After grouping the queries, the number of objects accessed by JTT is $Scan(Q) = \sum_{i=1}^{v} Q_i(candidate) \times |Q_i|$, in which $Q_i(candidate)$ is the number of candidate objects for a query set $Q_i$, and $|Q_i|$ is the number of query objects in $Q_i$. The optimization of JTT needs to consider two criteria: (i) $|Q_i|$ should be small to reduce $Q_i(candidate)$; (ii) the number of groups should be small to cut down the number of visits to the TA-tree. Notice that criterion (i) conflicts with criterion (ii). Next, we prove that finding an optimal grouping solution for JTT is NP-hard.

Theorem 5. The problem of finding a grouping solution that minimizes $Scan(Q)$ is NP-hard.

Proof. This can be proved by reduction from the well-known bin packing problem. Without loss of generality, we assume that $k = 1$, and all the query objects have the same query location. The JTT algorithm first partitions the query objects into $v$ groups, such that for every group, $Q_i(candidate) \times |Q_i| \le \sum_{j=1}^{|Q_i|} q_j(candidate)$, where $q_j(candidate)$ is the number of candidate objects for a query $q_j$. Given an instance of the bin packing problem, each bin corresponds to a group, and the size of every item is $q_j(candidate)$. There is a solution for the bin packing problem that packs all the items in $v$ bins of size $\sum_{j=1}^{|Q_i|} q_j(candidate)$ iff there is a solution for our problem. The proof completes. □

Due to the NP-hardness of the optimization, we present a method that is able to obtain a near-optimal solution. In the first step, for each subquery $q_j$, we determine the ranking $q_j.key^r$ of its query keywords. Then, for each group $Q_i$ of queries in $L$, we randomly assign a query object as the seed $seed_i$, and insert it into $Q_i$. Next, for the rest of the queries $\{q_j \mid q_j \in Q \wedge q_j \notin L\}$, we compute the similarity between $q_j$ and $seed_i$. The similarity is defined as $C(q_j, seed_i) = \beta \times Dst(q_j, seed_i) + (1 - \beta) \times |q_j.key^r - seed_i.key^r|$, in which $\beta$ is a tuning parameter that balances the spatial proximity and the textual similarity. $q_j$ is inserted into the group $Q_i$ with the smallest $C(q_j, seed_i)$. After $L$ contains all the query objects, we update the seed of each group to $seed'_i$ via the mean value method, and then empty all the groups and re-insert all the queries based on the new seeds $seed'_i$.
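One round of this seed-based grouping can be sketched in Python as follows. The sketch makes several assumptions: queries are plain dictionaries with a 2-D location and a keyword ranking, the "mean value method" is read as averaging locations and rankings within a group, and the initial seeds are passed in explicitly (rather than drawn at random) so that the result is reproducible. All names are hypothetical.

```python
import math

def similarity(q, seed, beta):
    """C(q, seed) = beta * Dst + (1 - beta) * |rank difference|; smaller is more similar."""
    dst = math.dist(q["loc"], seed["loc"])
    return beta * dst + (1 - beta) * abs(q["rank"] - seed["rank"])

def assign(queries, seeds, beta):
    """Put every query into the group of its most similar seed."""
    groups = [[] for _ in seeds]
    for q in queries:
        best = min(range(len(seeds)), key=lambda i: similarity(q, seeds[i], beta))
        groups[best].append(q)
    return groups

def mean_seed(group):
    """Assumed 'mean value' seed: average location and average keyword ranking."""
    n = len(group)
    return {
        "loc": (sum(q["loc"][0] for q in group) / n, sum(q["loc"][1] for q in group) / n),
        "rank": sum(q["rank"] for q in group) / n,
    }

def group_queries(queries, seeds, beta):
    """Initial assignment, seed update, then re-insertion of all queries."""
    groups = assign(queries, seeds, beta)
    new_seeds = [mean_seed(g) if g else s for g, s in zip(groups, seeds)]
    return assign(queries, new_seeds, beta)

qs = [
    {"id": 1, "loc": (0, 0), "rank": 1},
    {"id": 2, "loc": (9, 9), "rank": 1},
    {"id": 3, "loc": (0, 1), "rank": 5},
    {"id": 4, "loc": (9, 8), "rank": 5},
]
by_text = [[q["id"] for q in g] for g in group_queries(qs, [qs[0], qs[2]], beta=0.0)]
by_space = [[q["id"] for q in g] for g in group_queries(qs, [qs[0], qs[2]], beta=1.0)]
```

With $\beta = 0$ the toy queries are grouped purely by keyword ranking, and with $\beta = 1$ purely by location, which illustrates the role of the tuning parameter.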

Algorithm 5. JTABSKQ Using TA-Tree (JTT)

Input: a TA-tree $T$ on a dataset $D$, and a JTABSKQ $Q$ in the form of $\{q_1, q_2, q_3, \ldots\}$
Output: the result set $S_G$ for the query set $Q$, which is in the form of $\{s_1, s_2, s_3, \ldots\}$
1: for each result set $s_i$ in $S_G$ do
2:     $s_i$ ← a new max-heap
3: divide the query set $Q$ into $L = \{Q_1, Q_2, \ldots, Q_v\}$
4: for each query set $Q_i$ in $L$ do
5:     $QI$ ← $\emptyset$ and $R$ ← a new set
6:     for each query $q_j \in Q_i$ do
7:         compute ($q_j.st$, $q_j.et$) as ($q_j.kt_{st}$, $q_j.kt_{et}$)
8:         merge the interval ($q_j.kt_{st}$, $q_j.kt_{et}$) with $QI$
9:     for each interval ($t_a$, $t_b$) in $QI$ do // group filtering
10:        $st$ ← $t_b$ and $R = R \cup$ ST($T$, $st$)
11:        for ($i = high_j(st)$; $i \ge 0$; $i--$) do
12:            if $T_i.maxe \ge t_a$ then
13:                while $T_i.min \le st$ do
14:                    $R = R \cup$ STP($T_i$, $low(st)$)
15:                    $st$ ← the starting time of $O$
16:    for each object $o$ in $R$ do // individual refinement
17:        for each query $q_j$ in $Q_i$ do
18:            if $o.key$ covers $q_j.key$ then
19:                if $Dst(q_j, o) < \epsilon_j$ and $F(o) > F(o_k)$ then
20:                    delete $o_k$ from $s_j$
21:                    $s_j$.Enqueue($o$, $F(o)$) and update $\epsilon_j$
22: return $S_G$

JTT Algorithm. JTT takes as inputs a TA-tree $T$ and a JTABSKQ $Q$, and outputs the result set for $Q$. It adopts a group filtering and individual refinement framework. Algorithm 5 shows the pseudo-code of JTT. It divides the query set $Q$ into $L$ (line 3). For each subquery $q_j$ in a group of $L$, the algorithm computes its keyword sensitive query interval, and merges it with $QI$ (lines 6–8). In the group filtering phase, for every query time interval ($t_a$, $t_b$) in $QI$, JTT traverses the TA-tree to retrieve the objects with $(t_a, t_b) \cap o.t^i \neq \emptyset$ using the ST and STP algorithms, and inserts those objects into $R$ (lines 9–15). Thereafter, in the individual refinement phase, for each object $o$ in $R$, it checks whether $o$ satisfies the requirement of $q_j$ (lines 16–21). Finally, the result set $S_G$ for the query set $Q$ is returned (line 22).
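The interval merging of lines 6–8 can be illustrated with a short Python sketch. The text does not spell out how intervals are merged; the sketch assumes that overlapping or touching keyword sensitive query intervals of a group are unioned, so that each stretch of the time axis is searched only once. The function name `merge_intervals` is hypothetical.

```python
def merge_intervals(intervals):
    """Union overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for st, et in sorted(intervals):
        if merged and st <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], et))
        else:
            merged.append((st, et))
    return merged

# Keyword sensitive query intervals of one hypothetical query group.
qi = merge_intervals([(16, 19), (40, 43), (18, 22), (17, 18)])
```

Here three queries in the first time period collapse into the single interval (16, 22), while the query whose interval lies in the second period stays separate.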

7 EXPERIMENTS

In this section, we experimentally evaluate the performance of our proposed indices and algorithms for TABSKQ and JTABSKQ, using real datasets.

7.1 Experimental Setup

We use three real datasets, Yelp, GN1, and GN2. Yelp consists of 12,773 objects, each of which includes an object valid time, a keyword description, and a spatial location. We employ this dataset in our case study. GN1 and GN2 are extracted from the Geographic Names Information System in the USA (see footnote 1), and contain the geographic name usage in the US government. Moreover, we assign each object a period of valid time. Following the theoretical analysis in Section 4.3, the valid times of the objects follow the Gaussian distribution and are generated in the range of [1, 24] hours. GN1 (GN2) contains 199,206 (2,257,651) objects, and the average number of object keywords is 9.7 (9.3). As confirmed in [9], the keyword frequency of the objects in GN1 and GN2 follows the Zipf distribution.

We investigate the performance of our presented algorithms for TABSKQ and JTABSKQ under a variety of parameters listed in Table 3. In the experiments, we measure both the query time and the I/O overhead (i.e., the number of node/page accesses). In every set of experiments, 100 random queries are evaluated, and their average performance is reported. The query objects are created in a way similar to [23]. Specifically, we randomly select 100 objects with more than nine keywords, and then randomly choose nine keywords of each object as the query keywords. All the algorithms were implemented in C++, and all the experiments were conducted on a PC with an AMD FX 8350 4.0 GHz CPU and 16 GB RAM. All indices are disk resident, and the page size is fixed to 4,096 bytes. In order to reduce the I/O cost, each data page holds at most one leaf node of the TA-tree. Also, we assume that the server maintains a 40 MB buffer using LRU as the cache replacement policy.

7.2 Case Study

We verify the effectiveness of our proposed TABSKQ using the Yelp dataset. We consider two types of queries: TABSKQ and the Boolean Spatial Keyword (BSK) query. The difference between them is that the BSK query ignores the temporal information of objects while the TABSKQ does not. We run three top-3 TABSKQs and BSK queries, with their results illustrated in Fig. 5. Table 4 shows the answer objects for the queries. For query $q_j$ ($1 \le j \le 3$), $\{o_{3j-2}, o_{3j-1}, o_{3j}\}$ (denoted by circles) are the top-3 answer objects for the TABSKQ, while $\{o'_{3j-2}, o'_{3j-1}, o'_{3j}\}$ (denoted by squares) are the top-3 answer objects for the BSK query. For instance, $q_1$ is issued by a user who wants to retrieve the objects that are valid during (19:00, 20:00), contain the keywords "Burgers, Fast Food", and are within 10 km. Although $o'_2$ and $o'_3$ returned by the BSK query are close to $q_1$ and contain the query keywords, they are not valid during the query time interval. In contrast, the objects $o_1$, $o_2$, and $o_3$ returned by the TABSKQ satisfy the user's need. As another example, $q_3$ aims to find the objects that are valid during (9:00, 10:00), contain the keyword "Coffee&Tea", and are within 5 km. Although $o'_7$ and $o'_9$ are closer to $q_3$ and contain all the query keywords, they are not valid during the query time. Thus, objects $\{o_7, o_8, o_9\}$ are better choices for $q_3$. In short, TABSKQ is more useful in real-life applications where the validity time of objects matters.

7.3 Tuning Experiments

In this set of experiments, we tune the parameters of the TA-tree in order to achieve its best performance. First, we tune the number $f$ of frequent keywords in the TA-tree. Fig. 6 depicts the query costs on GN1 when $f$ varies from 100 to 500. The TA-tree generally performs better as $f$ increases, and it performs the best on GN1 when $f$ reaches 400. The reason is that the average number of objects in a category decreases as $f$ grows. From Section 5.4, we know that TT only evaluates the objects in one category, and thus, its performance improves when $f$ increases.

TABLE 3
Parameter Settings

| Parameter | Range | Default |
|---|---|---|
| $\lvert q.key \rvert$ (# of query keywords) | 1, 3, 5, 7, 9 | 5 |
| k (# of the objects requested) | 1, 3, 5, 7, 9 | 5 |
| length of query time interval (hours) | 1, 3, 5, 7, 9 | 5 |
| α | 0.5, 1, 1.5, 2, 2.5 | 1 |
| # of queries in JTABSKQ | 1,000 through 5,000 | 1,000 |
| search radius (km) | 4, 8, 12, 16, 20 | 12 |

Fig. 5. Case study.

TABLE 4
The Answer Objects for Queries in Fig. 5

| $q_j$ | $o_i$ | $o_i.t$ | $o_i.key$ | Dst |
|---|---|---|---|---|
| $q_1$ | $o_1$ | (11:00, 22:00) | Burgers, Fast Food, Restaurants | 1.0 |
| | $o_2$ | (12:00, 22:00) | Burgers, Fast Food, Restaurants | 7.3 |
| | $o_3$ | (5:00, 24:00) | Burgers, American (traditional), Fast Food, Restaurants | 8.6 |
| | $o'_2$ | (9:00, 17:00) | Burgers, Fast Food | 2.1 |
| | $o'_3$ | (8:00, 17:00) | Burgers, Fast Food | 2.3 |
| ... | ... | ... | ... | ... |
| $q_3$ | $o_7$ | (7:00, 15:00) | Coffee&Tea, Sandwich | 1.2 |
| | $o_8$ | (7:00, 19:00) | Coffee&Tea, Beer bar, Nightlife | 2.1 |
| | $o_9$ | (6:00, 24:00) | Coffee&Tea, Breakfast | 2.3 |
| | $o'_7$ | (12:00, 24:00) | Coffee&Tea, Beer bar | 1.1 |
| | $o'_9$ | (11:00, 24:00) | Coffee&Tea, Nightlife | 2.2 |

Fig. 6. Tuning the TA-tree for TABSKQ.

1. Available at http://geonames.usgs.gov/domestic.

2610 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 29, NO. 11, NOVEMBER 2017


Next, we tune the number of groups and β in the TA-tree for JTT. Recall that an optimal trade-off between the two criteria is difficult to obtain, as discussed in Section 6. Fig. 7a plots the query costs on GN1 when the number of groups varies from 3 to 11. The TA-tree performs the best when the number of groups is 7. Fig. 7b shows the query costs on GN1 when β changes from 0 to 1, where β is the parameter that balances the textual similarity and the spatial proximity when grouping the queries. The TA-tree performs the best when β = 0, indicating that the queries are grouped by textual similarity only.

We also tune these parameters on the GN2 dataset. The TA-tree performs the best on GN2 when f reaches 2,000, and JTT performs the best when the number of groups is 9 and β = 0. In the rest of the experiments, these values are set as defaults.
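To make the role of the balancing parameter concrete, the following Python sketch blends a textual and a spatial similarity between two queries. The exact grouping criterion is defined in Section 6 and is not reproduced here, so the formula, the Jaccard measure, the distance normalization, and the names `jaccard`, `spatial_proximity`, and `query_similarity` are all assumptions chosen only so that a weight of 0 reduces to pure textual similarity, as reported above.

```python
import math


def jaccard(a, b):
    """Textual similarity of two keyword sets (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def spatial_proximity(p, q, max_dist):
    """Map Euclidean distance into [0, 1]; 1 means co-located."""
    return max(0.0, 1.0 - math.dist(p, q) / max_dist)


def query_similarity(q1, q2, beta, max_dist=20.0):
    """Hypothetical blend: beta weights space, (1 - beta) weights text."""
    text = jaccard(q1["keys"], q2["keys"])
    space = spatial_proximity(q1["loc"], q2["loc"], max_dist)
    return beta * space + (1.0 - beta) * text


qa = {"loc": (0.0, 0.0), "keys": {"coffee", "tea"}}
qb = {"loc": (15.0, 0.0), "keys": {"coffee", "tea"}}  # far away, same keywords
qc = {"loc": (1.0, 0.0), "keys": {"burgers"}}         # nearby, different keywords

for beta in (0.0, 0.5, 1.0):
    print(beta, query_similarity(qa, qb, beta), query_similarity(qa, qc, beta))
```

With a weight of 0, `qa` is grouped with the distant but textually identical `qb`; with a weight of 1, it is grouped with the nearby but textually unrelated `qc`.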

7.4 Indexing Efficiency

This set of experiments evaluates the construction time and storage space of both the CIR+-tree and the TA-tree. When building the TA-tree, the value of parameter f on GN1 (GN2) is 400 (2,000). The experimental results are reported in Table 5. Building the TA-tree is much faster than building the CIR+-tree. For example, the construction time of the TA-tree on GN1 is less than 1/2 of that of the CIR+-tree. A CIR+-tree clusters the objects with similar keywords using the k-means clustering algorithm, whose time cost is polynomial. In contrast, a TA-tree projects the objects to f + 1 disjoint time periods, whose time cost is linear. Also, the storage space of the TA-tree is much smaller than that of the CIR+-tree. The reason is that every leaf node and non-leaf node of the CIR+-tree contains an inverted file, while the TA-tree only keeps signature files for the leaf nodes.

In addition, we verify the update efficiency of the TA-tree, with the results also shown in Table 5. Note that “–” in Table 5 means that no update time is available, because the CIR+-tree is a CIR-tree with temporal information, and the CIR-tree does not support the update operation.
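The relative savings can be checked directly from the figures in Table 5. The short Python sketch below only restates that arithmetic; the dictionary layout and the helper name `ta_over_cir` are illustrative choices.

```python
# Values copied from Table 5, stored as (CIR+-tree, TA-tree).
TABLE5 = {
    "GN1": {"build_sec": (502.5, 172.1), "size_mb": (182.2, 53.4)},
    "GN2": {"build_sec": (3975.3, 2121.5), "size_mb": (2157.6, 597.1)},
}


def ta_over_cir(dataset, metric):
    """Return the TA-tree value as a fraction of the CIR+-tree value."""
    cir, ta = TABLE5[dataset][metric]
    return ta / cir


for dataset in TABLE5:
    for metric in ("build_sec", "size_mb"):
        print(f"{dataset} {metric}: {ta_over_cir(dataset, metric):.2f}")
```

On GN1 the construction-time ratio is about 0.34, in line with the "less than 1/2" statement; on GN2 it is about 0.53. The storage ratio is about 0.29 on GN1 and 0.28 on GN2.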

7.5 Results on TABSKQ

In this set of experiments, we report the results on TABSKQ processing. To give a comprehensive comparison, in addition to the CIR+-tree and the TA-tree, we also implement two baseline indices, namely, the Inverted R-tree and the time-travel inverted index (MI) [30]. An Inverted R-tree is an R-tree extended with an inverted file. The TABSKQ using Inverted R-tree (TR) algorithm first retrieves the objects that are within the search radius; then, it finds the objects containing all the query keywords and returns the k objects with the highest scores F(o). An MI captures the valid time and keywords of the objects, and its relaxation parameter is set to 100. The TABSKQ using MI (TM) algorithm finds the objects that are valid during the query time interval and contain all the query keywords, and then returns the k objects having the highest scores F(o).

Effect of $|q.key|$. We first investigate the influence of the number of query keywords, with the results shown in Fig. 8. The first observation is that TT is several orders of magnitude better than the other algorithms in terms of both query time and I/O cost. This is because TT is able to prune the search space using the textual and temporal information simultaneously. The second observation is that the query costs of TC grow, while those of the other algorithms drop, as the number of query keywords increases. The reason is that, with more keywords, TC needs to scan a larger space, whereas the other algorithms have fewer objects added to the candidate sets in the filter phase. Since the algorithms perform similarly on GN1 and GN2, we only present the performance of the algorithms on GN2 in the rest of this set of experiments.

Effect of k. Next, we study the impact of varying k (i.e., the number of the objects requested), with the results plotted in Fig. 9a. TT outperforms the other algorithms by a wide margin. On the other hand, the query costs of the algorithms are mostly stable, since the size of the candidate sets does not change much as k increases, which is consistent with our theoretical analysis.

Effect of $|q.t|$. Then, we inspect the influence of the length of the query time interval (i.e., $|q.t|$), with the results depicted in Fig. 9b. TT again outperforms the other algorithms significantly. In addition, the query costs of TT, TC, and TM increase with the growth of $|q.t|$, because more objects are valid during $q.t$ and more objects are evaluated. However, the performance of TR remains almost stable, since TR regards the objects within the search radius as the candidate objects, and the growth of $|q.t|$ does not affect the size of the candidate set in TR.
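The different sensitivity to the interval length can be illustrated with a small Python simulation of the two filter orders. This is a sketch on synthetic data, not a reimplementation of TR or TM: keyword filtering is omitted, validity is assumed to mean interval overlap, and the object generator and function names are hypothetical.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable


def make_object():
    """Synthetic object: distance to the query (km) and a valid-time interval (hours)."""
    start = random.uniform(0, 20)
    return {
        "dist_km": random.uniform(0, 40),
        "valid": (start, start + random.uniform(1, 4)),
    }


OBJECTS = [make_object() for _ in range(5000)]


def overlaps(valid, query_t):
    """Assumed validity test: the two intervals intersect."""
    return valid[0] < query_t[1] and query_t[0] < valid[1]


def spatial_first_candidates(objects, radius_km):
    """TR-style filter: everything inside the search radius is a candidate."""
    return [o for o in objects if o["dist_km"] <= radius_km]


def temporal_first_candidates(objects, query_t):
    """TM-style filter: everything valid during the query interval is a candidate."""
    return [o for o in objects if overlaps(o["valid"], query_t)]


spatial_sizes, temporal_sizes = [], []
for length in (1, 3, 5, 7, 9):  # interval lengths from Table 3
    query_t = (8.0, 8.0 + length)
    spatial_sizes.append(len(spatial_first_candidates(OBJECTS, 12)))
    temporal_sizes.append(len(temporal_first_candidates(OBJECTS, query_t)))

print("spatial-first :", spatial_sizes)
print("temporal-first:", temporal_sizes)
```

The spatial-first candidate set has the same size for every interval length, whereas the temporal-first candidate set grows as the interval widens, which mirrors the trend described for TR versus TT, TC, and TM.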

Effect of α. The fourth experiment explores the effect of parameter α on the performance of the algorithms. As illustrated in Fig. 9c, α has almost no impact on the algorithms, and the query performance is stable. This is because the major cost of the algorithms is retrieving the candidate objects, and the change of α does not affect the size of the candidate sets.

Fig. 7. Tuning the TA-tree for JTABSKQ.

TABLE 5
Indexing Efficiency of CIR+-Tree and TA-Tree

| | CIR+-tree | TA-tree |
|---|---|---|
| construction time (sec) for GN1 | 502.5 | 172.1 |
| construction time (sec) for GN2 | 3,975.3 | 2,121.5 |
| size (MB) of GN1 | 182.2 | 53.4 |
| size (MB) of GN2 | 2,157.6 | 597.1 |
| update time (sec) of GN1 | – | 3.49 |
| update time (sec) of GN2 | – | 8.24 |

Fig. 8. TABSKQ costs versus # of query keywords.

Effect of different query keyword frequencies. The fifth experiment studies the query performance when the query keyword frequencies vary, with the results plotted in Fig. 9d. Specifically, we implement three types of query keywords. “Frequent” query keywords are derived from the f most frequent keywords in the datasets. “Infrequent” keywords are derived from the keywords apart from the frequent keywords. “Random” keywords are generated in the way described in Section 7.1. It is observed that TT outperforms the other algorithms. Another observation is that TT performs the best when the query keywords are frequent keywords, and performs the worst with infrequent keywords. This is because TT retrieves the answer objects in one category. Although the frequency of every infrequent keyword is low, the sum of these frequencies is still high, and the number of the objects in the infrequent keyword category is large. In contrast, TM performs the best when the query keywords are infrequent keywords, since it only needs to evaluate the objects whose keywords cover the query keywords. When the query keywords are infrequent keywords, fewer objects need to be evaluated.

Effect of Search Radius. The sixth experiment evaluates the algorithms using the GN2 dataset with the search radius varying from 4 to 20 km. As illustrated in Fig. 9e, TT again performs the best since the other algorithms scan many irrelevant objects. Another observation is that the query costs of TT and TM are almost stable while those of TC and TR increase. This is because TT and TM first prune the objects with temporal and textual information, while TC and TR prune the objects with the spatial constraint. When the search radius increases, TC and TR have to scan more objects.

Effect of k for Time Stamp Query. The seventh experiment verifies the algorithms for the time stamp query when k changes. As mentioned in Section 5.3, the proposed algorithms can tackle the time stamp query with minor modifications. As shown in Fig. 9f, TT again performs the best, since it can prune the search space with textual and temporal information simultaneously. In contrast, the other algorithms have to prune the search space using spatial, textual, and temporal information separately.

Effect of k for Valid Time Distribution. In the eighth experiment, we investigate the algorithmic performance when the valid time of the objects follows the uniform distribution. As depicted in Fig. 9g, TT outperforms the other algorithms by several times, which is consistent with our analysis.

Effect of k for Ranking Function. In the ninth experiment, we evaluate the algorithms using the Linear Combination presented in Section 3. As shown in Fig. 9h, the ranking function does not influence the query performance much. This is because all the algorithms have to first find the candidate objects, and the change of ranking functions does not affect the cardinality of the candidate sets.

Effect of Object Category. Finally, we study the influence of the object category proposed in Section 5.1. We implement a new baseline method, i.e., the Enhanced TC (ETC) algorithm, which works as follows. It groups the objects into f + 1 categories in the same way as the TA-tree, and builds one CIR+-tree for each category of objects; that is, we build f + 1 CIR+-trees. Given a TABSKQ q, ETC first computes the ranking l of the query keywords using the method presented in Section 5.1. Then, it retrieves the answer objects with the TC algorithm in the l-th CIR+-tree, which contains all the objects having the l-th most frequent keyword. As shown in Table 6, TT achieves the best performance because it can prune the objects with textual and temporal information simultaneously. Also, ETC outperforms TC since it only evaluates the objects with the l-th most frequent keyword.
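The category idea behind ETC can be sketched in Python as follows. The actual categorization and ranking method is defined in Section 5.1 and is not reproduced here, so this is only one plausible reading: the rule that files an object under every frequent keyword it contains, the rule that routes a query to its rarest frequent keyword, the `"<infrequent>"` label, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def build_categories(objects, f):
    """Group object ids under the f most frequent keywords plus one catch-all.

    Assumption: an object is filed under every frequent keyword it contains,
    and under "<infrequent>" if it has at least one other keyword.
    """
    freq = Counter(kw for kws in objects.values() for kw in kws)
    ranked = [kw for kw, _ in freq.most_common(f)]  # rank 1 = most frequent
    categories = {kw: set() for kw in ranked}
    categories["<infrequent>"] = set()
    for oid, kws in objects.items():
        for kw in kws:
            categories[kw if kw in categories else "<infrequent>"].add(oid)
    return ranked, categories


def route(query_keywords, ranked, categories):
    """Pick the single category to search (hypothetical rule: rarest frequent keyword)."""
    frequent = [kw for kw in ranked if kw in query_keywords]
    return frequent[-1] if frequent else "<infrequent>"


# Toy data loosely based on the keywords in Table 4.
objects = {
    "o1": {"Burgers", "Fast Food", "Restaurants"},
    "o2": {"Burgers", "American"},
    "o7": {"Coffee&Tea", "Sandwich"},
    "o8": {"Coffee&Tea", "Beer bar", "Nightlife"},
    "o9": {"Coffee&Tea", "Breakfast"},
}
ranked, categories = build_categories(objects, f=2)

query = {"Burgers", "Fast Food"}
category = route(query, ranked, categories)
answers = sorted(oid for oid in categories[category] if query <= objects[oid])
print(category, answers)
```

In this toy setting, a query containing a frequent keyword only inspects the two objects filed under that keyword, while a query with only infrequent keywords falls back to the catch-all category, which holds all five objects. This is in the spirit of the earlier observation that the infrequent-keyword category is large.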

7.6 Results on JTABSKQ

In this set of experiments, we explore the efficiency of ITT and JTT for JTABSKQ. First, we study the influence of the number of queries. As depicted in Fig. 10, the query costs of ITT are proportional to the number of queries, and JTT outperforms ITT significantly. The reason is that ITT retrieves the results for the subqueries one by one, whereas JTT clusters similar queries into groups, which avoids traversing the TA-tree repeatedly. Also, JTT prunes the search space for each group of queries as a whole, which yields better performance.

Fig. 9. TABSKQ costs on GN2.

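TABLE 6
TABSKQ Costs versus # of Query Keywords

| # of query keywords | 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|---|
| TC-time (s) | 4.1 | 5.2 | 6.3 | 6.4 | 7 |
| ETC-time (s) | 1.5 | 1.4 | 1.2 | 1.1 | 1.1 |
| TT-time (s) | 0.46 | 0.32 | 0.28 | 0.27 | 0.27 |
| TC-I/O | 873 | 935 | 987 | 1,052 | 1,125 |
| ETC-I/O | 799 | 713 | 650 | 632 | 615 |
| TT-I/O | 623 | 606 | 559 | 337 | 310 |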


Last but not least, we inspect the impact of k on the ITT and JTT algorithms. As illustrated in Fig. 11, JTT outperforms ITT significantly. Another observation is that the query costs of ITT and JTT remain stable, because the candidate set does not change much as k grows. Note that the experimental results are consistent with our analysis presented in Section 6.

8 CONCLUSION

In this paper, we identify and solve a new type of spatial keyword query, namely, the time-aware Boolean spatial keyword query (TABSKQ). To address it, the CIR+-tree and its corresponding algorithm are first developed as a feasible solution. Then, we propose a new index structure, i.e., the TA-tree, together with the corresponding pruning strategies and algorithms, to process the TABSKQ efficiently. Furthermore, we extend our techniques to tackle an interesting TABSKQ variant, i.e., Joint TABSKQ (JTABSKQ). Finally, extensive experimental evaluation using real datasets demonstrates the effectiveness and efficiency of our proposed indices and algorithms. In the future, we intend to investigate the TABSKQ on a pre-defined road network. Also, we plan to explore why-not questions on the TABSKQ.

ACKNOWLEDGMENTS

This work was supported in part by the 973 Program of China under Grants No. 2015CB352502 and 2014CB340303, NSFC Grants No. 61522208, 61379033, 61472348, 61502021, 61328202, 61300031, and 61532004, the NSFC-Zhejiang Joint Fund Grant No. U1609217, the Hong Kong RGC Project 16202215, and the Microsoft Research Asia Collaborative Grant and NSFC Guang Dong Grant No. U1301253.

REFERENCES

[1] G. Cong, C. S. Jensen, and D. Wu, “Efficient retrieval of the top-k most relevant spatial web objects,” Proc. VLDB Endowment, vol. 2, pp. 337–348, 2009.

[2] J. Rocha-Junior, O. Gkorgkas, S. Jonassen, and K. Norvag, “Efficient processing of top-k spatial keyword queries,” in Proc. 12th Int. Conf. Advances Spatial Temporal Databases, 2011, pp. 205–222.

[3] D. Wu, G. Cong, and C. S. Jensen, “A framework for efficient spatial web object retrieval,” Int. J. Very Large Data Bases, vol. 21, no. 6, pp. 797–822, 2012.

[4] X. Cao, G. Cong, and C. S. Jensen, “Retrieving top-k prestige-based relevant spatial web objects,” Proc. VLDB Endowment, vol. 3, pp. 373–384, 2010.

[5] G. Li, J. Feng, and J. Xu, “DESKS: Direction-aware spatial keyword search,” in Proc. IEEE Int. Conf. Data Eng., 2012, pp. 474–485.

[6] A. Cary, O. Wolfson, and N. Rishe, “Efficient and scalable method for processing top-k spatial boolean queries,” in Proc. Int. Conf. Sci. Statist. Database Manage., 2010, pp. 87–95.

[7] I. D. Felipe, V. Hristidis, and N. Rishe, “Keyword search on spatial databases,” in Proc. IEEE Int. Conf. Data Eng., 2008, pp. 656–665.

[8] S. Luo, Y. Luo, S. Zhou, G. Cong, J. Guan, and Z. Yong, “Distributed spatial keyword querying on road networks,” in Proc. Int. Conf. Extending Database Technol., 2014, pp. 235–246.

[9] D. Wu, M. L. Yiu, G. Cong, and C. S. Jensen, “Joint top-k spatial keyword query processing,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 10, pp. 1889–1903, Oct. 2012.

[10] B. Yao, M. Tang, and F. Li, “Multi-approximate-keyword routing in GIS data,” in Proc. ACM SIGSPATIAL Int. Conf. Advances Geographic Inf. Syst., 2011, pp. 201–210.

[11] X. Cao, G. Cong, C. S. Jensen, and B. C. Ooi, “Collective spatial keyword querying,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2011, pp. 373–384.

[12] C. Zhang, Y. Zhang, W. Zhang, and X. Lin, “Inverted linear quadtree: Efficient top-k spatial keyword search,” in Proc. IEEE Int. Conf. Data Eng., 2013, pp. 901–912.

[13] T. Lee, J. Park, and S. Lee, “Processing and optimizing main memory spatial-keyword queries,” Proc. VLDB Endowment, vol. 9, pp. 132–143, 2015.

[14] Y. Tao and C. Sheng, “Fast nearest neighbor search with keywords,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 4, pp. 878–888, Apr. 2014.

[15] D. Zhang, K. Tan, and A. Tung, “Scalable top-k spatial keyword search,” in Proc. Int. Conf. Extending Database Technol., 2013, pp. 359–370.

[16] L. Chen, G. Cong, C. S. Jensen, and D. Wu, “Spatial keyword query processing: An experimental evaluation,” Proc. VLDB Endowment, vol. 6, pp. 217–228, 2013.

[17] X. Cao, et al., “Spatial keyword querying,” in Proc. Int. Conf. Conceptual Model., 2012, pp. 16–29.

[18] D. Choi, J. Pei, and X. Lin, “Finding the minimum spatial keyword cover,” in Proc. IEEE Int. Conf. Data Eng., 2016, pp. 685–696.

[19] Y. Gao, J. Zhao, B. Zheng, and G. Chen, “Efficient collective spatial keyword query processing on road networks,” Trans. Intell. Transport. Syst., vol. 17, no. 2, pp. 469–480.

[20] L. Guo, J. Shao, H. Aung, and K. Tan, “Efficient continuous top-k spatial keyword queries on road networks,” GeoInformatica, vol. 19, no. 1, pp. 29–60, 2015.

[21] D. Wu, M. L. Yiu, C. S. Jensen, and G. Cong, “Efficient continuously moving top-k spatial keyword query processing,” in Proc. IEEE Int. Conf. Data Eng., 2011, pp. 541–552.

[22] Y. Gao, X. Qin, B. Zheng, and G. Chen, “Efficient reverse top-k boolean spatial keyword queries on road networks,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 5, pp. 1205–1218, May 2015.

[23] J. Lu, Y. Lu, and G. Cong, “Reverse spatial and textual k nearest neighbor search,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2011, pp. 349–360.

[24] J. Zhao, Y. Gao, G. Chen, C. S. Jensen, R. Chen, and D. Cai, “Reverse top-k geo-social keyword queries in road networks,” in Proc. IEEE Int. Conf. Data Eng., 2017, pp. 387–398.

[25] X. Wang, Y. Zhang, W. Zhang, X. Lin, and W. Wang, “AP-Tree: Efficiently support continuous spatial-keyword queries over stream,” in Proc. IEEE Int. Conf. Data Eng., 2015, pp. 1107–1118.

[26] L. Chen, G. Cong, X. Cao, and K.-L. Tan, “Temporal spatial-keyword top-k publish/subscribe,” in Proc. IEEE Int. Conf. Data Eng., 2015, pp. 255–266.

[27] L. Chen, J. Xu, X. Lin, C. S. Jensen, and H. Hu, “Answering why-not spatial keyword top-k queries via keyword adaption,” in Proc. IEEE Int. Conf. Data Eng., 2016, pp. 697–708.

[28] R. Campos, G. Dias, A. M. Jorge, and A. Jatowt, “Survey of temporal information retrieval and related applications,” ACM Comput. Surveys, vol. 47, no. 2, 2014, Art. no. 15.

[29] K. Berberich, S. Bedathur, T. Neumann, and G. Weikum, “FluxCapacitor: Efficient time-travel text search,” in Proc. Int. Conf. Very Large Data Bases, 2007, pp. 1414–1417.

Fig. 11. JTABSKQ costs versus k.

Fig. 10. JTABSKQ costs versus # of queries.


[30] A. Anand, S. Bedathur, K. Berberich, and R. Schenkel, “Index maintenance for time-travel text search,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2012, pp. 235–243.

[31] J. He and T. Suel, “Faster temporal range queries over versioned text,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2011, pp. 565–574.

[32] E. Manica, C. F. Dorneles, and R. Galante, “Handling temporal information in web search engines,” in ACM SIGMOD Rec., vol. 41, pp. 15–23, 2012.

[33] R. Ahuja, N. Armenatzoglou, D. Papadias, and G. J. Fakas, “Geo-social keyword search,” in Proc. Int. Conf. Advances Spatial Temporal Databases, 2015, pp. 431–450.

[34] J. Rocha-Junior and K. Norvag, “Top-k spatial keyword queries on road networks,” in Proc. Int. Conf. Extending Database Technol., 2012, pp. 168–179.

[35] T. Joachims, “A statistical learning model of text classification for support vector machines,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2001, pp. 128–136.

[36] T. M. Chan, “A dynamic data structure for 3-d convex hulls and 2-d nearest neighbor queries,” in Proc. Annu. ACM-SIAM Symp. Discrete Algorithms, 2006, pp. 1193–1202.

[37] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. Cambridge, MA, USA: MIT Press, 2009.

[38] S. Vaid, B. Jones, H. Joho, and M. Sanderson, “Spatio-textual indexing for geographical search on the web,” in Proc. Int. Conf. Advances Spatial Temporal Databases, 2005, pp. 218–235.

Gang Chen received the PhD degree in computer science from Zhejiang University, China. He is currently a professor in the College of Computer Science, Zhejiang University, China. He has successfully led the investigation in research projects which aim at building China’s indigenous database management systems. His research interests range from relational database systems to large-scale data management technologies. He is a member of the ACM and a senior member of the CCF.

Jingwen Zhao received the BS degree in computer science from Northeast University, China, in 2013. He is currently working toward the PhD degree in the College of Computer Science, Zhejiang University, China. His research interests include geo-social data processing, spatio-textual data processing, spatial-temporal databases, and database usability.

Yunjun Gao received the PhD degree in computer science from Zhejiang University, China, in 2008. He is currently a professor in the College of Computer Science, Zhejiang University, China. His research interests include spatio-temporal databases, metric and incomplete/uncertain data management, geo-social data processing, and database usability. He is a member of the ACM and the IEEE, and a senior member of the CCF.

Lei Chen received the PhD degree in computer science from the University of Waterloo, Canada, in 2005. He is now a professor in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include crowdsourcing on social networks, uncertain and probabilistic databases, Web data management, multimedia and time series databases, and privacy. He is a senior member of the IEEE.

Rui Chen received the BS degree in digital media technology from Northeast University, China, in 2015. He is currently working toward the MS degree in the College of Computer Science, Zhejiang University, China. His research interests include spatio-textual data processing.
