

    Authenticated Subgraph Similarity Search in Outsourced Graph Databases

    Yun Peng, Zhe Fan, Byron Choi, Jianliang Xu, and Sourav S. Bhowmick

    Abstract: Subgraph similarity search is used in graph databases to retrieve graphs whose subgraphs are similar to a given query graph. It has proven successful in a wide range of applications, including bioinformatics and chem-informatics. Due to the cost of providing efficient similarity search services on ever-increasing graph data, database outsourcing is an appealing solution for database owners. Unfortunately, query service providers may be untrusted or compromised by attacks. To our knowledge, no studies have been carried out on the authentication of the search. In this paper, we propose authentication techniques that follow the popular filtering-and-verification framework. We propose an authentication-friendly metric index called GMTree. Specifically, we transform the similarity search into a search in a graph metric space and derive small verification objects (VOs) to be transmitted to query clients. To further optimize GMTree, we propose a sampling-based pivot selection method and an authenticated version of MCS computation. Our comprehensive experiments verified the effectiveness and efficiency of our proposed techniques.

    Index Terms: Subgraph similarity search, query authentication, outsourced database

    1 INTRODUCTION

    Graphs have been used widely to model complex data in many emerging applications, including proteins in biology, compounds in chemistry, attributed graphs in computer vision, ecology and web topology. In these real applications, subgraph similarity search (or simply similarity search) is a frequently used query, as there may not be an exact match for a user-specified search (e.g., [1], [2], [3], [4], [5], [6], [7], [8], [9]). Similarity search can be formally described as follows. Given a query graph $q$, a graph database $D$ and a threshold (radius) $t$, retrieve the graphs in $D$ whose similarity distances from $q$ are not greater than $t$. For example, in chemistry, it is well known that chemical structures discovered by the popular virtual screening method may contain laboratory errors. A compound being searched for may not match any compound in the database. Hence, practical databases (e.g., PubChem [10]) often return graphs that are similar to the query.

    Similarity search is known to be an NP-hard problem. The owners of graph databases may lack the IT resources and expertise to provide efficient searches of their databases. For example, we issued a small query for a benzene structure to the prototype of a recent chemical database [11] and the query took 7.8 minutes. Such performance may not be acceptable for many applications. Further, graph data is growing explosively in volume. For instance, recent reports [10] indicate that from 2006 to 2012, PubChem's compound data increased from 57 to 141 GB. It would be inefficient to process such a large amount of data with a commodity machine.

    For the reasons mentioned above, graph database outsourcing is appealing to database owners. Specifically, voluminous data is delegated to a powerful third-party service provider (SP). The client may submit queries to the SP as if accessing a utility, and the SP provides query processing services on the data owner's behalf. Data outsourcing has been adopted in many industry sectors. For instance, in drug engineering, many commercial SPs [12], [13], [14], [15] support outsourcing of pharma databases. Drug laboratories may then focus on the curation of their data.

    Unfortunately, the service provider may be untrusted and/or compromised by attacks, and clients may consequently receive tampered results. For instance, Fig. 1 shows an outsourced molecular graph database $D$, a query molecule $q$ and a distance threshold $t = 0.25$. (The edge labels are omitted for brevity of presentation.) Assume that $G_4$, $G_6$ and $G_7$ are the answers that should be returned as the query result. The service provider may deliberately return incorrect results (e.g., $G_3$), distort $t$ to $0.1$, or return partial results (e.g., only $G_4$). These threats significantly limit the practicality of graph database outsourcing. An authentication mechanism is thus necessary.

    The majority of works on subgraph similarity search adopt a filtering-and-verification framework (e.g., [1], [2], [3], [4], [5], [6], [8], [9]), which consists of two key phases. First, in the filtering phase, indexes are used to prune (or filter) the data graphs that are certainly not answers. The remaining graphs form a candidate set (a superset of the answers). Second, in the verification phase, each candidate is checked by computing its distance from the query to verify whether it is an answer. Despite the popularity of the framework, to the best of our knowledge, its authentication has not yet been studied, and this paper takes the first step toward an authenticated framework for the search.

    Y. Peng is with the Research Center of Big Data Application, Qilu University of Technology, Jinan, China, and the Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong. E-mail: [email protected].

    Z. Fan, B. Choi, and J. Xu are with the Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong. E-mail: {zfan, bchoi, xujl}@comp.hkbu.edu.hk.

    Manuscript received 18 Jan. 2013; revised 5 Mar. 2014; accepted 1 Apr. 2014. Date of publication 10 Apr. 2014; date of current version 1 June 2015. Recommended for acceptance by J. Pei. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2014.2316818

    1838 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 7, JULY 2015

    1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    To facilitate the technical discussions, we briefly list the main steps of query authentication [16]. The data owner publishes its database, index and signature to an SP. The SP processes queries from a client and returns to the client both the query result and a verification object (VO), which often encodes query processing traces such as index traversals. Using the query result and the VO, the client reconstructs the digest of the database/index and compares it with the signature of the data owner to authenticate the query result.

    As the filtering-and-verification framework is not specially designed for query authentication, we note that a naïve application of existing query authentication techniques leads to at least three problems. First, no previous index specifically considers whether the candidate graphs are located together in the graph database, which directly affects the size of the VO needed. For instance, candidate and non-candidate graphs may be stored alternately in the database; in this scenario, each candidate graph needs an item in the VO to authenticate that no candidate has been missed. As the number of candidate graphs for similarity search can be large, the VO for authenticating them can also be large. Second, one performance bottleneck at the client side is the distance computation on large candidate graphs, since the distance computation time is exponential in the graph size. Unfortunately, most existing approaches index similarity search by features or subgraphs (e.g., [1], [2], [4], [5], [8], [9]). The larger a graph is, the more features/subgraphs it contributes to the index. Thus, large graphs are often included in the candidate set. Third, clients are required to perform the costly subgraph similarity computation numerous times in order to authenticate the processing traces at the SP. Since such computation has already been done once at the SP, it is inefficient for the client to redo it from scratch.

    In this paper, we propose an authentication-friendly metric-based index, called Graph Metric Tree (GMTree), to address the aforementioned technical challenges. Its novelty lies mainly in the authentication techniques associated with GMTree. Specifically, for the first problem, we transform the subgraph similarity search into a search of a graph metric space and exploit the triangle inequality. Traditional metric indexes (such as [17], [18], [19]) can then be adapted to index graphs. GMTree is designed based on the vp-tree [19], while other metric indexes can be adopted with minor modifications. As candidate graphs are often located together, our GMTree needs a notably smaller VO than a baseline derived from a previous work Grafil (denoted as Grafil), as verified by our experiments. Moreover, subgraphs of data graphs can be pivots of GMTree. In contrast, the pivots of previous metric indexes were atomic data. We propose a pivot selection technique that exploits this property. Our experiments show that our pivot selection method reduces the query time and VO size by a factor of about 6.5 and 2, respectively. For the second problem, we derive an upper bound for pruning large non-answer graphs. We exploit the triangle inequality of a metric function, and there is often a large graph distance between large non-answers and small queries. In particular, our experiments show that the largest 5 percent of GMTree's candidate graphs are 1.5 times smaller than those of Grafil, and the client's time spent on authenticating candidate graphs is thus reduced by up to 35 percent. For the third problem, we propose an authenticated version of the state-of-the-art MCS computation technique [20]. When the SP determines the MCSs between the query and indexed graphs, it records some hints in the VO. The correctness of the hints can be authenticated by a scan on the hints. This significantly reduces the authentication time at the client. Our experiments show that the VO overhead of our authenticated MCS computation is about 10 KB, but it reduces the authentication time at the client by about 50 percent.

    The main contributions can be summarized as follows.

    • We cast the subgraph similarity search into a similarity search in a graph metric space.

    • We adopt a metric index to form GMTree to support efficient authentication. We propose the VO definition on GMTree, the VO construction and its authentication.

    • We propose a sampling-based pivot selection method to optimize the GMTree construction.

    • We develop an authenticated subgraph similarity computation method that significantly reduces the client's authentication time.

    • We conduct an extensive experimental evaluation of the proposed techniques. Our experiments on a real data set, AIDS, show that the authentication time of GMTree makes up less than 20 percent of its query time. The VO generated from GMTree is kept under 30 KB, less than 3 percent of that of an authentication method extended from Grafil with basic authentication techniques.

    Organization. The rest of the paper is organized as follows: Section 2 discusses the related work. Section 3 presents the background and problem statement. The metric-based pruning is studied in Section 4. We present the GMTree index in Section 5 and its authentication techniques in Section 6. Section 7 details authenticated MCS computation and pivot selection techniques. Our experiments are presented in Section 8. Section 9 concludes the paper. All proofs are presented in the appendix.

    2 RELATED WORK

    There have been many studies on the authentication of various kinds of queries [21], [22], [23], [24], [25], [26], but these queries are significantly different from subgraph similarity search. Despite recent interest in graph databases, the related work on their authentication is very limited. Yiu et al. [27] propose to authenticate shortest path queries on road networks. Goodrich et al. [28] study the authentication of path and connectivity queries on general graphs. Subgraph similarity search is different from and more computationally costly than these queries. Moreover, the ordering of data objects in road networks can be precomputed offline by, for example, network-based distance. Such an ordering is absent in graph data. Kundu and Bertino [29], [30] verify whether a given subgraph/subtree is in fact a subgraph/subtree of a given large data graph/tree without leakage of structural information. However, we study a large graph database instead of one large graph/tree. Martel et al. [31] propose a search directed acyclic graph (DAG), which is a generic model for authenticating a broad class of data structures. However, subgraph similarity search is more than a DAG search.

    Fig. 1. An example of an outsourced graph database.

    Another related topic is subgraph similarity search. He and Singh [6] propose a CTree to index the hierarchical graph closure, but CTree's heuristic method only supports approximate similarity search. Williams et al. [5] and Tian et al. [8] propose graph decomposition methods to enumerate all unique and k-size subgraphs of data graphs, respectively; if adopted, they may lead to large VOs. Yan et al. [1] propose an MCS-based similarity search method, which prunes non-candidate graphs by the count of structural features. Mongiovi et al. [9] extend it by incorporating the identity of features for more filtering power. Jiang et al. [32] and Shang et al. [2] address the practical interests of connected subgraphs. Recently, Yuan et al. [4] study the MCS-based similarity search on probabilistic graphs. Their similarity measures are based on the maximum common edge subgraph (MCES). However, they are not metric functions. Zhu et al. [3] propose an MCES-based similarity metric measure. Their similarity value is dominated by the answer graph size when the answer graph is much larger than the query graph. In contrast, our similarity measure is defined relative to the query graph size. Further, we note that recent works adopt similarity based on the maximum common induced subgraph, especially for biological and chemical applications [33]. There are some existing works on large graphs. Tian and Patel [34] propose a neighborhood-based index, but they only support approximate search. Khan et al. [35] propose a novel neighborhood-based similarity measure to reduce isomorphism testing. However, they require all nodes of a query graph to be matched to a data graph. The similarity measure is not a metric either. There is another stream of works (e.g., [6], [36]) on exact subgraph queries following the filtering-and-verification framework, but there has been no work on its authentication. Moreover, similarity search is more flexible than exact queries.

    The MCS computation problem is mainly studied for the maximum common induced subgraph (MCIS), as reported in [37], [38], [39]. Induced subgraphs are also widely used in recent subgraph similarity search works (e.g., [2], [5], [34], [40]). We adopt MCIS for its metric properties [41].

    For similarity search in metric spaces, several search trees have been proposed to index the metric space (e.g., [19] and [17]). However, these indexes are designed neither for indexing graphs nor for their authentication. As noted in Section 1, several technical challenges must be addressed when they are adopted. Other works (e.g., [42]) propose to cast metric spaces into low-dimensional vector spaces and conduct similarity search there. However, it is not clear how graphs can be represented as vectors for similarity search and authentication.

    3 BACKGROUND AND PROBLEM STATEMENT

    This section first provides some background and the problem statement, and then presents an overview of our approach.

    3.1 Backgrounds

    3.1.1 Background for Querying Graphs

    In this paper, we study undirected labeled data graphs, or simply graphs in the subsequent discussions. A graph is denoted as $G = (V, E, \Sigma, \ell)$, where $V$ is a set of nodes, $E \subseteq V \times V$ is a set of edges, $\Sigma$ is a set of labels and $\ell$ is a function mapping a vertex or edge to a label. The size of $G$ is defined as the number of vertices in $G$, denoted as $|G|$. Following a popular stream of works, we consider a graph database as a large collection of graphs having hundreds of nodes, which are common in chemical and biological databases.

    Isomorphism and common induced subgraph. Two graphs $G$ and $G'$, with label functions $\ell$ and $\ell'$, are isomorphic if there is a bijection $f$ between $V$ and $V'$ such that for every vertex $u \in V$, $f(u) \in V'$ and $\ell(u) = \ell'(f(u))$, and for every edge $(u_1, u_2) \in E$, $(f(u_1), f(u_2)) \in E'$ and $\ell(u_1, u_2) = \ell'(f(u_1), f(u_2))$. A common (induced) subgraph between two graphs $G(V, E)$ and $G'(V', E')$ is an (induced) subgraph $S$ of $G$ that is isomorphic to an (induced) subgraph $S'$ of $G'$. In general, the common subgraph and the common induced subgraph could be different. In this paper, the maximum common induced subgraph is referred to as the MCS between them, denoted as $mcs(G, G')$. Note that $mcs(G, G')$ is not necessarily unique for two given graphs $G$ and $G'$. Moreover, $mcs(G, G')$ can be connected or disconnected.
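    The difference between a common subgraph and a common induced subgraph can be checked on toy inputs with an exhaustive search. The following Python sketch is for illustration only and is not the MCS algorithm adopted in this paper: it assumes vertex labels only (edge labels are ignored), uses a hypothetical representation (a label dictionary plus a set of `frozenset` edges), and runs in exponential time.

```python
from itertools import combinations, permutations

def mcis_size(labels1, edges1, labels2, edges2):
    """Size of a maximum common induced subgraph, found by brute force."""
    nodes1, nodes2 = list(labels1), list(labels2)
    for k in range(min(len(nodes1), len(nodes2)), 0, -1):
        for subset in combinations(nodes1, k):
            for image in permutations(nodes2, k):
                labels_ok = all(labels1[u] == labels2[v]
                                for u, v in zip(subset, image))
                # Induced: edges AND non-edges must agree on every vertex pair.
                pairs_ok = all(
                    (frozenset((subset[i], subset[j])) in edges1)
                    == (frozenset((image[i], image[j])) in edges2)
                    for i in range(k) for j in range(i + 1, k)
                )
                if labels_ok and pairs_ok:
                    return k
    return 0

# Hypothetical toy graphs: q is the path C-C-O, g is a triangle on the same labels.
q_labels = {1: "C", 2: "C", 3: "O"}
q_edges = {frozenset((1, 2)), frozenset((2, 3))}
g_labels = {1: "C", 2: "C", 3: "O"}
g_edges = {frozenset((1, 2)), frozenset((2, 3)), frozenset((1, 3))}

print(mcis_size(q_labels, q_edges, g_labels, g_edges))  # 2
```

    In this toy input, the whole path is a common subgraph of three vertices, but it is not induced in the triangle, so the maximum common induced subgraph has only two vertices.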

    Example 3.1. Consider the graph $G_7$ and the query $q$ in our running example, Fig. 1. Fig. 2a presents one common subgraph between $G_7$ and $q$, $S = \{a, b, c, e\}$ (highlighted by bold circles and lines). However, $S$ is not a common induced subgraph, because $S$ is not an induced subgraph of $q$, as $(e, b) \in q$. In comparison, two common induced subgraphs, the connected $S' = \{a, b, c\}$ and the disconnected $S'' = \{a, c, e\}$, are presented in Fig. 2b (highlighted by bold circles and lines, and gray filling, respectively). Since $G_7$ and $q$ have no common induced subgraph of size larger than 3, $S'$ and $S''$ are two MCSs of them.

    Fig. 2. Common subgraph versus common induced subgraph.

    Graph similarity. There have been various similarity definitions of graphs in the literature. Among the existing definitions, only the MCS-based graph similarity measure is a metric distance function. Specifically, the graph distance between a query graph $q$ and a data graph $G$ is defined as follows:

    Definition 3.1. Given a query graph $q$ and a data graph $G$, the graph distance between $q$ and $G$ is

    $$d(q, G) = 1 - \frac{|mcs(q, G)|}{\max\{|q|, |G|\}}. \qquad (1)$$

    This graph distance is a metric function [41], which satisfies the following properties for any graphs $G$, $G'$ and $G''$:

    • $d(G, G') \geq 0$ (positiveness);
    • $d(G, G') = 0$ iff $G$ and $G'$ are isomorphic;
    • $d(G, G') = d(G', G)$ (symmetry);
    • $d(G, G') + d(G', G'') \geq d(G, G'')$ (triangle inequality).

    Subgraph similarity. Graph distance is related to the ratio between $|mcs(q, G)|$ and the larger of $|q|$ and $|G|$. However, $|G|$ is often much larger than $|q|$. In this case, $d(q, G)$ produces a number close to 1, even when $q$ itself is already a subgraph of $G$. Hence, users may often be more interested in the relative size of $mcs(q, G)$ with respect to $q$, e.g., [1], [2]. Computing the relative size of $mcs(q, G)$ with respect to $q$ is equivalent to comparing the similarity between $q$ and $G$'s subgraphs whose sizes are not larger than $q$. This is often referred to as subgraph distance, presented in Definition 3.2.

    Definition 3.2. The subgraph distance between a query graph $q$ and a data graph $G$ is

    $$d_s(q, G) = 1 - \frac{|mcs(q, G)|}{|q|}. \qquad (2)$$

    The number of missing vertices of $q$, $s = |q| - |mcs(q, G)|$, can be readily derived from the subgraph distance of $q$ and $G$. Unfortunately, subgraph distance is not a metric function.

    Example 3.2. We use Fig. 2b to show the difference between the graph distance and the subgraph distance. In Fig. 2b, since $|mcs(q, G_7)| = 3$, we have $d(q, G_7) = 1 - 3/5 = 2/5$ and $d_s(q, G_7) = 1 - 3/4 = 1/4$.
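    The arithmetic of Example 3.2 can be reproduced with a few lines of Python. This is a minimal sketch that takes the MCS size as given (computing it is the expensive part); the function names are hypothetical, and the sizes $|q| = 4$ and $|G_7| = 5$ are read off the fractions in the example.

```python
from fractions import Fraction

def graph_distance(mcs_size, q_size, g_size):
    """Definition 3.1: d(q, G) = 1 - |mcs(q, G)| / max{|q|, |G|}."""
    return 1 - Fraction(mcs_size, max(q_size, g_size))

def subgraph_distance(mcs_size, q_size):
    """Definition 3.2: d_s(q, G) = 1 - |mcs(q, G)| / |q|."""
    return 1 - Fraction(mcs_size, q_size)

# Example 3.2: |mcs(q, G7)| = 3, |q| = 4, |G7| = 5.
print(graph_distance(3, 4, 5))   # 2/5
print(subgraph_distance(3, 4))   # 1/4
```

    Exact fractions are used here so that comparisons against a threshold are not affected by floating-point rounding.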

    Definition 3.3 (Subgraph Similarity Search Problem). Given a graph database $D = \{G_1, G_2, \ldots, G_n\}$, a query graph $q$ and a threshold $t$, the subgraph similarity search problem is to retrieve all the graphs $G_i \in D$ such that $d_s(q, G_i) \leq t$.

    3.1.2 Background for Query Authentication

    One-way hash function. With a one-way hash function, denoted as $h$, it is easy to compute the hash value $h(m)$ of a given pre-image $m$, but hard to invert the hash value of a random pre-image. Examples of such functions are MD5 and SHA. In this paper, we often use the term digest to refer to a hash value.

    Public-key digital signature scheme. The scheme involves a public key and a private key. Only the data signer has the private key and can generate digital signatures of messages. The public key is known to everyone, and the public may use it to verify the integrity of the signatures. For instance, RSA is a popular public-key digital signature scheme.

    Merkle hash tree (MHT). The seminal work of the Merkle Hash Tree [43] has been adopted in many authentication works. An MHT is a binary search tree with hash values associated with its nodes. A leaf node contains a data value $dv$, and the hash $h(dv)$ is associated with it. For an internal node, the MHT associates the hash $h(h_1 \| h_2)$ of the hashes $h_1$ and $h_2$ of its children, where $\|$ denotes concatenation. A data owner signs the hash of the root of the MHT. In a nutshell, given a range query $[a, b]$, the query service provider transmits to clients (i) the answers in $[a, b]$ and (ii) the hash values of the index nodes at the boundary of the search for $[a, b]$ in the MHT. The answers are complete only if the hash of the root computed by the client agrees with the signature provided by the data owner.
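    The range-query protocol on an MHT can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not part of the paper: SHA-256 stands in for $h$, the tree is a balanced binary tree over a sorted, non-empty list of integers, the "signature" is just the raw root digest (no public-key step), and all function names are hypothetical. The client here only checks the root digest; a full client would also check that the disclosed leaves are contiguous and that the boundary values lie outside $[a, b]$.

```python
import hashlib
from bisect import bisect_left, bisect_right

def H(data: bytes) -> bytes:
    """One-way hash function (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).digest()

def subtree_digest(values, lo, hi):
    """Digest of the subtree over the sorted slice values[lo:hi] (lo < hi)."""
    if hi - lo == 1:
        return H(str(values[lo]).encode())                # leaf: h(dv)
    mid = (lo + hi) // 2
    return H(subtree_digest(values, lo, mid)
             + subtree_digest(values, mid, hi))           # h(h1 || h2)

def build_vo(values, lo, hi, first, last):
    """SP side: disclose leaves first..last, replace other subtrees by digests."""
    if hi <= first or lo > last:
        return ("digest", subtree_digest(values, lo, hi))
    if hi - lo == 1:
        return ("leaf", values[lo])
    mid = (lo + hi) // 2
    return ("node", build_vo(values, lo, mid, first, last),
                    build_vo(values, mid, hi, first, last))

def answer_range(values, a, b):
    """SP side: VO for [a, b], with one boundary value disclosed on each side."""
    i, j = bisect_left(values, a), bisect_right(values, b) - 1
    return build_vo(values, 0, len(values),
                    max(i - 1, 0), min(j + 1, len(values) - 1))

def verify(vo, signed_root, a, b):
    """Client side: rebuild the root digest; return answers, or None on mismatch."""
    disclosed = []
    def rebuild(node):
        if node[0] == "digest":
            return node[1]
        if node[0] == "leaf":
            disclosed.append(node[1])
            return H(str(node[1]).encode())
        return H(rebuild(node[1]) + rebuild(node[2]))
    if rebuild(vo) != signed_root:
        return None
    return [v for v in disclosed if a <= v <= b]

values = [1, 3, 5, 7, 9, 11]                     # sorted data held by the owner
root = subtree_digest(values, 0, len(values))    # the owner would sign this
vo = answer_range(values, 4, 8)
print(verify(vo, root, 4, 8))                    # [5, 7]
```

    The VO contains the two answers, the two boundary values 3 and 9, and one digest for each pruned subtree, which is why answers that are stored close together lead to small VOs.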

    3.2 Problem Formulation

    System model. We assume the current state-of-the-art system model of database outsourcing [16]. The system model comprises three parties: the data owner DO, the service provider SP and the client.

    The DO maintains a database $D$. To support subgraph similarity search on $D$, the DO or the SP builds an index $T$ of $D$. The DO signs the root of $T$, and outsources $D$, $T$ and the signature to the SP.

    The SP answers queries of a client on the DO's behalf and returns the result set $RS$ to the client. Since the SP may not be trusted, the SP is required to return the results $RS$ together with their verification objects (VOs).

    The client uses $RS$ and the VO to synthesize the digest of the root of $T$. By using the public key of the DO, the client verifies whether the digest agrees with the signature provided by the DO. In addition, the client is required to verify the following.

    (i) Soundness: for every $G \in RS$, $G$ is an answer and $G \in D$;
    (ii) Completeness: for every $G \notin RS$, $G$ is not an answer.

    Threat model. In our model, the SP is not always trustworthy. It may be a potential adversary or subverted by attackers. In either case, the SP may alter the data or the index, tamper with the similarity threshold, return partial answers, or abort the computation. We consider an authentication framework secure if attacking it is as hard as inverting a one-way hash function or breaking the public-key digital signature scheme.

    Problem Statement. Given the above system and threat models, we seek an efficient authentication mechanism where the client can issue a subgraph similarity search and verify the soundness and completeness of the results returned by a query service provider.

    3.3 Query Paradigm and Overview of Our Method

    A popular subgraph similarity search paradigm is the filtering-and-verification framework [1], [2], [3], [4], [5], [6], [8], [9]. As an example, Fig. 3 shows an overview of the query processing of Grafil [1]. The query $q$ and distance threshold $t$ are first processed with the database in the filtering phase. Grafil filters non-answer graphs by its index and obtains a candidate set (i.e., a superset of the answers). In the verification phase, it computes the subgraph distance to $q$ for each candidate graph. The candidates whose distances to $q$ do not exceed $t$ are the query answers.
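    The two phases can be summarized by a generic Python sketch. It is not Grafil's implementation: `lower_bound` is a hypothetical cheap function that never exceeds the true distance (standing in for an index), `distance` is the expensive exact computation, and integers with absolute difference are used as stand-in objects so that the example runs without an MCS routine.

```python
def filter_and_verify(db, q, t, lower_bound, distance):
    """Return (candidates, answers) for the threshold query (q, t)."""
    # Filtering: discard objects whose cheap lower bound already exceeds t.
    candidates = [g for g in db if lower_bound(q, g) <= t]
    # Verification: run the costly exact distance on the candidates only.
    answers = [g for g in candidates if distance(q, g) <= t]
    return candidates, answers

# Stand-in data: integers, exact distance |q - g|, and a coarser lower bound.
db = [1, 4, 6, 9, 15]
exact = lambda q, g: abs(q - g)
bound = lambda q, g: abs(q - g) // 2      # always <= exact(q, g)

candidates, answers = filter_and_verify(db, 5, 2, bound, exact)
print(candidates)   # [1, 4, 6, 9]
print(answers)      # [4, 6]
```

    Because the bound never overestimates the distance, no answer is lost in filtering, and the candidate set is a superset of the answers.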

    We design our authenticated subgraph similarity search technique by following the well-received filtering-and-verification framework. Fig. 4 gives an overview of our authenticated subgraph similarity search techniques.

    Fig. 3. Sketch of the filtering-and-verification framework.

    (1) The data owner DO indexes the graph database $D$ with the GMTree $T$ (to be detailed in Section 5). The DO signs $T$ and passes the database $D$ and the index $T$ to the service provider SP together with the signature $s_{root}$.

    (2) The client issues a query graph $q$ with a distance threshold $t$ to the SP.

    (3) In the filtering phase, the SP performs a traversal on the GMTree. Subtrees that do not contain any answers are pruned by using the condition derived from graph distance. The data graphs that are indexed by the remaining subtrees form the candidate set $C_q$.

    (4) For the sake of authentication, the SP puts the visited nodes into $H_v$ and the digests of the pruned subtrees into $H_b$, where $H_v$, $H_b$ and $T$'s signature $s_{root}$ form the VO for the index, denoted as $VO_{index}$.

    (5) In the verification phase, for each $G \in C_q$, the SP applies the authenticated MCS method to compute $d_s(q, G)$. If $d_s(q, G) \leq t$, $G$ is an answer and the SP stores the mapping of $mcs(q, G)$ in a list $M$; otherwise, $G$ is not an answer and the SP adds the structures for computing the MCS at the client into a list $N$. $M$ and $N$ comprise the VO for the candidates, denoted as $VO_{cand}$.

    (6) The query result $RS$ and the VO are returned, and the client performs the authentication using $VO_{index}$ and $VO_{cand}$.

    4 METRIC BASED FILTERING

    Among the studies that follow the filtering-and-verification framework, we adopt the metric-based filtering approach in this paper, as the metric properties can be exploited to address the authentication of subgraph similarity search.

    Lemma 4.1. Let $q$, $G_1$ and $G_2$ denote the query graph and two data graphs, respectively. Given a similarity threshold $t$, if

    $$1 - \frac{|mcs(G_1, G_2)|}{|G_1|} - d(q, G_1) > t,$$

    then $d_s(q, G_2) > t$.

    Lemma 4.1 states that the subgraph similarity between a graph and a query can be expressed in terms of graph similarity and $1 - \frac{|mcs(G_1, G_2)|}{|G_1|}$. However, $1 - \frac{|mcs(G_1, G_2)|}{|G_1|}$ is not a metric function either. We address this by introducing a notion of augmented graphs.

Definition 4.1. An augmented graph $G_1^*$ with respect to $G_2$ is defined as follows: $G_1^* = G_1$ if $|G_1| \ge |G_2|$; otherwise, $G_1^* = G_1 \cup A$, where $A$ is an augmented subgraph whose nodes and edges carry labels that never occur in $G_2$ or in possible queries, added until $|G_1^*| = |G_2|$.

Example 4.1. Consider the graphs $G_1$, $G_2$ and $G_3$ in Fig. 1. Fig. 5a shows the augmented graph $G_1^*$ of $G_1$ w.r.t. $G_2$, where $G_1^* = G_1$ as $|G_1| > |G_2|$. Fig. 5b shows the augmented graph $G_1^*$ of $G_1$ w.r.t. $G_3$, where $G_1^* = G_1 \cup A$ and $A$ is an augmented subgraph of two nodes, denoted by the dashed circles and lines.

We can establish our pruning condition, Theorem 4.2, by applying Lemma 4.1.

Theorem 4.2. Given a query $q$, an augmented graph $G_1^*$ and a graph $G_2$, if $d(G_1^*, G_2) - d(q, G_1^*) > t$, then $d_s(q, G_2) > t$.

We remark that the augmented graph $G_1^*$ is virtual: it does not need to be materialized in indexing, i.e., the original graph $G_1$ can still be used. To support the pruning condition of $d_s(q, G_2)$ by Theorem 4.2, $d(q, G_1^*)$ and $d(G_1^*, G_2)$ are needed. Regarding $d(q, G_1^*)$, we store $|G_1^*|$ in the index, which can be easily computed by Definition 4.1 without materializing $G_1^*$. Since $mcs(G_1^*, q) = mcs(G_1, q)$, we can compute $d(q, G_1^*) = 1 - \frac{|mcs(G_1, q)|}{\max\{|q|, |G_1^*|\}}$ by computing $mcs(G_1, q)$ on-the-fly. Regarding $d(G_1^*, G_2)$, it can be stored in our index.
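The pruning test of Theorem 4.2 needs only graph sizes and MCS sizes, so it can be sketched without materializing the augmented graph. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, sizes are assumed to be measured in the same unit as the MCS size, and $mcs(G_1^*, G_2) = mcs(G_1, G_2)$ is assumed because the added labels never occur in $G_2$.

```python
def graph_distance(mcs_size: int, size_a: int, size_b: int) -> float:
    """d(A, B) = 1 - |mcs(A, B)| / max(|A|, |B|), as in Definition 3.1."""
    return 1.0 - mcs_size / max(size_a, size_b)


def augmented_size(size_g1: int, size_g2: int) -> int:
    """|G1*|: G1 is virtually padded with fresh labels until it is as large as G2."""
    return max(size_g1, size_g2)


def can_prune(mcs_g1_g2: int, mcs_q_g1: int,
              size_q: int, size_g1: int, size_g2: int, t: float) -> bool:
    """Theorem 4.2: if d(G1*, G2) - d(q, G1*) > t, then G2 cannot be an answer."""
    size_aug = augmented_size(size_g1, size_g2)
    # The padding never matches anything, so the MCS sizes are those of G1 itself.
    d_aug_g2 = graph_distance(mcs_g1_g2, size_aug, size_g2)
    d_q_aug = graph_distance(mcs_q_g1, size_q, size_aug)
    return d_aug_g2 - d_q_aug > t


# Hypothetical sizes: |q| = 4, |G1| = 6, |G2| = 8, threshold t = 0.25.
print(can_prune(mcs_g1_g2=1, mcs_q_g1=4, size_q=4, size_g1=6, size_g2=8, t=0.25))
print(can_prune(mcs_g1_g2=4, mcs_q_g1=4, size_q=4, size_g1=6, size_g2=8, t=0.25))
```

In the first call $d(G_1^*, G_2) = 0.875$ and $d(q, G_1^*) = 0.5$, so the difference exceeds $t$ and $G_2$ is pruned; in the second call the difference is 0 and $G_2$ must be kept as a candidate.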

Since we have transformed the similarity search into a search in the graph metric space $(U, d)$, where $U$ denotes the graph database and $d$ is the graph distance defined in Definition 3.1, one may attempt to directly adopt a traditional metric index to index the graph metric space. For example, the vp-tree [19] partitions the metric space by using pivots with certain radii (akin to the MBRs of the RTree). An internal node of a vp-tree contains a pivot $p$ with a radius $r_p$. The pivot divides $(U, d)$ into two subspaces: (i) the subspace covered by the pivot, $U_p = \{O \in U \mid d(O, p) \le r_p\}$; and (ii) the remaining subspace, $\overline{U_p} = \{O \in U \mid d(O, p) > r_p\}$. The subspaces are recursively partitioned and indexed with subtrees. Pivots can simply be selected from the data graphs. This is regarded as the baseline method and will be compared in our experiments.
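The two-way split performed at a vp-tree node can be sketched as follows. This is an illustrative sketch only: `split_by_pivot` is a hypothetical helper, and integers with the absolute difference stand in for graphs and the graph distance $d$.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_pivot(objects: Sequence[T], pivot: T, radius: float,
                   dist: Callable[[T, T], float]) -> Tuple[List[T], List[T]]:
    """Split objects into those covered by the pivot (d <= radius) and the rest."""
    covered: List[T] = []
    remaining: List[T] = []
    for obj in objects:
        if dist(obj, pivot) <= radius:
            covered.append(obj)
        else:
            remaining.append(obj)
    return covered, remaining


# Toy metric space: integers under absolute difference (a stand-in for graphs).
covered, remaining = split_by_pivot([1, 4, 5, 9], pivot=5, radius=2,
                                    dist=lambda a, b: abs(a - b))
print(covered, remaining)
```

Applying the same split recursively to each side yields the tree; a real index over graphs would pay one MCS computation per distance evaluation.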

    5 GRAPH METRIC TREE

As motivated in Section 1, existing techniques on similarity search are not designed for authentication and may encounter several problems when adopted. In this section, we propose an index called Graph Metric Tree, which forms the basis of our authentication algorithm. This section focuses on similarity search; the details for authentication are presented in Section 6.

5.1 GMTree Structure

GMTree is designed based on a variant of vp-tree [44], where a metric space is partitioned into a collection of non-overlapping circular subspaces with the same center.

    Fig. 4. An overview of our authentication method.

    Fig. 5. Examples of augmented graphs.

    1842 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 7, JULY 2015

Definition 5.1. Given a graph metric space $(U, d)$, where $U = \{G_1, \ldots, G_n\}$ and $d$ is the graph distance (Definition 3.1), a pivot $p$ is a graph, where $p \in D \cup S$ and $S = \{S \mid S \subseteq G_i, G_i \in D\}$. Given $p$, $U$ is partitioned into $c$ circular non-overlapping subspaces with radii $r_p^0, \ldots, r_p^{c-1}$ as follows:

$U_0$: $\{G \mid r_p^0 \le d(p, G) < r_p^1, G \in U\}$, where $r_p^0 = 0$; $U_i$: $\{G \mid r_p^i \le d(p, G) < r_p^{i+1}, G \in U\}$, for $1 \le i$ [...]

[...] $b|q|$, where $b \in [0, 1]$ is a user-defined parameter whose value is close to 1, the client simply enumerates the subgraphs of $q$ of size $|mcs(q, G)| + 1$. If (i) no such subgraph is contained in $G$; and (ii) the $mcs(q, G)$ is a common subgraph of $q$ and $G$, then the $mcs(q, G)$ is authenticated. Otherwise, we adopt auth-VC-mcs for other cases, as Enum is generally not efficient to handle $O(C_{|q|}^{|q|/2})$ subgraphs. Moreover, $b$ can also be tuned to trade off the authentication time against the VO size, as Enum's hint is smaller than that of auth-VC-mcs.
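To see why Enum is reserved for large declared MCSs, one can count how many subgraphs of $q$ the client would have to enumerate. The sketch below is a rough illustration under stated assumptions: graph size is taken to be the number of vertices, only vertex subsets are counted, and `choose_method` is a hypothetical name for the rule that Enum is used when the declared MCS size exceeds $b|q|$.

```python
from math import comb


def enum_workload(q_size: int, declared_mcs_size: int) -> int:
    """Number of vertex subsets of q of size |mcs| + 1 that Enum would inspect."""
    return comb(q_size, declared_mcs_size + 1)


def choose_method(q_size: int, declared_mcs_size: int, b: float) -> str:
    """Use Enum only when the declared MCS is larger than b * |q|."""
    return "Enum" if declared_mcs_size > b * q_size else "auth-VC-mcs"


# For a 12-vertex query, a declared MCS of size 11 leaves one subset to check,
# whereas a declared MCS of size 5 leaves C(12, 6) subsets.
print(enum_workload(12, 11), choose_method(12, 11, b=0.9))
print(enum_workload(12, 5), choose_method(12, 5, b=0.9))
```

The count peaks near $|q|/2$, which matches the $O(C_{|q|}^{|q|/2})$ worst case mentioned above.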

    7.2 Pivot Selection

To optimize the performance of GMTree, we propose a sampling approach to select pivots for a GMTree that produces a small candidate set. Existing works on pivot selection often aim to minimize some cost function in a heuristic manner. In this paper, we adopt the min max covering radius heuristic as it produces the fewest candidates, as reported in [17].

A unique property of our graph data is that they can be decomposed into subgraphs, whereas the data objects in a conventional metric space are assumed to be atomic. In our preliminary experiments, we observed that using subgraphs as pivots could produce smaller costs than using the data graphs. Therefore, in this paper, we consider subgraphs in our pivot selection. Our min max covering radius heuristic is formally presented as follows.

Definition 7.2. Given a set of graphs $D = \{G_1, G_2, \ldots, G_m\}$ and the set of subgraphs $S = \{S \mid S \subseteq G, G \in D\}$, select one graph $p$ as pivot from $D \cup S$ to minimize the cost function

$$X = \max_{i = 1 \ldots m} \{ d(p^*, G_i) \}, \qquad (3)$$

where $p^*$ is the augmented pivot constructed from $p$ as discussed in Section 4.
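The cost function of Definition 7.2 can be evaluated by brute force over a given pool of pivot candidates. The sketch below is illustrative only: `select_pivot` is a hypothetical helper, and the callable `dist` stands in for the distance between the augmented pivot and a data graph (integers under absolute difference are used as a toy metric).

```python
from typing import Callable, Iterable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_pivot(candidates: Iterable[T], graphs: Sequence[T],
                 dist: Callable[[T, T], float]) -> Tuple[T, float]:
    """Return the candidate with the smallest covering radius max_i dist(p, G_i)."""
    best = None
    best_cost = float("inf")
    for p in candidates:
        cost = max(dist(p, g) for g in graphs)  # covering radius of p
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost


# Toy data: the candidate pool contains the data objects plus one extra object.
data = [1, 4, 10]
pivot, radius = select_pivot(data + [5], data, dist=lambda a, b: abs(a - b))
print(pivot, radius)
```

Here the extra candidate 5 wins with covering radius 5, loosely mirroring the observation that a pivot outside the data graphs (a subgraph, in the paper) can give a smaller cost.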

The pivot selection problem is NP-hard. Due to space limitations, we present the proof in Appendix A.8. For practical considerations, we propose a sampling method to select pivots.

Sampling approach. We randomly sample $s$ graphs from the pruned search space. This can be done by first randomly choosing data graphs from $D$ and then randomly deleting some vertices from the chosen data graphs. We try all sample graphs and compute their costs. The minimum cost found in this way is the sample minimum, denoted as $\hat{a}$. Note that $\hat{a}$ is a random variable. We use $\hat{a}$ to approximate the population minimum, denoted as $a$. Suppose $b$ is the population maximum. Then, $\hat{a} \in [a, b]$. Since $D \cup S$ is finite, $a$ and $b$ exist and are finite. By the definition of graph distance, $0 \le a \le \hat{a} \le b \le 1$. Suppose the graph distance between any two graphs is uniformly distributed in $[a, b]$. Let $\epsilon$ bound the error of $\hat{a}$ from $a$. Then, for any $0 < \epsilon < b - a$, we have

$$\Pr(\hat{a} - a \le \epsilon) \ge 1 - (1 - \epsilon)^s. \qquad (4)$$

Formula (4) states that the sample minimum $\hat{a}$ is very close to the population minimum $a$, i.e., $\hat{a} - a$ is bounded by an arbitrarily small number $\epsilon$, with a high probability. Specifically, $1 - (1 - \epsilon)^s$ approaches 1 exponentially fast with respect to the sample size $s$. For example, when $\epsilon = 0.1$, just 30 random samples can guarantee $\hat{a} - a < 0.1$ with probability larger than 95 percent.
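The bound in Formula (4) is easy to evaluate numerically. The following minimal sketch (helper names are hypothetical) computes the guaranteed probability for a given sample size and the smallest sample size that reaches a target probability.

```python
import math


def success_lower_bound(eps: float, s: int) -> float:
    """Right-hand side of Formula (4): 1 - (1 - eps)^s."""
    return 1.0 - (1.0 - eps) ** s


def samples_needed(eps: float, confidence: float) -> int:
    """Smallest s such that 1 - (1 - eps)^s >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - eps))


print(round(success_lower_bound(0.1, 30), 4))
print(samples_needed(0.1, 0.95))
```

With $\epsilon = 0.1$ the bound already exceeds 95 percent at 29 samples, which is consistent with the statement that 30 samples suffice.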

Determining pivot size. The sampling may select large pivots to minimize the number of candidates, but large pivots slow down the GMTree traversal. We propose a simple cost model to quantify this trade-off. We denote the cost at a certain pivot size as Cost. Cost consists of two parts: (i) the number of the maximum matchings $|J|$ and the minimum vertex covers $|VC|$ in the MCS computation of pivots, as the auth-VC-mcs needs a scan on these structures; and (ii) the number of candidates:

$$Cost = |J| + |VC| + k \cdot |C_q|, \qquad (5)$$

where $k$ denotes the relationship between $|J| + |VC|$ and $|C_q|$, which is determined by issuing many random queries.

Fig. 12. The process of auth-VC-mcs.

To choose an optimal pivot size using our model, we can simply increase the pivot size and stop once $\Delta Cost$ is observed to be larger than zero. To obtain the Cost for each pivot size, we apply the sampling method on the graphs of that size in $D \cup S$ to determine the pivots.
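The stopping rule can be sketched as a simple loop over increasing pivot sizes. This is an illustration under assumptions: Equation (5) is read as $Cost = |J| + |VC| + k \cdot |C_q|$, the function names are hypothetical, and the measurements in the usage example are made-up numbers rather than results from the paper.

```python
from typing import Callable, Sequence


def cost(j_size: int, vc_size: int, k: float, num_candidates: int) -> float:
    """Cost model, assuming Cost = |J| + |VC| + k * |C_q|."""
    return j_size + vc_size + k * num_candidates


def choose_pivot_size(sizes: Sequence[int], cost_at: Callable[[int], float]) -> int:
    """Increase the pivot size and stop as soon as the cost starts to rise."""
    best_size = sizes[0]
    best_cost = cost_at(best_size)
    for size in sizes[1:]:
        current = cost_at(size)
        if current - best_cost > 0:  # Delta Cost > 0: the previous size was better
            break
        best_size, best_cost = size, current
    return best_size


# Hypothetical measurements per pivot size: (|J|, |VC|, |C_q|).
measured = {6: (5, 5, 9000), 8: (7, 7, 6000), 10: (10, 10, 4000), 12: (20, 20, 3500)}
size = choose_pivot_size(sorted(measured),
                         lambda n: cost(measured[n][0], measured[n][1], 0.01, measured[n][2]))
print(size)
```

The loop assumes the cost curve decreases and then increases; if the curve has several local minima, this greedy rule returns the first one.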

The time complexity of GMTree construction with the sampling-based pivot selection is $O((\log_c |D| - 1) \cdot s \cdot |G| \cdot dist \cdot |D|)$, where $dist$ is the time complexity of VC-mcs in Section 7.1.1. For brevity of presentation, we provide the detailed analysis in Appendix A.9.

    8 EXPERIMENTAL EVALUATION

In this section, we present an extensive experimental evaluation that verifies the efficiency of our proposed techniques and the effectiveness of our optimizations. We performed an experimental comparison with the baseline method in Section 4 and with Grafil. Here, Grafil refers to the seminal subgraph similarity search method Grafil [1] extended with a basic authentication. In particular, the edge-based similarity definition of Grafil has been replaced by Definition 3.2. Regarding the baseline method, we used data graphs as pivots and turned off all the proposed optimizations. Regarding Grafil, we built MHTs on Grafil's matrix index for authentication. Specifically, we built the MHT on the column vectors of the matrix, as Grafil accesses its matrix by columns.

Experimental settings. We ran our experiments on a server with a dual 6-core 2.66 GHz CPU running CentOS 5.6. Our implementation was written in Java with JDK 1.6. The maximum memory for our Java Virtual Machine was set to 4 GB. We used an external graph isomorphism library, VFLib [47], and the state-of-the-art MCS computation library CCP4 [48]. SHA and RSA were used as our crypto signing schemes.

Benchmark data sets. We used a real data set AIDS, as in [1], [2], [3], [6], and a synthetic data set SYN [49] in our evaluation. Since PubChem [10] is used in our motivating examples, we tested our algorithms on a PUBCHEM data set in Appendix D. Our AIDS data set was provided by Yan et al. [1] and contains 10,000 molecular graphs. Our synthetic data generator was provided by Cheng et al. [49]. The statistics of the data sets are presented in Table 2 in Appendix C.

Query workload. Similar to [1], [2], we tested query graphs of different sizes: Q4, Q8 and Q12. Qm ($m = 4, 8, 12$) means that the query graph has $m$ vertices. Qm is also a mixture of queries of different densities. Each Qm contains 1,000 query graphs. Following [2], the possible numbers of missing vertices $s$ of $q$ are 1, 2 and 3. Due to space restrictions, we present the complete description in Appendix C.

Our experiments report the performance of our techniques in terms of the SP's query time, the client's authentication time and the VO size, respectively. The reported figures are averages over the 1,000 queries in each query set. VOs were simply compressed by the gzip package in JDK 1.6. To study the GMTree's characteristics under a wide range of fanouts, our experiments were conducted with fanouts of $2^2$, $2^4$ and $2^6$, unless otherwise specified.

    8.1 Comparison with the Baseline and Grafil

This experiment compares the performance of GMTree with the baseline and Grafil in terms of the VO size and the time on AIDS and SYN, respectively.

Comparison of $VO_{index}$ size. Figs. 13a and 13b compare the $VO_{index}$ size on AIDS and SYN, respectively. Fig. 13a shows that our $VO_{index}$ is only 3 percent of that of Grafil. The reason is that the column vectors of the matrix of Grafil are high dimensional, as the matrix often has thousands of rows, and MHTs on high dimensional data usually produce large VOs. In addition, since the candidate graphs of our GMTree are located near each other, we need only about 6 percent of the digests of Grafil to authenticate the traversal boundary. We do not show Grafil on SYN because the feature extraction on SYN cannot finish due to its high density. Figs. 13a and 13b also show that our $VO_{index}$ is clearly smaller than that of the baseline, as the filtering of the baseline is poor.

Comparison of $VO_{cand}$ size. Figs. 13c and 13d compare the $VO_{cand}$ size on AIDS and SYN, respectively. From Fig. 13c, we note that our $VO_{cand}$ is in the same order of magnitude as that of Grafil, since we have comparable numbers of candidate graphs. Specifically, for Q8 on AIDS, the precision of GMTree (i.e., $|R_q|/|C_q|$) is 0.7K/4.7K, 3.4K/4.9K and 4.6K/6.6K for $s = 1, 2, 3$, respectively. In comparison, the precision of Grafil is 0.7K/5.8K, 3.4K/7.7K and 4.6K/8.5K for $s = 1, 2, 3$, respectively. We highlight that the similarity search by nature has large candidate sets, and that is why we optimize the MCS computation. While the size of GMTree is in the same order of magnitude as Grafil (on AIDS, GMTree is 4.9M bytes and Grafil is 1.6M bytes), the candidates of GMTree are smaller than those of Grafil, by using Lemma 1 on GMTree's leaf nodes. In particular, the average size of the top 5 percent largest candidates of GMTree is 1.5 times smaller than that of Grafil. Figs. 13c and 13d also show that our $VO_{cand}$ is about 40 percent smaller than that of the baseline.

Fig. 13. Comparison of GMTree with Grafil and the baseline in terms of VO size and time on AIDS and SYN.

Comparison of query time. Figs. 13e and 13f compare the SP's query time on AIDS and SYN, respectively. Fig. 13e shows that our GMTree outperforms Grafil. In particular, our GMTree is faster than Grafil by a factor of 2 when $s = 1$. Figs. 13e and 13f also show that the GMTree is about 40 percent faster than the baseline on average.

Comparison of authentication time. Figs. 13g and 13h compare the authentication time on AIDS and SYN. Fig. 13g shows that the GMTree is at least 23 and 28 percent faster than Grafil and the baseline, respectively, on AIDS. Fig. 13h shows that GMTree is at least 44 percent faster than the baseline on SYN.

    8.2 Authenticated Query Overhead

In this experiment, we show the performance breakdown of authenticated and unauthenticated queries in Fig. 14. Figs. 14a and 14b show that the overhead needed to support authenticated queries is about 100 percent of the time of an unauthenticated query. However, Fig. 14a shows that the client's time is much smaller than the unauthenticated one. Fig. 14b further shows the comparison on data graphs with various densities. We observe that GMTree is more efficient and effective on sparse graphs, as the distance computation required increases significantly as the density increases. (This is also observed by [2], [8].) This is consistent with Fig. 14a in that GMTree is efficient on real data sets, since chemical molecular graphs are often sparse (as reported in, e.g., [1], [46]). GMTree is more efficient than the state-of-the-art technique Grafil on the real data set (Fig. 13). Finally, we remark that the SP is often equipped with powerful machines, which is not the case here.

    8.3 Experiments on Query Size

In this experiment, we use the AIDS and SYN data sets and their Q4, Q8 and Q12 query sets. To study the effect of query size on our techniques, we fix $s$ to 1. We vary $s$ in later experiments. Fig. 15 shows the SP's query time, the client's authentication time and the overall VO size, respectively. In the following, we simply use the terms query time and authentication time for short, if it is clear from the context.

Query time. From Figs. 15a and 15b, we note that the query time increases significantly with the growth of query size. This is consistent with the complexity of MCS computation. We also note that the query time reduces as the fanout increases. For example, on AIDS, the query time of Q8 at fanout 64 is 41 percent of that at fanout 16. This is because the larger the fanout, the less MCS computation is needed.

Authentication time. Figs. 15c and 15d show that the authentication time is much smaller than the query time. In particular, on AIDS, the authentication time of Q4 at fanout 16 is about 1/2 of its query time, 1/6 for Q8 and 1/8 for Q12, respectively. This is mainly for three reasons. (i) Regarding our Enum method, if the size of the MCS between $q$ and pivot $p$ is declared to be $a$, the client only needs to decide whether $q$ has a subgraph of size $a + 1$ that is contained in $p$, while in query evaluation the SP may need to examine all subgraphs of sizes from $a + 1$ to $|q|$. (ii) Regarding our vertex-cover based method, the client need not compute the maximum matchings and the minimum vertex covers on-the-fly. (iii) Checking the mapping for answer graphs is efficient.

Figs. 15c and 15d also show that the authentication time first reduces slightly and then increases significantly with the growth of query size. This is because when the query size increases, the number of candidate graphs decreases, but each candidate's authentication time increases. The slight reduction from Q4 to Q8 is because the reduction of candidates has slightly more impact than the increase of each candidate's authentication time. In particular, at fanout 16 on AIDS, the candidate number reduces by about 4K, from Q4's 9K to Q8's 5K, but each candidate's authentication time only increases by 2.4 ms, from 4.8 to 7.2 ms. The time is averaged over 1,000 queries. The subsequent notable increase from Q8 to Q12 is because each candidate's authentication time increases much faster than the number of candidates decreases. In particular, at fanout 16 on AIDS, the number of candidates decreases by only about 20 percent, from Q8's 5K to Q12's 4K, but each candidate's authentication time increases significantly, by about 10 times, from 7.2 to 67 ms.

Fig. 14. Authentication overheads on AIDS and SYN.

Fig. 15. Experiments on query size on AIDS and SYN.

VO size. Figs. 15e and 15f show the total VO size. We observe that the VO size decreases with the growth of query size. This is because the VO size is dominated by the candidate graphs, and larger queries result in fewer candidate graphs. We also observe that the VO size increases as the fanout increases, because the larger the fanout, the more candidates are produced. This is consistent with the observations on MRTree and MBTree.

    8.4 Effectiveness of Optimizations on GMTree

In this experiment, we focus on the AIDS data set, as other data sets exhibit similar performance characteristics.

Pivot selection. Figs. 16a and 16b present the pivot size decision and the sample number decision in our pivot selection, respectively. The fanout of GMTree is 16 and $s$ is 1 in this experiment. The x-axis of Fig. 16a is the pivot size; the left y-axis is the percentage of candidates out of the whole database and the right y-axis is the query time. Fig. 16a shows the balance between the number of candidates and the query time for various pivot sizes. Note that the query time here is linear in $|J| + |VC|$. The decreasing line shows that the number of candidates reduces with the growth of pivot size, whereas the increasing line indicates that the query time increases. Fig. 16a shows that the optimal pivot size is 10, which is consistent with our pivot size decision model in Section 7.2. In Fig. 16b, the x-axis is the sample size and the y-axis is the percentage of candidates out of the whole database. The line in Fig. 16b converges after 10 samples. This means that we need only a small number of samples to obtain stable performance, which is consistent with our analysis in Section 7.2.

Authenticated MCS computation. To clearly illustrate the performance of our authenticated MCS computation between $q$ and pivots, the query time in this experiment does not include the time to handle candidate graphs, and the VO is just that caused by pivots. The fanout of GMTree is 16 and $s$ is 1 in this experiment. Figs. 16c and 16d show the query time and the VO size of our auth-VC-mcs, respectively. Recall that if the declared MCS size is larger than $b|q|$, we use the Enum method; otherwise, the VC-mcs method. Fig. 16c shows that the query time first reduces and then increases with the growth of $b$. This shows the balance between Enum and VC-mcs: (i) the smaller $b$, the more Enum is used; and (ii) Enum is more efficient for large MCSs, whereas VC-mcs is better for small MCSs. Fig. 16d shows that the VO size increases with the growth of $b$. This is because the VO needed by Enum is smaller than that of VC-mcs.

Figs. 16e and 16f show the effectiveness of our auth-VC-mcs. Fig. 16e shows that, while taking 12 percent more query time at the SP, the auth-VC-mcs saves 40 percent of the client's authentication time. This is desirable in practice, as the SP often has advanced computation ability, whereas the client does not. Fig. 16f shows that the VO overhead of the auth-VC-mcs is 10K bytes, which is just 2 percent of the overall VO.

Remarks. Our experiments reveal the characteristics of GMTree with respect to its parameters. The SP can easily offer performance options such as query time optimized, balanced, and network optimized for data owners/clients to choose from, and build the corresponding indexes offline.

    9 CONCLUSION

This paper studies the authenticated subgraph similarity search in outsourced graph databases. We transform the subgraph similarity search into a search in a graph metric space and propose GMTree. Our novelties and technicalities reside in the authentication techniques. First, candidate graphs determined by GMTree are often localized. Second, we propose a pivot selection which allows using data subgraphs as pivots. Third, we propose an authenticated MCS computation to reduce the computation at clients. Our experiments show that GMTree is efficient and the VOs are small. In future work, we will investigate supergraph similarity search.

    APPENDIX A

    PROOFS

In this appendix, we present all the proofs skipped in the main body of this paper.

A.1 Proof of Lemma 4.1

Lemma 4.1. Let $q$, $G_1$ and $G_2$ denote the query graph and two data graphs, respectively. Given a similarity threshold $t$, if $1 - \frac{|mcs(G_1, G_2)|}{|G_1|} - d(q, G_1) > t$, then $d_s(q, G_2) > t$.

    Fig. 16. Performance results of optimizations.


Proof. We conduct the proof through a detailed case analysis.

Case A: $|q| \ge |G_1|$. It follows that

$$1 - \frac{|mcs(G_1, G_2)|}{|q|} \ge 1 - \frac{|mcs(G_1, G_2)|}{|G_1|}. \qquad (6)$$

Case A.1: $|G_2| \ge |q|$. In this case, $G_2$ has at least one subgraph of size $|q|$. Let $S = \{S_1, S_2, \ldots, S_n\}$ denote all the subgraphs of $G_2$ of size $|q|$. There must exist at least one $S_m$ in $S$, such that (i) $|mcs(q, G_2)| = |mcs(q, S_m)|$, and (ii) $|mcs(q, S_m)| \ge |mcs(q, S_i)|$ for $S_i \in S$.

By (i), $d_s(q, G_2) = d(q, S_m)$ and by (ii), $d(q, S_i) \ge d(q, S_m)$ for $S_i \in S$. We then have the following necessary and sufficient condition:

$$d_s(q, G_2) > t \iff d(q, S_i) > t, \text{ for all } S_i \in S. \qquad (7)$$

Now our task is to prove $d(q, S_i) > t$ for all $S_i \in S$. Since $S_i$ is a subgraph of $G_2$, $|mcs(G_1, G_2)| \ge |mcs(G_1, S_i)|$ for all $S_i \in S$. Hence, for $\forall S_i \in S$,

$$1 - \frac{|mcs(G_1, S_i)|}{|S_i|} \ge 1 - \frac{|mcs(G_1, G_2)|}{|S_i|}. \qquad (8)$$

Because $|q| = |S_i|$, (8) becomes

$$1 - \frac{|mcs(G_1, S_i)|}{|S_i|} \ge 1 - \frac{|mcs(G_1, G_2)|}{|q|}. \qquad (9)$$

Combining (6) and (9), we have

$$1 - \frac{|mcs(G_1, S_i)|}{|S_i|} \ge 1 - \frac{|mcs(G_1, G_2)|}{|G_1|}. \qquad (10)$$

Since $|G_1| \le |S_i|$, $d(G_1, S_i) = 1 - \frac{|mcs(G_1, S_i)|}{|S_i|}$, and (10) becomes

$$d(G_1, S_i) \ge 1 - \frac{|mcs(G_1, G_2)|}{|G_1|}.$$

Since $1 - \frac{|mcs(G_1, G_2)|}{|G_1|} - d(q, G_1) > t$ is given in Lemma 4.1, $d(G_1, S_i) - d(q, G_1) > t$. Finally, considering $d(q, S_i) \ge d(G_1, S_i) - d(q, G_1)$ by the triangle inequality, we obtain $d(q, S_i) > t$, for all $S_i \in S$.

Case A.2: $|G_2| < |q|$. In this case, we have $d_s(q, G_2) = d(q, G_2)$, as $\max\{|q|, |G_2|\} = |q|$. Our task is then to prove $d(q, G_2) > t$.

Case A.2.1: $|G_1| \le |G_2| < |q|$. Since $d(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{|G_2|} \ge 1 - \frac{|mcs(G_1, G_2)|}{|G_1|}$, we have $d(G_1, G_2) - d(q, G_1) > t$ by the condition given in Lemma 4.1. By the triangle inequality, $d(q, G_2) > t$.

Case A.2.2: $|G_2| < |G_1| \le |q|$. In this case, $d(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{|G_1|}$. Hence, $d(G_1, G_2) - d(q, G_1) > t$ by the condition given in Lemma 4.1. By the triangle inequality, $d(q, G_2) > t$.

Case B: $|q| < |G_1|$.

Case B.1: $|G_2| \ge |q|$. Let $S = \{S_1, S_2, \ldots, S_n\}$ denote the set of subgraphs of $G_2$ of size $|q|$. The necessary and sufficient condition (7) still applies. Our task is to prove $d(q, S_i) > t$ for all $S_i \in S$.

Since $S_i$ is a subgraph of $G_2$, $|mcs(G_1, G_2)| \ge |mcs(G_1, S_i)|$. Then, for all $S_i \in S$,

$$1 - \frac{|mcs(G_1, G_2)|}{|G_1|} \le 1 - \frac{|mcs(G_1, S_i)|}{|G_1|} = d(G_1, S_i).$$

Since $1 - \frac{|mcs(G_1, G_2)|}{|G_1|} - d(q, G_1) > t$ is given in Lemma 4.1, $d(G_1, S_i) - d(q, G_1) > t$ for all $S_i \in S$. Thus, by the triangle inequality, $d(q, S_i) > t$ for all $S_i \in S$.

Case B.2: $|G_2| < |q|$. In this case, $d_s(q, G_2) = d(q, G_2)$. Since $|G_1| > |G_2|$, $d(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{|G_1|}$. Thus, $d(G_1, G_2) - d(q, G_1) > t$ by the condition given in Lemma 4.1. By the triangle inequality, $d(q, G_2) > t$. $\square$

    A.2 Proof of Theorem 4.2

Theorem 4.2. Given a query $q$, an augmented graph $G_1^*$ and a graph $G_2$, if $d(G_1^*, G_2) - d(q, G_1^*) > t$, then $d_s(q, G_2) > t$.

Proof. We recall the definition of augmented graph (Definition 4.1) as follows.

An augmented graph $G_1^*$ with respect to $G_2$ is defined as follows: $G_1^* = G_1$ if $|G_1| \ge |G_2|$; otherwise, $G_1^* = G_1 \cup A$, where $A$ is an augmented subgraph whose nodes and edges carry labels that never occur in $G_2$ or in possible queries, added until $|G_1^*| = |G_2|$.

We conduct the proof by applying Lemma 4.1 and Definition 3.1 as follows.

$$
\begin{aligned}
& d(G_1^*, G_2) - d(q, G_1^*) > t \\
\Rightarrow\ & 1 - \frac{|mcs(G_1^*, G_2)|}{\max\{|G_1^*|, |G_2|\}} - d(q, G_1^*) > t && \text{by Definition 3.1} \\
\Rightarrow\ & 1 - \frac{|mcs(G_1^*, G_2)|}{|G_1^*|} - d(q, G_1^*) > t && \text{as } |G_1^*| \ge |G_2| \\
\Rightarrow\ & d_s(q, G_2) > t && \text{by Lemma 4.1.} \quad \square
\end{aligned}
$$

A.3 Calculation of $H_b$ in Section 6.4

We first present the formulas for the size of $H_b$ and the number of candidate graphs of GMTree. Then, we plug in some numbers to construct the example presented in Section 6.4.

Proposition A.3. Given a GMTree $T$ with a fanout $c$ indexing a graph database $D$, if each node in $T$ has a fraction $k \in [0, 1]$ of sub-GMTrees visited, the number of candidate graphs is

$$|D| \cdot k^{\log_c |D| - 1}$$

and the number of digests needed in $H_b$ is

$$(1 - k)c \cdot \frac{1 - (kc)^{\log_c |D| - 1}}{1 - kc}.$$

Proof. Since GMTree $T$ is a complete tree, $T$ has $|D|/c$ leaves and $T$'s height is $h = \log_c |D|$.

Regarding the number of candidate graphs: Since each node of $T$ has a fraction $k$ of subtrees visited, the total number of leaf nodes visited is $(kc)^{h-1}$. Since each leaf node contains $c$ data graphs, the number of candidate graphs is $c^h \cdot k^{h-1} = |D| \cdot k^{\log_c |D| - 1}$.

Regarding the number of digests in $H_b$: Since each node of $T$ has a fraction $k$ of subtrees visited, each node has $(1 - k)c$ subtrees pruned. Therefore, $T$'s $i$th level has $(1 - k)c \cdot (kc)^{i-1}$ subtrees pruned for $1 \le i \le h - 1$. Since we need a digest for each pruned subtree in $H_b$, the total number of digests in $H_b$ is

$$\sum_{i=1}^{h-1} (1 - k)c \cdot (kc)^{i-1} = (1 - k)c \cdot \frac{1 - (kc)^{\log_c |D| - 1}}{1 - kc}. \quad \square$$

Now we are ready to plug in some numbers to construct the example presented in Section 6.4. If $k = 3/4$, $c = 8$ and $|D| = 10{,}000$, the number of candidate graphs is about 3,700 and GMTree needs 186 digests in $H_b$ to authenticate the presence of all the candidate graphs. In comparison, if the candidate graphs and non-candidate graphs are stored alternately, each candidate graph needs a digest and hence 3,700 digests are needed in the worst case. Therefore, GMTree needs just 5 percent of the digests of the worst case.
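The numbers in this example can be reproduced directly from the two formulas of Proposition A.3. The sketch below uses hypothetical function names; the branch for $kc = 1$ is an added guard for the degenerate geometric series and is not discussed in the text.

```python
import math


def candidate_count(db_size: int, fanout: int, k: float) -> float:
    """|D| * k^(log_c|D| - 1): candidates when a fraction k of subtrees is visited."""
    height = math.log(db_size, fanout)
    return db_size * k ** (height - 1)


def digest_count(db_size: int, fanout: int, k: float) -> float:
    """(1 - k)c * (1 - (kc)^(log_c|D| - 1)) / (1 - kc): digests stored in H_b."""
    height = math.log(db_size, fanout)
    kc = k * fanout
    if kc == 1:  # geometric series with ratio 1 (guard added for this sketch)
        return (1 - k) * fanout * (height - 1)
    return (1 - k) * fanout * (1 - kc ** (height - 1)) / (1 - kc)


candidates = candidate_count(10_000, 8, 0.75)
digests = digest_count(10_000, 8, 0.75)
print(round(candidates), round(digests), round(100 * digests / candidates, 1))
```

The ratio of digests to candidates comes out at roughly 5 percent, in line with the comparison against the worst case above.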

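A.4 Proof of Cost Model of VO Size in Section 6.4

Let $s_{sig}$, $s_{rev}$, $s_{rad}$, $s_{ptr}$, $s_h$ and $s_G$ denote the sizes of a signature, a relevant content of a node, a radius value, a pointer, a digest and a data graph, respectively. Let $V_{visited}$ denote the set of visited nodes of GMTree. Please refer to Table 1 for the frequently used notations.

Proposition A.4. The overall VO size of a query $(q, t)$ on a GMTree $T$ of a fanout $c$ is:

$$
\begin{aligned}
|VO| ={} & \sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} \big( s_{rev} + (c - 1)\, s_{rad} \big) + \sum_{v \text{ is visited}} s_{ptr} - s_{ptr} && \text{// } H_v \text{ for internal} \\
& + 2 s_h \Big( \sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} c - |V_{visited}| + 1 \Big) && \text{// } H_b \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is leaf}}} \big( (c + 1)\, s_h + c \cdot s_{ptr} \big) && \text{// } H_v \text{ for leaf} \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is leaf}}} c\, ( s_h + s_G + s_i ) && \text{// } M \text{ and } N \\
& + s_{sig} && \text{// } s_{root}.
\end{aligned}
$$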

Proof. Procedure auth_similarity performs a traversal on $T$ and constructs the VO simultaneously by Definition 6.2. During the traversal, when auth_similarity visits a node $v$ (e.g., grey circles and rectangles in Fig. 17), auth_similarity adds several terms to verify $v$ in the VO. Specifically,

i) if $v$ is an internal node (e.g., the grey circles in Fig. 17),

- auth_similarity adds $s_{rev}$ bytes in $H_v$ for the relevant content of $v$ (for simplicity of analysis, we do not consider the hints that optimize MCS verification, which affect the analysis in a non-trivial manner);

- suppose $t_v$ subtrees of $v$ are not pruned in Lines 11-13 of Fig. 7 (e.g., the grey children of the root in Fig. 17); auth_similarity adds $t_v \cdot s_{ptr}$ bytes for the subtrees not pruned to $H_v$;

- auth_similarity adds $(c - 1) \cdot s_{rad}$ bytes for the radii to $H_v$ (since the first radius is always zero, we do not add it to the VO, to save space);

- auth_similarity also adds $2(c - t_v) \cdot s_h$ bytes for the pruned subtrees (e.g., the white children of the root in Fig. 17), one $s_h$ for the digest of each pruned subtree and one $s_h$ for the digest of each radius, to $H_b$.

ii) Otherwise, $v$ is a leaf (e.g., the grey rectangles in Fig. 17),

- auth_similarity adds $(1 + c) \cdot s_h + c \cdot s_{ptr}$ bytes to $H_v$, including $h_v$, $c$ pointers and $c$ digests of pointers;

- auth_similarity also adds $c \cdot s_G + c \cdot s_h + c \cdot s_i$ bytes to $M$ and $N$ for the answers, non-answers and their digests. (Note that the MCS mapping $m$ in $M$ is not included, for simplicity of analysis, as $m$'s size is at most $s_G$ bytes.)

At the end of the traversal, $s_{sig}$ bytes for $s_{root}$ are added to the VO.

In all, the size of the VO is

$$
\begin{aligned}
|VO| ={} & \sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} \big( s_{rev} + t_v\, s_{ptr} + (c - 1)\, s_{rad} \big) && \text{// } H_v \text{ for internal} \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} 2 (c - t_v)\, s_h && \text{// } H_b \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is leaf}}} \big( (c + 1)\, s_h + c\, s_{ptr} \big) && \text{// } H_v \text{ for leaf} \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is leaf}}} \big( c\, s_G + c\, s_h + c\, s_i \big) && \text{// } M \text{ and } N \\
& + s_{sig} && \text{// } s_{root}.
\end{aligned}
\qquad (11)
$$

TABLE 1. Table of Frequently Used Notations

    Fig. 17. Illustration for VO size analysis.


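Since $t_v$ is the number of visited children of $v$, $\sum_{v \text{ is visited},\, v \text{ is internal}} t_v$ counts the number of nodes visited in the traversal, excluding the root. That is,

$$\sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} t_v = |V_{visited}| - 1.$$

Therefore, Formula (11) becomes

$$
\begin{aligned}
|VO| ={} & \sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} \big( s_{rev} + (c - 1)\, s_{rad} \big) + \sum_{v \text{ is visited}} s_{ptr} - s_{ptr} && \text{// } H_v \text{ for internal} \\
& + 2 s_h \Big( \sum_{\substack{v \text{ is visited} \\ v \text{ is internal}}} c - |V_{visited}| + 1 \Big) && \text{// } H_b \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is leaf}}} \big( (c + 1)\, s_h + c \cdot s_{ptr} \big) && \text{// } H_v \text{ for leaf} \\
& + \sum_{\substack{v \text{ is visited} \\ v \text{ is leaf}}} c\, ( s_h + s_G + s_i ) && \text{// } M \text{ and } N \\
& + s_{sig} && \text{// } s_{root}. \quad \square
\end{aligned}
$$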
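The step from Formula (11) to the final expression can be checked numerically on a small traversal. The sketch below is illustrative: the byte sizes are hypothetical placeholders, `Sizes.i` stands for the quantity written $s_i$ in the formulas, and the traversal is described only by the number of unpruned children $t_v$ of each visited internal node plus the number of visited leaves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sizes:
    sig: int    # s_sig: signature
    rev: int    # s_rev: relevant content of a node
    rad: int    # s_rad: radius value
    ptr: int    # s_ptr: pointer
    h: int      # s_h: digest
    graph: int  # s_G: data graph
    i: int      # s_i as it appears in the formulas


def vo_size_per_node(c: int, unpruned: List[int], leaves: int, s: Sizes) -> int:
    """Formula (11): sum the contribution of every visited node."""
    total = s.sig
    for t_v in unpruned:                                 # visited internal nodes
        total += s.rev + t_v * s.ptr + (c - 1) * s.rad   # H_v for internal
        total += 2 * (c - t_v) * s.h                     # H_b
    total += leaves * ((c + 1) * s.h + c * s.ptr)        # H_v for leaf
    total += leaves * c * (s.graph + s.h + s.i)          # M and N
    return total


def vo_size_simplified(c: int, internal: int, leaves: int, s: Sizes) -> int:
    """Final form, using sum(t_v) = |V_visited| - 1."""
    visited = internal + leaves
    total = s.sig
    total += internal * (s.rev + (c - 1) * s.rad) + (visited - 1) * s.ptr
    total += 2 * s.h * (internal * c - visited + 1)
    total += leaves * ((c + 1) * s.h + c * s.ptr)
    total += leaves * c * (s.h + s.graph + s.i)
    return total


sizes = Sizes(sig=128, rev=64, rad=8, ptr=8, h=20, graph=200, i=4)
# Root visits 2 children; they are internal and visit 1 and 2 leaves: 3 + 3 nodes.
print(vo_size_per_node(4, [2, 1, 2], 3, sizes), vo_size_simplified(4, 3, 3, sizes))
```

Both functions return the same value whenever the $t_v$ values sum to the number of visited nodes minus one, which is exactly the identity used in the proof.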

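A.5 Proof of Formula (4) in Section 7.2

Proposition A.5. Suppose $\hat{a}$, $a$ and $b$ are the sample minimum, population minimum and population maximum, respectively. Suppose further that the graph distance between any two graphs is uniformly distributed in $[a, b]$. Let $s$ denote the sample size and $\epsilon$ bound the error of $\hat{a}$ from $a$. Then, for any $0 < \epsilon < b - a$, we have

$$\Pr(\hat{a} - a \le \epsilon) \ge 1 - (1 - \epsilon)^s.$$

Proof. Since the graph distance $d$ between any two graphs is uniformly distributed in $[a, b]$, let $p(\cdot)$ denote the probability density function of $d$; we have

$$p(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } x \in [a, b], \\ 0 & \text{otherwise.} \end{cases}$$

Because we randomly choose $s$ samples, the sample minimum $\hat{a}$ is also a random variable. Let $\Pr(\hat{a} = v)$ denote the probability that $v$ is the sample minimum. The probability of $\hat{a} = v$ equals

$$\Pr(\hat{a} = v) = s \, (1 - \Pr(d < v))^{s-1} \, p(v), \qquad (12)$$

where $(1 - \Pr(d < v))^{s-1}$ means that $s - 1$ samples are larger than $v$ and the coefficient $s$ means that $v$ can occur at any sample.

Since $\hat{a}$ is the sample minimum, $a \le \hat{a} \le b$ and the random variable $\hat{a} - a$ measures the error of $\hat{a}$ from $a$. Since $0 \le a \le b \le 1$, the error $\hat{a} - a \le b - a \le 1$.

Given $\epsilon \in (0, b - a)$ as a bound of the error, the probability that $\hat{a} - a$ is bounded by $\epsilon$ equals

$$\Pr(\hat{a} < a + \epsilon) = \int_0^{a + \epsilon} \Pr(\hat{a} = z) \, dz = \int_a^{a + \epsilon} \Pr(\hat{a} = z) \, dz. \qquad (13)$$

Substituting Equation (12), Equation (13) becomes

$$
\begin{aligned}
& \int_a^{a+\epsilon} s \, (1 - \Pr(d < z))^{s-1} \, p(z) \, dz \\
={} & \int_a^{a+\epsilon} s \left[ 1 - \frac{z - a}{b - a} \right]^{s-1} \frac{1}{b - a} \, dz \\
={} & \int_a^{a+\epsilon} s \left( \frac{1}{b - a} \right)^{s} (b - z)^{s-1} \, dz \\
={} & \left( \frac{1}{b - a} \right)^{s} \int_a^{a+\epsilon} s \, (b - z)^{s-1} \, dz \\
={} & - \left( \frac{1}{b - a} \right)^{s} \int_a^{a+\epsilon} s \, (b - z)^{s-1} \, d(b - z) \\
={} & - \left( \frac{1}{b - a} \right)^{s} (b - z)^{s} \Big|_a^{a+\epsilon} \\
={} & \left( \frac{1}{b - a} \right)^{s} \big[ (b - a)^s - (b - a - \epsilon)^s \big] \\
={} & 1 - \left( 1 - \frac{\epsilon}{b - a} \right)^{s} \\
>{} & 1 - (1 - \epsilon)^s \qquad \text{because } 0 < b - a < 1,\ \epsilon > 0. \quad \square
\end{aligned}
$$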
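The closed form obtained in the derivation, $1 - (1 - \epsilon/(b - a))^s$, can be sanity-checked with a small Monte Carlo simulation under the same uniformity assumption. The sketch below is illustrative; the parameter values, the function names and the fixed seed are arbitrary choices made for this example.

```python
import random


def exact_probability(eps: float, s: int, a: float, b: float) -> float:
    """Closed form from the derivation: 1 - (1 - eps / (b - a))^s."""
    return 1.0 - (1.0 - eps / (b - a)) ** s


def simulated_probability(eps: float, s: int, a: float, b: float,
                          trials: int = 20000, seed: int = 7) -> float:
    """Fraction of trials whose sample minimum lies within eps of a."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_min = min(rng.uniform(a, b) for _ in range(s))
        if sample_min - a < eps:
            hits += 1
    return hits / trials


exact = exact_probability(0.1, 10, 0.2, 0.8)
print(round(exact, 3), round(simulated_probability(0.1, 10, 0.2, 0.8), 3),
      round(1 - 0.9 ** 10, 3))
```

The simulated frequency should land close to the closed form, and both sit above the looser bound $1 - (1 - \epsilon)^s$ of Formula (4).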

A.6 Proof of Soundness and Completeness of Procedure auth

Theorem A.6. Procedure auth is sound and complete.

Proof (Proof of Soundness). Assume that a graph in the query result in the VO is modified or bogus. Since the hash function is assumed to be one-way, the correct hash of the leaf node (Lines 06, 15 of auth_aux) cannot be synthesized. Hence, $h_{root}$ will not agree with the signature $s_{root}$ provided by the DO, and the malicious modification is detected by the client (Line 02 of auth).

Given that the graphs returned by the SP are not modified, Lines 09-14 of auth_aux guarantee the soundness of the answers.

(Proof of Completeness). Let $G$ be an answer of the query that is not included in $M$. Suppose $v_G$ is the leaf node containing $G$. Either the content of $v_G$ or the hash containing $v_G$ is stored in the VO. In the former case, $G$ will be extracted from $N$ in Line 05 of auth_aux and detected in Lines 12-14 of auth_aux. In the latter case, since $v_G$ contains $G$, $v_G$ must overlap with the query $q$. By applying a similar argument, one of the ancestors of $v_G$, or $v_G$ itself, overlaps with $q$ but is wrongly included in VO.$H_b$ by the SP. Then, there are only two possible cases. First, the number of pruned subtrees (i.e., $t_v$) determined by the pruning condition in Line 22 of auth_aux will not match the number of digests in $H_b$ (Lines 25-27 of auth_aux). Second, the radius list has been forged, which makes Line 22 of auth_aux pass. In either case, the digest cannot be synthesized. The missing result will be detected by Line 02 of auth. $\square$
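The argument relies on the client recomputing the root digest from the returned content and the digests of pruned subtrees. The following is a heavily simplified sketch of that idea, not the GMTree digest definition: it uses a two-level tree, SHA-256 from Python's `hashlib` as an assumed hash function, and a directly trusted root digest in place of the DO's RSA signature; all function names are hypothetical.

```python
import hashlib
from typing import List


def digest(*parts: bytes) -> bytes:
    """Hash a sequence of byte strings, length-prefixed to avoid ambiguity."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


def leaf_digest(graphs: List[bytes]) -> bytes:
    return digest(*graphs)


def root_digest(child_digests: List[bytes]) -> bytes:
    return digest(*child_digests)


def client_check(returned_graphs: List[bytes], pruned_digests: List[bytes],
                 trusted_root: bytes) -> bool:
    """Rebuild the root from the visited leaf and the digests of pruned leaves."""
    rebuilt = root_digest([leaf_digest(returned_graphs)] + pruned_digests)
    return rebuilt == trusted_root


# Owner side: two leaves, the first of which is visited by the query.
leaves = [[b"G1", b"G2"], [b"G3", b"G4"]]
trusted = root_digest([leaf_digest(leaf) for leaf in leaves])
h_b = [leaf_digest(leaves[1])]  # digest of the pruned leaf, sent in the VO

print(client_check([b"G1", b"G2"], h_b, trusted))   # untampered result
print(client_check([b"G1", b"GX"], h_b, trusted))   # a modified graph
```

Any change to a returned graph or to a pruned-subtree digest changes the rebuilt root, which is the property that both the soundness and the completeness arguments appeal to.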

    A.7 Proof of Correctness of auth-VC-mcs

    Theorem A.7. auth-VC-mcs is correct.

    Proof. Foremost, we assume the correctness of VC-mcs [20]. From the procedure of the client's authentication, we note that the client reads the following from VO: the graph p, the maximal independent sets $mis_p$ of p used to construct the bigraphs, the maximum matchings $J_M$ and minimum vertex covers $VC_M$ of the bigraphs, and the signatures. Other structures used in the authentication process are constructed by the client himself. Since the signatures cannot be tampered with, we can prove the correctness of auth-VC-mcs by proving the correctness of p, $mis_p$, and $J_M$ and $VC_M$, respectively.

    Case A. p is tampered with. The synthesized digest of the root of GMTree will not agree with the signature $s_{root}$ in VO, as detected by Procedure auth in Section 6.

    Case B. The $mis_p$ of p is tampered with. The client can easily detect this by using the signature of $mis_p$ recorded in VO and the DO's public key.

    Case C. The maximum matching $J_M$ or the minimum vertex cover $VC_M$ of a bigraph $B_{p,q,M}$ is tampered with. The client can easily detect this by checking whether $|J_M| = |VC_M|$, since, by König's theorem, the size of a maximum matching of a bigraph B equals the size of a minimum vertex cover of B.

    We have proved the correctness of p, $mis_p$, $J_M$ and $VC_M$ in VO, and hence the correctness of auth-VC-mcs is proved. □
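The size check in Case C can be sketched as follows. The sketch goes slightly beyond the stated check: besides comparing sizes, it also confirms that the claimed matching really is a matching and that the claimed cover really covers every edge, because equal sizes certify optimality only for a valid matching and a valid cover. The data layout (edges as pairs, covers split into left and right vertex sets) and all names are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (left vertex, right vertex) of a bipartite graph


def is_matching(edges: Set[Edge], matching: Iterable[Edge]) -> bool:
    """True if every pair is an edge and no vertex is used twice."""
    seen_left, seen_right = set(), set()
    for u, v in matching:
        if (u, v) not in edges or u in seen_left or v in seen_right:
            return False
        seen_left.add(u)
        seen_right.add(v)
    return True


def is_vertex_cover(edges: Set[Edge], cover_left: Set[str],
                    cover_right: Set[str]) -> bool:
    """True if every edge has at least one endpoint in the cover."""
    return all(u in cover_left or v in cover_right for u, v in edges)


def certifies_optimality(edges: Set[Edge], matching: Iterable[Edge],
                         cover_left: Set[str], cover_right: Set[str]) -> bool:
    """Any matching is no larger than any cover, so equal sizes of a valid
    matching and a valid cover prove that both are optimal."""
    matching = list(matching)
    return (is_matching(edges, matching)
            and is_vertex_cover(edges, cover_left, cover_right)
            and len(matching) == len(cover_left) + len(cover_right))


if __name__ == "__main__":
    edges = {("a", "x"), ("a", "y"), ("b", "x")}
    print(certifies_optimality(edges, [("a", "y"), ("b", "x")], {"a"}, {"x"}))
    print(certifies_optimality(edges, [("a", "x")], {"a"}, {"x"}))
```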

    A.8 Proof of NP-hardness of the Pivot Selection Problem

    Theorem A.8. The Pivot Selection Problem is NP-hard.

    Proof. Let us recall the definition of the pivot selection problem as follows.

    Given a set of graphs $D = \{G_1, G_2, \ldots, G_m\}$ and the set of subgraphs $S = \{S \mid S \subseteq G, G \in D\}$, select one graph p as pivot from $D \cup S$ to minimize the cost function

    $$\max_{i=1\ldots m} \{d(\bar{p}, G_i)\},$$

    where $\bar{p}$ is the augmented pivot constructed from p as discussed in Section 4.

    We first prove that the simplest case of the pivot selection problem (i.e., when m = 2) is NP-hard. Then, the general pivot selection problem is readily NP-hard.

    When m = 2, $D = \{G_1, G_2\}$ and $S = \{S \mid S \subseteq G_1 \vee S \subseteq G_2\}$. We have (i) the graph $p = mcs(G_1, G_2)$ is an optimum pivot and (ii) all other optimum pivots are supergraphs of $mcs(G_1, G_2)$. We prove (i) and (ii) separately as follows:

    (i) $p = mcs(G_1, G_2)$ is an optimum pivot.

    We prove it by contradiction. Suppose there is another graph $p' \in D \cup S$ which is better than p, i.e.,

    $$\max\left\{1 - \frac{|mcs(p, G_1)|}{\max\{|\bar{p}|, |G_1|\}},\ 1 - \frac{|mcs(p, G_2)|}{\max\{|\bar{p}|, |G_2|\}}\right\} > \max\left\{1 - \frac{|mcs(p', G_1)|}{\max\{|\bar{p}'|, |G_1|\}},\ 1 - \frac{|mcs(p', G_2)|}{\max\{|\bar{p}'|, |G_2|\}}\right\}. \quad (14)$$

    Since $|\bar{p}| = |\bar{p}'| = \max\{|G_1|, |G_2|\}$ by Definition 4.1, (14) becomes

    $$\begin{aligned}
    &\max\{1 - |mcs(p, G_1)|,\ 1 - |mcs(p, G_2)|\} > \max\{1 - |mcs(p', G_1)|,\ 1 - |mcs(p', G_2)|\} \\
    \Leftrightarrow\ &\min\{|mcs(p, G_1)|,\ |mcs(p, G_2)|\} < \min\{|mcs(p', G_1)|,\ |mcs(p', G_2)|\}. \quad (15)
    \end{aligned}$$

    Since $p = mcs(G_1, G_2)$, (15) becomes

    $$|mcs(G_1, G_2)| < \min\{|mcs(p', G_1)|,\ |mcs(p', G_2)|\}. \quad (16)$$

    However, (16) is impossible. The reason is that $p'$ can only be $G_1$, $G_2$ or one of their subgraphs, which are analyzed one by one as follows.

    Case A. If $p' = G_1$, $\min\{|mcs(p', G_1)|, |mcs(p', G_2)|\} = |mcs(G_1, G_2)|$.

    Case B. If $p' = G_2$, similar to Case A.

    Case C. If $p'$ is a subgraph of $G_1$, $|mcs(p', G_1)| \ge |mcs(p', G_2)|$ and $|mcs(p', G_2)| \le |mcs(G_1, G_2)|$. Hence, $|mcs(G_1, G_2)| \ge \min\{|mcs(p', G_1)|, |mcs(p', G_2)|\}$.

    Case D. If $p'$ is a subgraph of $G_2$, similar to Case C.

    We have found a contradiction and hence $p = mcs(G_1, G_2)$ is an optimum pivot.

    (ii) All other optimum pivots are supergraphs of $mcs(G_1, G_2)$.

    We also prove it by contradiction. Suppose there is an optimum pivot $p'' \in D \cup S$ that does not contain $p = mcs(G_1, G_2)$. Then, we have the following:

    $$\begin{aligned}
    &|mcs(G_1, G_2)| > \min\{|mcs(p'', G_1)|,\ |mcs(p'', G_2)|\} \\
    \Leftrightarrow\ &\max\{1 - |mcs(p, G_1)|,\ 1 - |mcs(p, G_2)|\} < \max\{1 - |mcs(p'', G_1)|,\ 1 - |mcs(p'', G_2)|\} \\
    &\qquad \text{(by Definition 4.1, } |\bar{p}| = |\bar{p}''| = \max\{|G_1|, |G_2|\}\text{)} \\
    \Leftrightarrow\ &\max\left\{1 - \frac{|mcs(p, G_1)|}{\max\{|\bar{p}|, |G_1|\}},\ 1 - \frac{|mcs(p, G_2)|}{\max\{|\bar{p}|, |G_2|\}}\right\} < \max\left\{1 - \frac{|mcs(p'', G_1)|}{\max\{|\bar{p}''|, |G_1|\}},\ 1 - \frac{|mcs(p'', G_2)|}{\max\{|\bar{p}''|, |G_2|\}}\right\} \\
    &\qquad \text{(by Definition 3.1 and } mcs(\bar{p}, G_i) = mcs(p, G_i)\text{)} \\
    \Leftrightarrow\ &\max\{d(\bar{p}, G_1),\ d(\bar{p}, G_2)\} < \max\{d(\bar{p}'', G_1),\ d(\bar{p}'', G_2)\}.
    \end{aligned}$$

    This means $p''$ is not an optimum pivot, as it is worse than p. This is a contradiction, and hence all optimum pivots contain $mcs(G_1, G_2)$.

    Therefore, finding the optimum pivot for m = 2 is equivalent to computing the MCS between $G_1$ and $G_2$. Since the MCS computation is NP-hard, the pivot selection problem when m = 2 is NP-hard. Consequently, the general pivot selection problem is NP-hard. □

    A.9 Time Complexity Analysis of GMTree construction

    Proposition A.9. The time complexity of GMTree construction with the sampling-based pivot selection is

    $$O\big((\log_c |D| - 1) \cdot s \cdot (|G| + dist \cdot |D|)\big),$$

    where $dist = O\big(3^{|G'|/3}\,|G'|^{2.5}\,(|G'|+1)^{|VC_G|}\big)$ is the time complexity of VC-mcs in Section 7.1.1.

    Proof. Consider the ith level of GMTree. For any internal node v, we need to perform the pivot selection. We need $O(|G|)$ time to sample a pivot and $O(dist \cdot |D_v|)$ time to compute the distances between the pivot and all data graphs covered by v, denoted as $D_v$. We need s samples and hence $O(s \cdot (|G| + dist \cdot |D_v|))$ time for node v.

    Since the internal nodes of the ith level of GMTree together cover all data graphs in D, i.e., $\sum_v |D_v| = |D|$, all internal nodes of the ith level need $O(s \cdot (|G| + dist \cdot \sum_v |D_v|)) = O(s \cdot (|G| + dist \cdot |D|))$ time.

    Since the height of GMTree is $O(\log_c |D|)$, the total time complexity is $O((\log_c |D| - 1) \cdot s \cdot (|G| + dist \cdot |D|))$, as the leaf nodes do not need pivot selection. □
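To get a feel for the dominant term of this bound, the sketch below counts the pivot-to-graph distance computations performed during construction, under simplifying assumptions that are not stated in the paper: the tree is perfectly balanced with fanout c, its height is the smallest integer h with c^h >= |D|, and every level except the leaf level evaluates s sampled pivots against all |D| graphs. The function name and the example numbers are hypothetical.

```python
def distance_computations(db_size: int, fanout: int, samples: int) -> int:
    """Count distance computations: (height - 1) * samples * db_size."""
    if db_size < 1 or fanout < 2 or samples < 1:
        raise ValueError("need db_size >= 1, fanout >= 2, samples >= 1")
    # Smallest height h such that fanout**h >= db_size (integer arithmetic
    # avoids floating-point rounding in the logarithm).
    height, capacity = 0, 1
    while capacity < db_size:
        capacity *= fanout
        height += 1
    internal_levels = max(height - 1, 0)  # leaf level needs no pivot selection
    return internal_levels * samples * db_size


if __name__ == "__main__":
    for fanout in (4, 16):
        print(fanout, distance_computations(10_000, fanout, samples=5))
```

Under these assumptions a larger fanout lowers the number of levels and therefore the number of distance computations, each of which costs dist.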

    APPENDIX B

    SUBGRAPH SIMILARITY SEARCH ON GMTree

    In this appendix, we present the details of the algorithm of subgraph similarity search on GMTree.

    In a nutshell, given a query Q = (q, t), where q is the query graph and t is its radius, the search algorithm incorporates the pruning of Theorem 4.2 into a depth-first traversal of GMTree.

    Procedure similarity, shown in Fig. 18, presents the details of the search algorithm. The inputs are a database D, the GMTree index T and the query Q = (q, t). The output is the query result $RS = \{G \mid d_s(q, G) \le t, G \in D\}$. Procedure similarity traverses the GMTree index and filters out subspaces that certainly do not contain answers (Lines 08-11, using the pruning condition of Theorem 4.2). The subtrees indexing the remaining subspaces may contain answers and they are traversed (Line 10). When a leaf node is reached, the graphs that cannot be filtered by Lemma 4.1 are candidates (Line 04) and are checked individually to determine whether they are in fact answers (Line 05).
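The traversal described above follows the usual shape of a range query on a metric tree. The sketch below is not the paper's Procedure similarity: since Theorem 4.2 and Lemma 4.1 are not reproduced here, it substitutes the standard covering-radius rule (skip a subtree when d(q, pivot) - radius > t, which is safe under the triangle inequality), omits the leaf-level filter, and uses numbers with absolute difference as a toy metric in place of graphs. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Node:
    pivot: Any
    radius: float  # upper bound on the distance from pivot to anything below
    children: List["Node"] = field(default_factory=list)
    objects: List[Any] = field(default_factory=list)  # filled only at leaves


def similarity(node: Node, q: Any, t: float,
               dist: Callable[[Any, Any], float],
               result: Optional[list] = None) -> list:
    """Depth-first range search: collect every object o with dist(q, o) <= t."""
    if result is None:
        result = []
    if not node.children:
        # Leaf: verify each stored object directly.
        result.extend(obj for obj in node.objects if dist(q, obj) <= t)
        return result
    for child in node.children:
        # Generic metric pruning, standing in for Theorem 4.2.
        if dist(q, child.pivot) - child.radius > t:
            continue
        similarity(child, q, t, dist, result)
    return result


if __name__ == "__main__":
    leaf1 = Node(pivot=1.0, radius=1.0, objects=[0.0, 1.0, 2.0])
    leaf2 = Node(pivot=10.0, radius=1.0, objects=[9.0, 10.0, 11.0])
    root = Node(pivot=5.0, radius=6.0, children=[leaf1, leaf2])
    print(similarity(root, 1.5, 1.0, lambda a, b: abs(a - b)))
```

In the toy run the second leaf is skipped without touching its objects, which is the saving the pruning step is meant to deliver.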

    APPENDIX C

    QUERY WORKLOAD IN EXPERIMENTS

    Regarding AIDS, our query workload was modified from the benchmark queries Q4, Q8 and Q12 used in [1]. We made two modifications. (1) Qm, m = 4, 8, 12, means that the query graph has m vertices (versus m edges in [1]), since our subgraph distance is defined based on the number of graph vertices. (2) Qm is a mixture of queries of diverse densities. To obtain a query of a certain density, we randomly selected a query in Qm and added or removed vertices randomly until it satisfied the requirement. Each Qm contains 1,000 query graphs. Regarding SYN, we used the method of [50] to obtain a set of queries. For example, for Q12, we first randomly selected 1,000 graphs whose sizes were larger than 12 from SYN. Then, for each graph, we randomly removed vertices from it until its size was 12 and applied the same modifications as for AIDS. Following [2], the possible numbers of missing vertices s of q are 1, 2 and 3.
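The vertex-removal step used to derive a query of a given size can be sketched as below. This is only an illustration under assumptions: graphs are undirected adjacency sets keyed by integer vertex identifiers, vertices are removed uniformly at random, and no connectivity or density requirement is enforced, whereas the actual workload generation may apply further constraints. The function name is hypothetical.

```python
import random
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # vertex -> set of neighbours (undirected)


def remove_random_vertices(graph: Graph, target_size: int,
                           rng: random.Random) -> Graph:
    """Return a copy of graph shrunk to target_size vertices."""
    if not 0 <= target_size <= len(graph):
        raise ValueError("target_size must be between 0 and the graph size")
    g = {v: set(nbrs) for v, nbrs in graph.items()}  # leave the input intact
    while len(g) > target_size:
        victim = rng.choice(sorted(g))
        for nbr in g.pop(victim):
            if nbr in g:  # guards against self-loops
                g[nbr].discard(victim)
    return g


if __name__ == "__main__":
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    query = remove_random_vertices(path, 3, random.Random(7))
    print(sorted(query))
```

Passing an explicit random.Random instance keeps the generated workload reproducible across runs.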

    APPENDIX D

    ADDITIONAL EXPERIMENTS

    In this appendix, we first present the experimental results on the PUBCHEM data set. The results on PUBCHEM are similar to those on AIDS and SYN. We then present the performance of GMTree using data graphs as pivots on AIDS. We finally present the experimental results on s on AIDS and SYN.

    D.1 Experiments on PUBCHEM

    Experimental settings. The experiments on PUBCHEM were conducted with the same settings as the experiments on AIDS and SYN in Section 8.

    PUBCHEM data set. We followed the method of [50] to sample 10,000 graphs as our test data set PUBCHEM from the PubChem chemical structure database [10].

    Query workload. The same method used to obtain the queries on SYN was applied to PUBCHEM to obtain 1,000 queries. The possible numbers of missing vertices s of q were also set to 1, 2 and 3.

    Similar to the experiments on AIDS and SYN, we present the results of our experiments on PUBCHEM in terms of the comparison with the baseline and Grafil, the experiments on query size and the experiments on s.

    D.1.1 Comparison with the Baseline and Grafil

    Comparison of VO size. Figs. 19a and 19b present the sizes of the $VO_{index}$ and the $VO_{cand}$, respectively. From Fig. 19a, we note that the $VO_{index}$ of our GMTree is only 30 percent of that of Grafil and less than 50 percent of that of the baseline. From Fig. 19b, we note that GMTree produces a slightly larger $VO_{cand}$ than Grafil. However, GMTree produces a smaller number of large candidate graphs than Grafil, which affects the authentication efficiency significantly. Fig. 19b also shows that our GMTree produces a smaller $VO_{cand}$ than the baseline.

    Comparison of time. Figs. 19c and 19d present the SP's query time and the client's authentication time, respectively. From Fig. 19c, we note that GMTree's query at the SP is 3 times slower on average than Grafil. However, the client's authentication time is usually the bottleneck, and GMTree's authentication is 6 times faster (on average) than Grafil, as shown in Fig. 19d. From Figs. 19c and 19d, we also note that the SP's query time and the client's authentication time of GMTree are 30 and 60 percent smaller than those of the baseline, respectively.

    Fig. 18. Procedure similarity.

    TABLE 2. Characteristics of Data Sets

    D.1.2 Experiments on Query Size

    Query time. Fig. 20a presents the SP's query time versus the query size. From Fig. 20a, we note that the SP's query time increases significantly with the growth of the query size. This is consistent with the complexity of computing MCS. From Fig. 20a, we also note that the query time reduces as the fanout increases. This is because the larger the fanout, the fewer distance computations are involved in query processing.

    Authentication time. Fig. 20b presents the client's authentication time. Fig. 20b shows that the client's authentication time is much smaller than the query time at the SP. In particular, the authentication of Q12 at fanout 16 is three times faster than the query. From Fig. 20b, we also note that the client's authentication time increases as the query size increases. This is because the MCS computation in the client's authentication takes longer for larger queries. Since the number of distance computations reduces with the growth of the fanout (as discussed under query time above), the client's authentication time reduces with the growth of the fanout, as shown in Fig. 20b.

    Overall VO size. Fig. 20c presents the overall VO size. We observe that the overall VO size reduces with the growth of the query size. This is because larger queries produce fewer candidate graphs, which dominate the VO. We also note that the VO size increases with the growth of the fanout. This is consistent with the observations of MRTree and MBTree.

    D.1.3 Experiments on s

    Query time. Fig. 20d presents the SP's query time. From Fig. 20d, we note that the SP's query time increases as s increases. This is because the larger the query range, the more data graphs need to be tested. We also note that the query time reduces as the fanout increases. This is because the larger the fanout, the fewer pivots need to be accessed in the GMTree traversal.

    Authentication time. Fig. 20e presents the client's authentication time. From Fig. 20e, we observe that the client's authentication time increases with the growth of s. This is because the larger the query range, the more candidate graphs need to be authenticated. We also note that the client's authentication time reduces as the fanout increases. This is because the number of pivots involved in the client's authentication reduces with the growth of the fanout.

    VO index size. Fig. 20f presents the $VO_{index}$ size of GMTree. Fig. 20f shows that the $VO_{index}$ increases with the growth of s. This is because the number of accessed internal nodes increases as s increases. Fig. 20f also shows that the $VO_{index}$ size increases as the fanout increases. This is because the number of pruned subtrees increases as the fanout increases. However, the $VO_{index}$ is well controlled within 30K bytes.

    D.2 Additional Experiments on Using Data Graphs as Pivots on AIDS

    This experiment used the same settings on AIDS as presented in Section 8. By default, GMTree allows subgraphs to be used as pivots, as detailed in Section 7.2. In this experiment, we construct a specific GMTree that uses data graphs as pivots. The fanout of the GMTree is 16. Figs. 21a and 21b present the comparison of the query time and the total VO size, respectively. From Fig. 21a, we observe that the query time of our method is less than 20 percent of that of using data graphs as pivots. Moreover, the VO sizes of our method are 52, 56 and 72 percent of those of using data graphs as pivots at s = 1, 2 and 3, respectively, as shown in Fig. 21b. This experiment verifies the effectiveness of using subgraphs as pivots.

    Fig. 19. Comparison of GMTree with Grafil and the baseline method in terms of VO size and time on PUBCHEM.

    Fig. 20. Experiments on query size and s on PUBCHEM.


    D.3 Additional Experiments on s on AIDS and SYN

    This experiment used the same settings as presented in Section 8. Similar to the experiments in Section 8, this experiment presents the performance of GMTree in terms of the query time, the authentication time and the VO size, respectively. We use AIDS and SYN and we focus on Q8 in this experiment, since the other two query sets exhibit similar performance characteristics.

    Query time. Figs. 22a and 22b present the query time on AIDS and SYN, respectively. From Figs. 22a and 22b, we observe that the query time increases as s increases. This is because the larger s is, the more candidate graphs are produced. We note from Fig. 22a that the query time of s = 3 is slightly smaller than that of s = 2. This is because (i) for s = 2 and 3, the answer graphs of AIDS account for most of the candidate graphs, i.e., 74 and 69 percent for s = 2 and 3 at fanout 16, respectively; (ii) the answer graphs of s = 2 are all answers of s = 3; and (iii) if a graph is an answer of s = 2 or 3, our Enum method can detect it for s = 3 with a lower time cost than for s = 2, because a smaller number of subgraphs need to be determined. Figs. 22a and 22b also show that the query time reduces as the fanout increases. This is because the larger the fanout, the fewer pivots are accessed in the traversal. The time reduction on SYN is slight, as shown in Fig. 22b, because the graphs in SYN are dense and their MCS computation dominates the query time.

    Authentication time. Figs. 22c and 22d present the authentication time on AIDS and SYN, respectively. From Figs. 22c and 22d, we observe that the authentication time increases with the growth of s. This is because the larger s is, the more candidate graphs need to be verified. Because of the large percentage of answer graphs in the candidate set (see the discussion of query time above), s = 3 on AIDS takes a shorter authentication time than s = 2, as shown in Fig. 22c. Figs. 22c and 22d also show that the authentication time reduces with the growth of the fanout. This is because the number of pivots involved in authentication reduces as the fanout increases.

    VO index size. Figs. 22e and 22f present the size of the $VO_{index}$ on AIDS and SYN, respectively. Recall that the $VO_{index}$ mainly contains two parts: (i) the relevant content, MCS and hints of accessed internal nodes and (ii) the digests of pruned subtrees. Figs. 22e and 22f show that the $VO_{index}$ increases with the growth of s. This is because the number of accessed internal nodes increases as s increases. Figs. 22e and 22f also show that the $VO_{index}$ increases as the fanout increases. This is because the number of pruned subtrees increases as the fanout increases. However, the $VO_{index}$ is well controlled within 30K bytes. Fig. 22f shows that the $VO_{index}$ reduces slightly from s = 2 to 3 on SYN. This is because the number of pruned subtrees reduces more significantly than the number of accessed nodes increases. In particular, the number of pruned subtrees decreases by 16 percent from s = 2 to 3, but the number of accessed nodes increases by only 1 percent at fanout 16.

    ACKNOWLEDGMENTS

    This work was partly supported by the General Research Fund of Hong Kong (GRF210510) and the Natural Science Foundation of China (71171122). The work of Yun Peng was finished while he was pursuing his Ph.D. study in the Department of Computer Science at Hong Kong Baptist University.

    Fig. 21. Performance of using data graph as pivots on AIDS.

    Fig. 22. Effects of s on AIDS and SYN.

    REFERENCES

    [1] X. Yan, P. S. Yu, and J. Han, "Substructure similarity search in graph databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2005, pp. 766–777.

    [2] H. Shang, X. Lin, Y. Zhang, J. X. Yu, and W. Wang, "Connected substructure similarity search," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 903–914.

    [3] Y. Zhu, L. Qin, J. X. Yu, and H. Cheng, "Finding top-k similar graphs in graph databases," in Proc. 15th Int. Conf. Extending Database Technol., 2012, pp. 456–467.

    [4] Y. Yuan, G. Wang, L. Chen, and H. Wang, "Efficient subgraph similarity search on large probabilistic graph databases," Proc. VLDB Endow., vol. 5, no. 9, pp. 800–811, 2012.

    [5] D. W. Williams, J. Huan, and W. Wang, "Graph database indexing using structured graph decomposition," in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp. 976–985.

    [6] H. He and A. K. Singh, "Closure-tree: An index structure for graph queries," in Proc. 22nd Int. Conf. Data Eng., 2006, pp. 38–38.

    [7] S. Ranu and A. K. Singh, "Indexing and mining topological patterns for drug discovery," in Proc. 15th Int. Conf. Extending Database Technol., 2012, pp. 562–565.

    [8] T. Yuanyuan, R. C. McEachin, S. Carlos, D. J. States, and J. M. Patel, "Saga: A subgraph matching tool for biological graphs," Bioinformatics, vol. 23, no. 2, pp. 232–239, 2007.

    [9] M. Mongiovi, R. D. Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan, "Sigma: A set-cover-based inexact graph matching algorithm," J. Bioinformat. Computat. Biol., pp. 199–218, 2010.

    [10] NCBI. (2012). PubChem [Online]. Available: http://pubchem.ncbi.nlm.nih.gov

    [11] B. A. Grüning, C. Senger, A. Erxleben, S. Flemming, and S. Günther, "Compounds in literature (CIL): Screening for compounds and relatives in PubMed," Bioinformatics, vol. 27, pp. 1341–1342, 2011.

    [12] O2I. (2013) [Online]. Available: http://www.outsource2india.com

    [13] I. Outsourcing. (2013) [Online]. Available: http://www.informaticsoutsourcing.com

    [14] Silico. (2013) [Online]. Available: http://wbbiotech.nic.in/wbbiotech/writereaddata/silicogene.htm

    [15] Dataoutsourcingindia. (2013) [Online]. Available: http://www.dataoutsourcingindia.com

    [16] H. Pang and K. L. Tan, Query Answer Authentication. San Rafael, CA, USA: Morgan & Claypool Publishers, 2012.

    [17] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: An efficient access method for similarity search in metric spaces," in Proc. 23rd Int. Conf. Very Large Data Bases, 1997, pp. 426–435.

    [18] J. K. Uhlmann, "Satisfying general proximity/similarity queries with metric trees," Inform. Process. Lett., vol. 40, no. 4, pp. 175–179, 1991.

    [19] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric spaces," in Proc. 4th Annu. ACM-SIAM Symp. Discrete Algorithms, 1993, pp. 311–321.

    [20] F. N. Abu-Khzam, N. F. Samatova, M. A. Rizk, and M. A. Langston, "The maximum common subgraph problem: Faster solutions via vertex cover," in Proc. IEEE/ACS Int. Conf. Comput. Syst. Appl., 2007, pp. 367–373.

    [21] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan, "Verifying completeness of relational query results in data publishing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2005, pp. 407–418.

    [22] Y. Yang, S. Papadopoulos, D. Papadias, and G. Kollios, "Spatial outsourcing for location-based services," in Proc. IEEE 24th Int. Conf. Data Eng., 2008, pp. 1082–1091.

    [23] W. Cheng and K.-L. Tan, "Query assurance verification for outsourced multi-dimensional databases," J. Comput. Secur., vol. 17, no. 1, pp. 101–126, 2009.

    [24] Y. Yang, D. Papadias, S. Papadopoulos, and P. Kalnis, "Authenticated join processing in outsourced databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2009, pp. 5–18.

    [25] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, "Dynamic authenticated index structures for outsourced databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 121–132.

    [26] H. Pang and K. Mouratidis, "Authenticating the query results of text search engines," Proc. VLDB Endow., vol. 1, no. 1, pp. 126–137, 2008.

    [27] M. L. Yiu, Y. Lin, and K. Mouratidis, "Efficient verification of shortest path search via authenticated hints," in Proc. IEEE 26th Int. Conf. Data Eng., 2010, pp. 237–248.

    [28] M. Goodrich, R. Tamassia, N. Triandopoulos, and R. Cohen, "Authenticated data structures for graph and geometric searching," in Proc. RSA Conf. Cryptographers' Track, 2003, vol. 2612, pp. 295–313.

    [29] A. Kundu and E. Bertino, "How to authenticate graphs without leaking," in Proc. 13th Int. Conf. Extending Database Technol., 2010, pp. 609–620.

    [30] A. Kundu and E. Bertino, "Structural signatures for tree data structures," Proc. VLDB Endow., vol. 1, no. 1, pp. 138–150, 2008.

    [31] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. G. Stubblebine, "A general model for authenticated data structures," Algorithmica, vol. 39, no. 1, pp. 21–41, Jan. 2004.

    [32] H. Jiang, H. Wang, P. Yu, and S. Zhou, "GString: A novel approach for efficient search in graph databases," in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp. 566–575.

    [33] M. Mann, F. Nahar, H. Ekker, R. Backofen, P. Stadler, and C. Flamm, "Atom mapping with constraint programming," in Proc. 19th Int. Conf. Principles Practice Constraint Programm., 2013, vol. 8124, pp. 805–822.

    [34] Y. Tian and J. M. Patel, "Tale: A tool for approximate large graph matching," in Proc. IEEE 24th Int. Conf. Data Eng., 2008, pp. 963–972.

    [35] A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao, "Neighborhood based fast graph search in large networks," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2011, pp. 901–912.

    [36] X. Yan, P. S. Yu, and J. Han, "Graph indexing: A frequent structure-based approach," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2004, pp. 335–346.

    [37] H.-C. Ehrlich and M. Rarey, "Maximum common subgraph isomorphism algorithms and their applications in molecular science: A review," WIREs Comput. Mol. Sci., vol. 1, no. 1, pp. 68–79, 2011.

    [38] P. Vismara and B. Valery, "Finding maximum common connected subgraphs using clique detection or constraint satisfaction algorithms," in Proc. 2nd Int. Conf. Modelling, Comput. Optimization Inform. Syst. Manage. Sci., 2008, vol. 14, pp. 358–368.

    [39] J. W. Raymond and P. Willett, "Maximum common subgraph isomorphism algorithms for the matching of chemical structures," J. Comput.-Aided Mol. Des., vol. 16, pp. 521–533, 2002.

    [40] J. Lauri, "Subgraphs as a measure of similarity," in Structural Analysis of Complex Networks, New York, NY, USA: Springer, 2011, pp. 319–334.

    [41] H. Bunke and K. Shearer, "A graph distance metric based on the maximal common subgraph," Patt. Recog. Lett., vol. 19, no. 3-4, pp. 255–259, 1998.

    [42] J. T.-L. Wang, X. Wang, K.-I. Lin, D. Shasha, B. A. Shapiro, and K. Zhang, "Evaluating a class of distance-mapping algorithms for data mining and clustering," in Proc. 5th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1999, pp. 307–311.

    [43] R. C. Merkle, "A certified digital signature," in Proc. Adv. Cryptol., 1989, pp. 218–238.

    [44] G. R. Hjaltason and H. Samet, "Index-driven similarity search in metric spaces (survey article)," ACM Trans. Database Syst., vol. 28, no. 4, pp. 517–580, 2003.

    [45] K. Mouratidis, D. Sacharidis, and H. Pang, "Partially materialized digest scheme: An efficient verification method for outsourced databases," The VLDB J., vol. 18, no. 1, pp. 363–381, 2009.

    [46] E. B. Krissinel and K. Henrick, "Common subgraph isomorphism detection by backtracking search," Softw. Pract. Exp., vol. 34, no. 6, pp. 591–607, 2004.

    [47] SIVALab. (2013). VFLib [Online]. Available: http://www.cs.sunysb.edu/algorith/implement/vflib/implement.shtml

    [48] RCaH. (2013). CCP4 [Online]. Available: http://www.ccp4.ac.uk/index.php

    [49] J. Cheng, Y. Ke, W. Ng, and A. Lu, "Fg-index: Towards verification-free query processing on graph databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2007, pp. 857–872.

    [50] W.-S. Han, J. Lee, M.-D. Pham, and J. X. Yu, "iGraph: A framework for comparisons of disk-based graph indexing techniques," Proc. VLDB Endow., vol. 3, no. 12, pp. 449–459, Sep. 2010.

    Yun Peng received the BSci and MPhil degrees in computer science from Shandong University and the Harbin Institute of Technology (HIT) in 2006 and 2008, respectively. He is working toward the PhD degree in the Department of Computer Science, Hong Kong Baptist University. His research interests include graph-structured databases. He is a member of the Database Group at Hong Kong Baptist University (http://www.comp.hkbu.edu.hk/~db/).

    Zhe Fan received the BEng degree in computer science from the South China University of Technology in 2011. He is working toward the PhD degree in the Department of Computer Science, Hong Kong Baptist University. His research interests include graph-structured databases. He is a member of the Database Group at Hong Kong Baptist University (http://www.comp.hkbu.edu.hk/~db/).

    Byron Choi received the bachelor of engineering degree in computer engineering from the Hong Kong University of Science and Technology (HKUST) in 1999 and the MSE and PhD degrees in computer and information science from the University of Pennsylvania in 2002 and 2006, respectively. He is currently an assistant professor in the Department of Computer Science at the Hong Kong Baptist University.


    Jianliang Xu received the BEng degree in computer science and engineering from Zhejiang University, Hangzhou, China, in 1998 and the PhD degree in computer science from the Hong Kong University of Science and Technology in 2002. He is currently an associate professor in the Department of Computer Science, Hong Kong Baptist University. He held visiting positions at Pennsylvania State University and Fudan University.

    Sourav S. Bhowmick received the PhD degree in computer engineering in 2001. He is currently an associate professor in the School of Computer Engineering, Nanyang Technological University. He is a visiting associate professor at the Biological Engineering Division, Massachusetts Institute of Technology. He held the position of Singapore-MIT Alliance fellow in the Computation and Systems Biology program (05-12).