[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Searching Substructures with Superimposed Distance

Download [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Searching Substructures with Superimposed Distance

Post on 16-Apr-2017




2 download

Embed Size (px)


<ul><li><p>Searching Substructures with Superimposed Distance</p><p>Xifeng Yan Feida Zhu Jiawei Han Philip S. YuDepartment of Computer Science</p><p>University of Illinois at Urbana-Champaign</p><p>{xyan, feidazhu, hanj}@cs.uiuc.eduIBM T. J. Watson Research Center</p><p>psyu@us.ibm.com</p><p>Abstract</p><p>Efficient indexing techniques have been developed forthe exact and approximate substructure search in largescale graph databases. Unfortunately, the retrieval problemof structures with categorical or geometric distance con-straints is not solved yet. In this paper, we develop a methodcalled PIS (Partition-based Graph Index and Search) tosupport similarity search on substructures with superim-posed distance constraints. PIS selects discriminative frag-ments in a query graph and uses an index to prune thegraphs that violate the distance constraints. We identify acriterion to distinguish the selectivity of fragments in mul-tiple graphs and develop a partition method to obtain a setof highly selective fragments, which is able to improve thepruning performance. Experimental results show that PISis effective in processing real graph queries.</p><p>1 Introduction</p><p>With the increasing volume of graph databases, there isa strong need for fast graph search systems. Unfortunately,traditional indexing mechanisms can no longer addressthe challenging issues raised by complex graph databases:Given an exponential number of subgraphs in a complexstructure, we simply do not know what to index and how toindex. Interest has been growing in using unconventionalindexing techniques to tackle the search problem. Previousstudies focused on two kinds of graph search tasks: (1) theexact substructure (or full structure) search, and (2) the ap-proximate substructure (or full structure) search. The exactsubstructure search finds all of the graphs in a database that</p><p> The work was supported in part by the U.S. National Science Foun-dation NSF IIS-02-09199/IIS-03-8215, and an IBM Faculty Award. Anyopinions, findings, and conclusions or recommendations expressed in thispaper are those of the authors and do not necessarily reflect the views ofthe funding agencies.</p><p>contain the query structure, while the approximate substruc-ture search finds inexact matches in the database. Shasha etal. [12] proposed a path-based approach for the exact sub-structure search. Yan et al. [16] devised discriminative fre-quent structures and used them as indexing features. Holderet al. [7] adopted the minimum description length principlefor the approximate search. Raymond et al. [10] developeda three-tier algorithm for structure similarity search.</p><p>The two search scenarios mentioned so far are mainlyinvolved with the topological structure of graphs. However,there are other similarity search problems that are as im-portant, but which we are unable to handle yet. Let us firstcheck an example.</p><p>(a) 1H-Indene</p><p>O</p><p>O</p><p>OH</p><p>(b) Omephine</p><p>OH</p><p>O</p><p>OOH</p><p>H</p><p>(c) Digitoxigenin</p><p>Figure 1. A Chemical Database</p><p>Figure 2. A Query Graph</p><p>Example 1 Figure 1 shows a sample 2D chemical datasetconsisting of three molecules. Omephine in Figure 1(b)is an anticoagulant. Digitoxigenin in Figure 1(c) is well-known for its strong cardiotonic effect. Figure 2 showsa query graph. The three sample molecules contain thesame topological substructures as the query graph. How-ever, some of their edge labels are different from those inthe query graph. We define a mutation distance as the num-ber of times one has to relabel edges in one graph in order</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE </p></li><li><p>to get another graph. According to this definition, the mu-tation distance between 1H-Indene in Figure 1(a) and thequery graph is 1: we need to mutate one edge label in 1H-Indene so that it contains exactly the query structure, withexactly the same labels. If a user wants to find graphs whosemutation distance from the query graph is less than 2, thequery system should return the first and the third graphs inFigure 1.</p><p>The example above indicates that the substructure searchwith superimposed distance constraints (SSSD) is a generalgraph search problem. We formulate the SSSD problem asfollows: Given a set of graphs D = {G1, G2, . . . Gn} anda query graph Q, find all graphs G in D such that Q is iso-morphic to a subgraph Q of G and the optimal distancebetween Q and Q is less than a threshold . We can alsorephrase the SSSD problem as a constrained graph align-ment problem: We want to find an alignment of the querygraph in target graphs such that the minimum superimposeddistance between Q and its image in the target graphs is lessthan .</p><p>One solution to this new substructure search problem isto enumerate all of the isomorphic images of Q in the tar-get graphs and check their distance. This brute-force ap-proach may not work well since it is time-consuming tocheck each graph in a large scale database. In this paper,we develop an algorithm, called PIS (Partition-based GraphIndex and Search), to tackle the SSSD problem. Our strat-egy is to first build a fragment-based index on the graphdatabase, then partition each query graph into highly selec-tive fragments, use the index to efficiently identify the setof candidate graphs, and verify each candidate to find alleligible answers. Our approach has two advantages overthe brute-force method: (1) All operations except the candi-date verification are only involved with the index structure,thus avoiding one-by-one subgraph isomorphism computa-tion for graphs in the database. The isomorphism compu-tation is performed on the candidate graph set, which is ofa significantly smaller size. (2) The candidate set itself isidentified efficiently by pruning most invalid graphs withthe help of selective fragments and a distance lower boundintroduced in this paper.</p><p>We call the index strategy of PIS fragment-based in-dex. Graphs in the database are decomposed into fragments(probably overlapping) and indexed to facilitate similaritysearch. Fragments with the same topology can be indexedusing an R-tree [4, 11] or a metric-based index [6]. We ob-served that, for many distance measures, the superimposeddistance between a query graph and a target graph is lower-bounded by the sum of distances between their correspond-ing indexed non-overlapping fragments.</p><p>This lower bound leads to efficient pruning of most in-valid graphs in the database. A query graph is partitionedinto fragments according to the index structure. Since there</p><p>are multiple ways to partition a query graph, it is importantto choose the optimal one that achieves the best pruning per-formance. We identify the criterion of an optimal partitionthat should give a set of non-overlapping fragments withthe highest selectivity. This optimization problem is, as wewill later prove, equivalent in computational complexity toa well-known NP-hard problem: maximum weighted inde-pendent set (MWIS). Although theoretical results show thatMWIS does not have any polynomial approximation solu-tion, the heuristic greedy algorithm we developed workswell for real chemical datasets. We call the overall searchstrategy partition-based search.</p><p>Our contribution in this study is an examination of anew search problem in graph databases and the proposalof a partition-based index and search algorithm. The devel-opment of our method exposes new database managementchallenges in complicated graph databases.</p><p>2 Preliminaries</p><p>Graphs with attributes are called labeled graphs. Agraph G is a subgraph of G if there exists a subgraph iso-morphism from G to G, denoted by G G. G is called asupergraph of G. The skeleton (without labels) of a graph iscalled its structure or topology. The definition of subgraphisomorphism in this paper only considers the structure of agraph.</p><p>If G is a subgraph of G and vice versa, we say G isisomorphic to G, written G = G. If G is a subgraph of G</p><p>and also has the same label information with G, we say Gis a subgraph of G with reserved label information, writtenG G.</p><p>Q</p><p>G</p><p>Q'</p><p>Figure 3. Superposition</p><p>Example 2 Figure 3 shows a superposition between thequery graph in Figure 2 (Q) and the first graph in Figure1 (G). Q is the image of Q in G. As one can see, Q Galthough Q G.</p><p>Subgraph isomorphism only gives the structural compar-ison between two graphs. The label information is also crit-ical in determining the characteristics of graphs. Thus, weneed a distance measure to differentiate labeled graphs with</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE </p></li><li><p>the same structure. This kind of distance is termed super-imposed distance, a distance measure applied to two super-imposed graphs. Here we introduce two commonly usedmeasures: Mutation Distance (MD) and Linear MutationDistance (LD).</p><p>Suppose we have two isomorphic labeled graphs, G andG. We can build a superposition from G to G, which mapseach vertex of G to a unique vertex in G. The mutationdistance between G and G is defined as follows,</p><p>MD =</p><p>v=f(v)</p><p>D(l(v), l(v)) +</p><p>e=f(e)</p><p>D(l(e), l(e))</p><p>where D is a mutation score matrix, l is a label function,and f is an isomorphic function, f : V (G) V (G). Themutation score matrix includes the distance score betweena mutation from one label to another label. If the labels arenumeric, a linear distance function may be appropriate fordistance measure, e.g.,</p><p>LD =</p><p>v=f(v)</p><p>|w(v) w(v)|+</p><p>e=f(e)</p><p>|w(e) w(e)|</p><p>where w and w are the weight functions of G and G.Since multiple superpositions may exist for two isomor-</p><p>phic graphs, we usually select the best superposition thathas the smallest distance.</p><p>Definition 1 (Minimum Superimposed Distance) Giventwo graphs, Q and G, let M be the set of subgraphs in Gthat are isomorphic to Q, M = {Q|Q G Q = Q}.The minimum superimposed distance between Q and G isthe minimum distance between Q and Q in M ,</p><p>d(Q,G) = minQM</p><p>d(Q,Q), (1)</p><p>where d(Q,Q) is a distance function of two isomorphicgraphs Q and Q.</p><p>Definition 2 (Substructure Search with SuperimposedDistance (SSSD)) Given a set of graphs D = {G1, G2,. . . Gn} and a query graph Q, SSSD is to find all Gi Dsuch that d(Q,Gi) .</p><p>A naive solution is to scan the whole database and checkwhether a target graph has a superposition with a dis-tance less than the threshold. This solution is not scal-able. A better solution, which we call topoPrune, gets rid ofgraphs that do not contain the query structure first, and thenchecks the remaining candidates to find the qualified graphs.topoPrune is more efficient than the naive approach. How-ever, it still suffers huge computational costs since it has toenumerate the superpositions of a query graph in a large setof candidate graphs. If most of the candidate graphs are notqualified, topoPrune could be very inefficient.</p><p>3 Framework of PIS</p><p>Besides structure pruning, we can also utilize the super-imposed distance constraint to prune candidates. In PIS, wepartition a query graph Q into non-overlapping fragmentsg1, g2, ..., and gn, and use them to do pruning. If a distancefunction satisfies the following inequality,</p><p>ni=1</p><p>d(gi, G) d(Q,G), (2)</p><p>we can set the lower bound of the superimposed distancebetween Q and G by the superimposed distance betweengi and G. Whenever</p><p>ni=1 d(gi, G) &gt; , we can safely</p><p>remove G from the answer set. For this kind of pruning,we only need two operations: (1) enumerate fragments inthe query graph and (2) search the index to calculate thesuperimposed distance d(gi, G). We have</p><p>d(gi, G) = mingGg=gi</p><p>d(gi, g). (3)</p><p>Therefore, if we index all of the fragments in G that have thesame topology with gi, we can calculate d(gi, G) throughthe index directly. This kind of pruning needs to check theindex only, not the original database.</p><p>In summary, we are able to use the lower bound given inEq. (2) to prune more unqualified graphs by indexing frag-ments in graph databases. This method consists of two com-ponents: fragment-based index and partition-based search.We first formalize the definition of graph partition.</p><p>Definition 3 (Graph Partition) Given a graph Q =(V,E), a partition of G is a set of subgraphs{g1, g2, . . . , gn} such that V (gi) V and V (gi)V (gj) = for any i = j.</p><p>Note that we do not require thatn</p><p>i=1 V (gi) be equal toV in the above definition. That is, the subgraphs may notfully cover the original graph. Interestingly, many distancefunctions hold the inequality in Eq. (2) for a given partition.Both distances we mentioned, mutation distance and linearmutation distance, have this inequality. We leave the proofto readers.</p><p>In Eq. (3), if a fragment g is indexed, then all of the frag-ments having the same topology as g should be indexed,since the right side of Eq. (3) has to access all of the super-positions of g in G.</p><p>Definition 4 (Structural Equivalence Class) Labeledgraphs G and G belong to the same equivalence class ifand only if G = G. The structural equivalence class of Gis written [G].</p><p>We formulate the framework of PIS (partition-basedgraph index and search) in the following three steps.</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE </p></li><li><p>1. Fragment-based Index: We select a set of structuresas indexing features according to the criteria proposedin GraphGrep [12] or gIndex [16]. For each structure f(f is a bare structure without any label), we enumerateall of the fragments in the database that belong to [f ]and build an index in which a range query d(g, g) can be evaluated efficiently, where g and g are labeledgraphs and their skeleton is f .</p><p>2. Partition-based Search: For a given query graph Q,we partition it into a set of indexed non-overlappingfragments, g1, g2, . . . , gn. For each fragment gi, wefind its equivalence class in the index and submit arange query d(gi, g) to find all of the fragmentsg in the database that meet the superimposed distancethreshold. We then sum up their distance to obtainthe lower bound of d(Q,G) for each graph G in thedatabase,</p><p>ni=1</p><p>d(gi, G) =n</p><p>i=1</p><p>mingGg=gi</p><p>d(gi, g). (4)</p><p>If G does not have any subgraph g such that g = gi,we drop G from the answer set (structure violation).If the lower bound in Eq. (4) is greater than , wealso drop G from the answer set (superimposed dis-tance violation). The resulting candidate answer set,CQ, will include all of the graphs that pass the filter-ing: CQ = {G|G D </p><p>ni=1 d(gi, G) }.</p><p>3. Candidate Verification: We calculate the real super-imposed distance between Q and the candidate graphsreturned in the second step, and then remove graphsthat do not satisfy the distance threshold.</p><p>4 Fragment-based Index</p><p>In this section, we present the details of constructing afragment-based index, the first step in the framework of PIS.The index construction has two steps. In the first step, weselect structures as features. These structures do not includelabel information. In the second step, any fragment in thedatabase that has the selected structure is identified and in-dexed. That is, for each selected structure f , we enumerateall of the fragments in the graph data...</p></li></ul>


View more >