conference: multi-constrained graph pattern matching in large-scale contextual social...

Multi-Constrained Graph Pattern Matching inLarge-Scale Contextual Social Graphs

Guanfeng Liu1,2, Kai Zheng3, Yan Wang4, Mehmet A. Orgun4, An Liu1,2, Lei Zhao1,2 and Xiaofang Zhou1,3

1School of Computer Science and Technology, Soochow University, Suzhou 215006, China2Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China

3School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane 4072, Australia.4Department of Computing, Macquarie University, NSW 2109, Australia.

{gfliu, anliu, zhaol}@suda.edu.cn; {kevinz, zxf}@itee.uq.edu.au; {yan.wang, mehmet.orgun}@mq.edu.au

Abstract—Graph Pattern Matching (GPM) plays a significantrole in social network analysis, which has been widely used in,for example, experts finding, social community mining and socialposition detection. Given a pattern graph GQ and a data graphGD , a GPM algorithm finds those subgraphs, GM , that matchGQ in GD . However, the existing GPM methods do not considerthe multiple constraints on edges in GQ, which are commonlyexist in various applications such as, crowdsourcing travel, socialnetwork based e-commerce and study group selection, etc.

In this paper, we first conceptually extend Bounded Simulationto Multi-Constrained Simulation (MCS), and propose a novelNP-Complete Multi-Constrained Graph Pattern Matching (MC-GPM) problem. Then, to address the efficiency issue in large-scale MC-GPM, we propose a new concept called Strong SocialComponent (SSC), consisting of participants with strong socialconnections. We also propose an approach to identify SSCs, andpropose a novel index method and a graph compression methodfor SSC. Moreover, we devise a heuristic algorithm to identifyMC-GPM results effectively and efficiently without decompress-ing graphs. An extensive empirical study on five real-world large-scale social graphs has demonstrated the effectiveness, efficiencyand scalability of our approach.

I. INTRODUCTION

A. Background

Online Social Networks (OSNs) have attracted billionsof users worldwide, which have provided rich informationfor users to perform various tasks such as detecting socialpositions [1], finding experts [2], [3] and making travel plans[2]. These tasks are typically conducted by applying GraphPattern Matching (GPM) which has been widely used insocial network analysis. GPM is typically defined in terms ofsubgraph isomorphism, in which, given a data graph GD anda pattern graph GQ as input, it answers whether GD containsa subgraph that is isomorphic to GQ.Example 1: Fig. 1 contains three query pattern graphs GQ1,GQ2 and GQ3, and four data graphs GD1, GD2, GD3 andGD4. GQ1 is isomorphic to GD1, but is not isomorphic toGD2, GD3 or GD4.

However, as shown in [2], the conventional subgraph iso-morphism is too strictly defined to find useful patterns in thereal-world social graphs. Moreover, due to the NP-complete

A

B

C

A

B

C

2

3 *

A

D

C

B

A

B

C

A

B

C

B

GQ1 GQ2 GD1 GD2 GD3

A

B

C

a>m

*

GQ3

A

D

C

B

b<n2

3

a

b

a

b

a b

a

b

GD4

C

Fig. 1. Pattern graphs and data graphs

time complexity, it is hard to apply isomorphism test to large-scale social graphs.

In order to address the above-mentioned issues in subgraphisomorphism, graph simulation [4] has been proposed whichhas less restrictions but more capacity to extract more use-ful subgraphs with better efficiency. In contrast to subgraphisomorphism, graph simulation supports simulation relationsinstead of exact match of vertices. Recalling Fig. 1, GQ1 is notisomorphic to GD2, but it matches GD2 via graph simulationas B and C in GQ1 can be simulated to one out of B and Cin GD2 respectively. Graph simulation has been widely usedin structural index and website classification, but it still needsto perform edge-to-edge mapping (e.g., the edge (A,C) and(B,C) in the case of graph simulation in Fig. 1). This is stilltoo strict for some real applications that utilize the connectivitybetween vertex pairs via a path with arbitrary or pre-definedlengths [5], [6] (e.g., path lengths 2 and 3 in GQ2).

To address this issue in graph simulation, Fan et al., [2]proposed bounded simulation, wherein each vertex has a labelof a category, and each edge is labeled with either a constantk or a ∗. Graph matching based on bounded simulation mapsedges in a pattern graph to paths within bounded lengths ina data graph, instead of edge-to-edge mappings in subgraphisomorphism and graph simulation [2]. As shown in theexample in Fig. 1, GQ2 matches GD1, GD2 and GD3 viabounded simulations. Based on bounded simulation, some

978-1-4799-7964-6/15/$31.00 © 2015 IEEE ICDE Conference 2015351

studies have been conducted to restrict the simulation structure[7], match with different edge types [8], and improve theefficiency of bounded simulation based GPM [9], [10].

B. Problem

The bounded simulation based GPM only considers thebounded path length of an edge when matching the edges,which greedily finds the subgraph that has the minimal di-ameter. However, some graphs have attributes on vertices andedges, like the Contextual Social Graph (CSG) [11], whereeach vertex has the social role information, and each edgehas the social relationships and social trust information. In avariety of applications in social networks, e.g., crowd-sourcingtravel [12], study group selection (classroomsalon.com), andsocial network based e-commerce [11], people are willing toincorporate multiple social contexts associated with edges ina CSG, which have significant influence on people’s collabo-rations and decision making [13].Example 2: Consider GQ3 and GD4 in Fig. 1, where inaddition to the traditional graph structure, each edge in GD4

is associated with two attributes: a and b that can be socialtrust and social relationships between people in CSGs. Inreal applications on CSGs, the constraints of social trust (e.g.,a > m in GQ3) and social relationships (e.g., b < n in GQ3)can be specified between, for example, a Project Manager Aand an Assistant Manager B to find a trustworthy team, orbetween two customers A and B to help retailers find loyalcustomers in social network based CRM (Customer RelationManagement) systems. In such multi-constrained graph patternmatching, a path in GD4 is a match of (A,B) in GQ3, if thepath length is no greater than 2, and the aggregated values ofa and b can satisfy the multiple constraints, i.e., a > m andb < n, (m and n are constants).

This example illustrates that a new type of Multi-Constrained GPM (MC-GPM) is an important issue in socialgraphs. MC-GPM subsumes the classical NP-Complete multi-constrained path selection problem [14], and thus is NP-Complete. So, the main challenge of our work is to designnovel indexing structures and approximation techniques toeffectively and efficiently support MC-GPM queries. To thebest of our knowledge, all the existing methods of GPM do notsupport MC-GPM queires. Our contributions are summarizedas follows.

C. Contributions

(1) We propose a new notion of Multiple-Constrainted Sim-ulation (MCS) by extending bounded simulation. In contrastto its traditional counterpart, the MCS based MC-GPM is tofind a graph pattern matching result, where each edge of thematching graph satisfies both the bounded path length and themultiple constraints on edges, which can better support manyemerging social network based applications.

(2) We propose a concept called Strong Social Component(SSC), which consists of participants who have strong socialconnections, and propose an approach to identify SSCs. Asthe social connections in SSC usually stay stable in a very

long period of time [15], we propose a novel index structureand a graph compression method for SSC with polynomialtime complexity. Our method can match the pattern graphwithout any graph decompression, which can reduce storageconsumption and improve efficiency.

(3) Based on the indices and compressed graph, we pro-pose a Heuristic Algorithm for MC-GPM, called HAMC. InHAMC, a novel objective function is proposed to test if anedge matching is included in a data graph. HAMC has thetime complexity of O(EQNDlogND + EQED), where NDand ED are the number of vertices and edges respectively inthe data graph, and EQ is the number of edges in the querygraph.

(4) An extensive empirical study based on five large-scalereal-world social graphs has demonstrated the effectiveness,efficiency and scalability of our proposed algorithms.

The rest of this paper is organized as follows. We firstreview the related work on GPM in Section II. Then we intro-duce the necessary concepts and formulate the focal problemof this paper in Section III. The strong social componentidentification is presented in Section IV, followed by thegraph compression methods and index structures proposed inSection V and Section VI respectively. Section VII presentsour proposed MC-GPM algorithm, Section VIII reports theexperimental observations, and Section IX concludes the pa-per.

II. RELATED WORK

In the literature, GPM methods have already been widelystudied, which can be categorized into (1) the isomorphism-based GPM to match each of the vertices and edges in GQexactly, and (2) the simulation-based GPM to simulate patternmatching of the vertices and edges in GQ. Below we analyzethem in detail.

Isomorphism-Based GPM: In the studies of isomorphism-based GPM, Zou et al., [6] index the shortest path lengthbetween any two vertices in a data graph to support therequirement of the connection length. In addition, Sun et al.,[16] adopt a graph exploration method to improve the effi-ciency of subgraph joint processing in graph pattern matching.Furthermore, Cheng et al., [17] propose a top-k graph patternmatching approach which builds up a spanning tree of a cyclicgraph query, and rank the answers by the sum of the edgelengths of an answer.

Given a query graph GQ, usually it is not realistic to finda isomorphic subgraph as the strict matching. Yan et al.,[18] propose a similarity based method, where a distance iscomputed based on the total number of the matching edgesbetween a query graph and the data graph. If the distance isless than a specified threshold, then an answer is returned.In addition, Shang et al., [19] further improve the similarity-based method by indexing graphs according to their similarityto the features in GQ, which can prune those non-promisinggraphs. Furthermore, Zhu et al., [20] divide a GD into severalgroups of similar graphs and index these graphs to supporteffective non-promising graph pruning.

352

In order to improve the efficiency of the GPM in largedata graphs, some paralleled and distributed GPM methodshave been proposed recently. In [21], [22] a pattern graphis decomposed into several small patterns, and they find thesubgraph matchings for each small pattern graph and jointhe intermediate results finally. In addition, in [23], they findmultiple GPM answers based on a paralleled framework.Furthermore, in [24], a large data graph is decomposed intoseveral small fractions based on a distributed GPM method,and then they predict the probability of each fraction to havinga GPM answer.

Simulation-Based GPM: As isomorphism-based GPM isstill too strict for some applications, based on graph simulation[4], Fan et al., [2] propose bounded simulation based GPMmethod, where the label of each vertex is not unique anda bounded length can be specified on the edges in a querygraph. In bounded simulation, it is not necessary to exactlymatch each vertex and each edge as the subgraph isomorphism,instead, matching any vertex having the same labels and thepath whose length is not greater than the bounded length ina data graph. This type of GPM can be conducted in cubic-time. In addition, based on bounded simulation, Ma et al., [7]propose strong simulation to find a small set of matches whosetopologies are more similar with the query graph than boundedsimulation. Moreover, Fan et al., [8] further consider therequirements of different types of edges in GPM. Furthermore,in order to improve efficiency, Fan et al., [10] later propose agraph pattern view based bounded simulation, where a set ofviews are defined in a data graph, and an estimation methodhas been proposed to predict which view can be used to answera specific query. Moreover, they propose a resource-boundedquery [25] where a fraction of a data graph that has a highprobability of containing the query graph is extracted. Finally,Fan et al., [9] propose a method to find the top-k matching ofthe specific vertex patterns, where some patterns that containimportant vertices and edges have high priority to be matchedin a query graph.

Summary: The isomorphism-based GPM are importantin many applications, e.g., 3D object matching [26] andprotein structure matching [27]. The index, and the paralleledand distributed methods are effective ways to improve theefficiency of GPM. However, such GPM is still sufferingexpensive computation cost as it is NP-Complete. Moreover,although the similarity-based method enhance the probabilityof returning an answer, it still too strict to use in some applica-tions, e.g., finding social experts [2] and project organization[9]. Simulation-based GPM methods relax the restrictionsof subgraph isomorphism and thus well address the GPMin these applications. But all the existing methods do notconsider the multiple constraints on edges in a graph query.Such a query is popular and fundamental in many socialnetwork based applications, like crowd-sourcing travel [12],study group selection (classroomsalon.com), and social net-work based e-commerce [11]. Therefore, the existing methodscannot support the significant MC-GPM in many applications.

A B

C2

D

C1

F

GD5

TDi

AC1= 0.7

rAC1 = 0.6

ρDi

C1= 0.7

ρDi

A = 0.95 ρDi

B = 0.9

ρDi

C2= 0.9

ρDi

F = 0.4ρDi

D = 0.8

rC1C2 = 0.9

TDi

C1C2= 0.8

TDi

FD = 0.4

rFD = 0.6

PMSPM

PG PG

AMAM

TDi

BC2= 0.7

rBC2 = 0.8

TDi

C1D= 0.9

rC1D = 0.9

TDi

C2F= 0.8

rC2F = 0.7

A B

C

D

1

2 3

*

GQ4

PG

AM

PMSPM

λDiρ = 0.7

λDiρ = 0.7

λDi

T = 0.5

λr = 0.7

λr = 0.6

λDi

T = 0.5

Fig. 2. Multiple-Constrained GPM in CSGs

III. PRELIMINARIES

A. Data Graph

1) Contextual Social Graph: A Contextual Social Graph(CSG) [11] is a labeled directed graph G = (V,E,LV, LE),where• V is a set of vertices;• E is a set of edges, and (vi, vj) ∈ E denotes a directed

edge from vertex vi to vertex vj ;• LV is a function defined on V such that for each vertexv in V , LV (v) is a set of labels for v. Intuitively, thevertex labels may for example represent social roles in aspecific domain;

• LE is a function defined on E such that for each link(vi, vj) in E, LE(vi, vj) is a set of labels for (vi, vj), likesocial relationships and social trust in a specific domain.

Example 3: We now give a specific example of a CSG from[11] to show how the vertex and edge labels can be definedand used. GD5 in Fig. 2 is a CSG, where each vertex vi ∈ V isassociated with a role impact factor, denoted as ρDi

vi ∈ [0, 1],to illustrate the impact of participant vi in domain i, which isdetermined by the expertise of vi. ρDi

vi = 1 indicates that vi isa domain expert in domain i while ρDi

vi = 0 indicates that vihas no knowledge in that domain. Moreover, each edge (vi, vj)is associated with social trust, denoted as TDi

vi,vj ∈ [0, 1], andsocial intimacy degree, denoted as rvi,vj ∈ [0, 1], to illustratetrust and intimacy social relationships between participants.T, r and ρ are called social impact factors, whose values canbe extracted by using the data mining techniques [28], [29],[30].

Based on the theories in Social Psychology [15], we adoptthe multiplication method to aggregate T and r values of apath, and adopt the average method to aggregate the ρ valuesof the vertices in a path. The details of the aggregation methodhas been discussed in [11]. The aggregated values of a pathp in domain i is denoted as ASDi(p) = {ATDi(p), Ar(p),AρDi(p)}. If each of the aggregated social impact factor valueof p is greater than the corresponding one of path p′, then pdominates p′ in domain i, which is denoted as p >Di

DOM p′.

B. Pattern Graph

A Pattern Graph is defined as GQ = (Vq, Eq, fv, fe, se),where

353

• Vq and Ep are the set of vertices and the set of directededges, respectively;

• fv is a function defined on Vq such that for each vertexu, fv(u) is the vertex label of u;

• fe is a function defined on Eq such that for each edge(u, u′); fe(u, u′) is the bounded length of (u, u′) whichis either a positive integer k or a symbol *;

• se is a function defined on Eq such that for each edge(u, u′), se(u, u′) is the multiple constraints of the aggre-gated social impact factor values of (u, u′) representedby, λDi

T , λr and λDiρ , which are in the scope [0,1];

From GQ4 in Fig. 2, we can see the constraints, i.e., λDi

T , λrand λDi

ρ on edges (B,C) and (B,D), respectively.

C. Multi-Constrained Graph Pattern Matching (MC-GPM)

In this section, we introduce MC-GPM via Multi-Constrained Simulation (MCS) in CSGs.

Bounded Simulation [2]: Given a data graph G =(V,E,LV ) and a pattern graph Q = (Vq, Eq, fv, fe), a datagraph G matches a pattern graph Q via bounded simula-tion, denoted as Q EB

simG, if there exists a binary relationS ⊆ VQ × V such that• for all u ∈ VQ, there exists v ∈ V such that (u, v) ∈ S;• for each pair (u, v) ∈ S,

– u ∼ v, and– for each edge (u, u′) in EQ, there exists a nonempty

path p from v to v′ in G such that (u′, v′) ∈ S, andSlen(p) 6 k if fe(u, u′) = k.

Then S is a match in G for Q via bounded simulation.Multi-Constrained Simulation (MCS): MCS is a non-

trivial extension of bounded simulation. Consider a datagraph GD = (V,E, LV, LE) and a pattern graph GQ =(Vq, Eq, fv, fe, se). GD matches GQ via MCS, denoted byGQ EMC

simGD, if there exists a binary relation S ⊆ VQ × Vsuch that• for all u ∈ VQ,there exists v ∈ V such that (u, v) ∈ S;• for each pair (u, v) ∈ S,

– u ∼ v, and– for each edge (u, u′) in EQ, there exists a nonempty

path p from v to v′ in G such that (u′, v′) ∈ S, andSlen(p) 6 k, if fe(u, u′) = k;

– ATDi(v, v′) > λT , Ar(v, v′) > λr andAρDi(v, v′) > λρ, if se(u, u′) = {λT , λr, λρ};

Then S is a match in GD for GQ via multi-constrainedsimulation.

If an edge (u, u′) in GQ is mapped to a nonempty path pfrom v to v′ in GD based on MCS, then (v, v′) is an edgepattern matching (u, u′) in GD (denoted as (v, v′, GD) '(u, u′, GQ)), and (u, v) ∈ S. If for each edge in GQ, there isa matching edge in GD, then an MC-GPM answer is returned(denoted as GM = (V,E, LV, LE), GM ⊆ GD).Example 4: Suppose GQ4 in Fig. 2 is a query given by auser to select a group of participants from a CSG to finish aproject. Based on data graph GD5, we can get the MC-GPManswer as (1) vertex SPM (i.e., A, a Senior Project Manager)

and vertex PM (i.e., B, a Project Manager) in GQ4 can bemapped to the same vertices SPM and PM in GD5, whichis part of subgraph isomorphism; (2) the vertex AM (i.e., C,an Assistant Manager) in GQ4 corresponds to multiple AMs(i.e., C1 and C2) in GD5. This relationship can be captured byusing graph simulation; (3) the edge with a bounded length inGQ4 can be mapped to a path length in GD5 by using boundedsimulation; and (4) the edge with multiple constraints can bemapped to the aggregated social impact factor values of a pathin GD5 by using multi-constrained simulation.

D. A Baseline Algorithm

As there are no existing methods for MC-GPM, and MCSis an extension of bounded simulation, in this section, weintroduce a baseline method (named as BaseLine) that extendsthe GPM algorithm via bounded simulation. The details ofBaseLine are as follows,

Step 1: For each constrained edge in GQ, compute themapped shortest path length, Slen(p), in GD.

Step 2: If Slen(p) is no greater than the bounded length(denoted as Blen) specified on the corresponding edge, i.e.,Slen(p) 6 Blen(p), investigate if the aggregated socialimpact factor values of p can satisfy the corresponding con-straints.• If each of the constraints can be satisfied, an edge

matching result is returned.• Otherwise, compute the second shortest path length based

on the top-k shortest path selection algorithm [31] and goto Step 2.

Step 3: After investigating all the edges in GQ, if there isan edge matching for each of the edge in GQ, return an answerby the exploration based graph pattern matching method (seedetails in Section VII-B). Otherwise, there are no GPM answersincluded in GD.Example 5: Consider the case of MC-GPM shown inFig. 2, BaseLine can return the edge mapping answer,(A,C1, GD5) ' (A,C,GQ4), and (A,C1, D,GD5) '(A,D,GQ4). Then it returns the edge mapping answerof (B,C) in GQ4 are (B,C2, GD5) ' (B,C) and(B,C2, C1, GD5) ' (B,C,GQ4), and the edge mappingof (B,D) is (B,D,GQ4) ' (B,C2, C1, GD5) because theconstraints cannot be satisfied by another path via B,C2, Fand D, i.e., ATDi(P(B,C2,F,D)) < 0.6. As there is atleast an edge matching result for each edge in GQ4, anMC-GPM answer can be returned by using the explo-ration based GPM method. The matching graph is GM =(V,E, LV, LE), where V = {A,B,C1, C2, D} and E ={(A,C1), (B,C2), (C2, C1), (C1, D)}.

Suppose for each edge in GQ, BaseLine needs to performthe Dijkstra’s algorithm in GD for N times. Then the timecomplexity of BaseLine is O(EQNNDlogND +NEQED).

IV. STRONG SOCIAL COMPONENT

In order to enhance the efficiency and effectiveness of ourMC-GPM method, in this section, we propose a strong socialcomponent identification method. In graph theory [32], a graph

354

B

E

C

FD

G H

strong social component

in domain !strong social component

in domain "

TDj

CF = 0.95

TDj

FH = 0.9TDj

EH = 0.98

TDi

BD = 0.89

TDi

DG = 0.9

rBD = 0.88

rDG = 0.98

rCE = 0.92

rEH = 0.96

rCF = 0.9

rFH = 0.9

ρDi

B = 0.9

ρDi

D = 0.93

ρDi

G = 0.89

ρDj

E = 0.9

ρDj

C = 0.98

ρDj

F = 0.82

ρDj

H = 0.86

TDj

CE = 0.98

Fig. 3. An example of strong social component

G is said to be strongly connected if every vertex is reachablefrom every other vertex, and a strongly connected componentof a directed graph G is a subgraph that is strongly connected.Based on the definition of the strong connection, we give thedefinition of Strong Social Component as below.Definition 1: Strong Social Component. In a CSG, a sub-graph is said to be socially strongly connected if each vertexassociated with a high role impact factor value in a specificdomain is connected with the edges associated with intimatesocial relationships and strong social trust relationships. AStrong Social Component (SSC) is a subgraph that is sociallystrongly connected.Example 6: In an SSC, suppose the T, r and ρ valuesassociated with each of the vertices and edges should begreater than 0.8. Fig. 3 depicts a graph that has two strongsocial components in domain i and domain j respectively,where the T, r and ρ values are greater than 0.8.

Based on the theories in Social Psychology [15], in an SSC,the social structure and the social contexts, including the socialtrust and social relationships on edges, and the social rolesassociated with vertices usually stay stable in a very longperiod of time. This property makes it is realistic to indexand compress the graph in an SSC with low update cost.

Identifying all the SSCs in a specific domain subsumes theclassical NP-Complete maximum clique problem [32], whichis very time consuming. Alternatively, we can identify up toK SSCs for MC-GPM. We propose an SSC identificationmethod which first randomly selects K vertices that areassociated with high role impact factor values as the seeds.Then from each of the seeds, our algorithm adopts the Breadth-First Search (BFS) method to find the vertices associatedwith high role impact factor values connected by the edgesassociated with high social intimacy degrees and social trustvalues. In the worst case, our method needs to visit all thevertices and edges in a data graph. The time complexity ofthe SSC identification is O(NDED). The pseudo-code of thealgorithm is shown in Algorithm 1.

V. CONTEXT-PRESERVED GRAPH COMPRESSION FOR SSC

In this section, based on the existing graph compressionmethod for bounded simulation [33], we propose a context-aware graph compression method, where the reachability,graph pattern and social contexts are preserved. Moreover, thegraphs compressed by our approach can be directly queried

Algorithm 1: Strong Social Component IdentificationData: CSG: G(V,E, LV, LE), k, λV , λE

Result: SSCi (i ∈ [1, k])begin1

Select vi i ∈ [1, k], LV (vi) > λV ;2Set vi.visit = 0;3Put vi into ExpSet;4while ExpSet 6= ∅ do5

Get vi from ExpSet;6Remove vi from ExpSet;7if vi.visit = 0 then8

Set vi.visit = 1;9for each vj , (vj is neighbour vertices of vi) do10

if LV (vj) > λV and LE(vi, vj) > λE then11Put vj into ExpSet;12Add vj and E(vi, vj) into SSC;13

end

without any decompression. In contrast, the existing com-pression methods are not designed for solving the MC-GPMproblem, and thus they cannot preserve the social contextinformation. Rather, the existing approaches have to restorethe original graph from compact structures to answer a graphpattern query.

A. Compression for Reachability

A reachability query for a pair of vertices in a pattern graphGQ is to investigate if there exists at least one path linkingthe two vertices in a data graph, e.g., (B,C) of GQ3 in Fig.1. The graph compression property captured by Theorem 1preserves reachability information, which is called reachabilitypreserved compression, denoted as GRD.Theorem 1: The compressed graph is reachability preservedwhen the compressed two vertices have the same ancestorsand can reach the descendants of each other.Proof: Suppose there is a data graph GD = (V,E),where V = {A,B1, ..., Bn, C1, ..., Cn, D1, D2} andE = {(A,B1), ..., (A,Bn), (A,C1), ..., (A,Cn), (B1, D1), ..., (Bn, D1), (C1, D2), ..., (Cn, D2), (D1, D2), (D2, D1)}.GRD = (V,E), where is V = {A,B1...nC1...n, D1, D2} andE = {(A,B1...nC1...n), (B1...nC1...n, D1), (D1, D2)}. For areachability query GRD = (V,E), where V = {A,Bi, Dj},i ∈ [1, n] and j ∈ [1, 2], GRD is not reachabilitypreserved compression if and only if A and Bi,or Bi and Dj is not reachable, which contradictsE = {(A,B1...nC1...n), (B1...nC1...n, D1), (D1, D2)} inGRD. Therefore, Theorem 1 is proven. �Example 7: Fig. 4 contains two groups of graphs1. ConsiderFig. 4 (a), where GD7 is the original data graph and GRD7

is the compressed graph. From GD7, we can see that bothA and B do not have any ancestors, and they can reach thesame descendants (C, D and E). Therefore, A and B can becompressed as one vertex in GRD7, where the reachability ofA and B to other vertices is preserved.

1As the compressions do not consider any social context, in order to clearlydisplay the graph structure, the social contexts of the graphs are not shownin this example.

355

A B

C

E

D

A B

C

D

E

A

B1

C

B2

A

B1

C

B2

GD7 GRD7 GD8 GP

D8

(a) reachability preserved (b) graph pattern preserved

Fig. 4. Reachability preserved and graph pattern perserved compressions

B. Compression for Graph Pattern

In addition to reachability preserved compression, to supportthe graph pattern query, e.g., (A,D) of GQ4 in Fig. 2,we propose a graph compression method that can preservesuch graph patterns, which is called graph pattern preservedcompression, denoted as GPD. The property the compressionmethod is captured by Theorem 2.Theorem 2: The compressed graph is graph pattern preservedwhen the compressed two vertices have the same label, thesame ancestors and the same descendants.Proof: Suppose B1, ..., Bn are vertices in GD. A and C aretheir ancestor and descendant respectively. Then we can havea compressed graph GPD, where B1, ..., Bn are compressed asa single vertex B. If GQ = (V,E), where V = {A,B,C} andE = {(A,B), (B,C)}, GQEB

simGD. If GPD is not graph pat-tern preserved, for an edge (u, u′) in GQ, there exists an edge(v, v′) in GPD such that (u′, v′) /∈ S, (S ⊆ GQ×GD). Namely,there exists vertex Bi such that (A,Bi, GD) 6' (A,B,GPD) or(Bi, C,GD) 6' (B,C,GPD). This contradicts the assumptionthat B,..., Bn have the same label. Therefore, Theorem 2 isproven. �Example 8: Consider the example shown in Fig. 4 thatcontains a data graph GD8 and the corresponding compressedgraph GPD8. In GD8, we can see that B1 and B2 have the sameancestor A and the same descendant C. Therefore, based onTheorem 2, GPD8 is a graph pattern preserved compression ofGD8.Theorem 3: The graph pattern preserved compression isreachability preservedProof: As illustrated in Theorem 1 and Theorem 2, thecompression condition of graph pattern preserved compressionis more strict than that of reachability preserved compression.Therefore, Theorem 3 is proven. �

C. Compression for Social Contexts

In order to support MC-GPM, e.g., (B,D) of GQ4 inFig. 2, we propose a graph compression method that canpreserve social context information, which is called socialcontext preserved compression, denoted as GSD. The propertythe compression method is captured by Theorem 4.Theorem 4: The compressed graph is social context preservedwhen the compressed two vertices have the same label, the

A

B1

D

(C1, C2)

A

B1

D

GD9

C1 C2

TDi

AC2= 0.75

TDi

C2D= 0.6

TDi

AB1= 0.55

TDi

AC1= 0.78

TDi

C1D= 0.78

ρDi

C1= 0.82 ρDi

C2= 0.65ρDi

B = 0.72

rC1D = 0.82

rAC1 = 0.8

rAC2 = 0.78rAB1 = 0.88

rB1D = 0.73

TDi

B1D= 0.65

rC2D = 0.78

ρDi

A = 0.87

ρDi

D = 0.67

GSD9

ρDi

A = 0.87

ρDi

C1C2= 0.82

TDi

A,C1C2= 0.78

TDi

C1C2,D= 0.78

ρDi

B = 0.72

ρDi

D = 0.67

TDi

B1,D= 0.65

TDi

A,B1= 0.55

rA,B1 = 0.88

rB1,D = 0.73rC1C2,D = 0.82

rA,C1C2 = 0.8

Fig. 5. Social context preserved compression

same ancestors, the same descendants, and AS of the pathvia one of the vertex dominates that of the other one.Proof: Suppose B1, ..., Bn have the same ancestor A andthe same descendant C in GD. AS(A,C) via Bi (1 ≤i ≤ n) dominates others. Then, GD is compressed asGSD = (V,E), where V = {A,B1...Bn, C}, E ={(A,B1...Bn), (B1...Bn, C)} and AS(A,C) via B1...Bnequals to AS(A,C) via Bi. Given a query GQ for (A,C)with Blen(A,C) and the multiple constraints of social im-pact factors on edge (A,C), only the constraints on edge(A,B) will be investigated as Slen(A,C) = 2 via anyBk (1 ≤ k ≤ n). If GSD is not social context preserved,then the edge matching (A,C,GQ) ' (A,C,GD) is missingin GSD, namely one of the social impact factor in AS(A,C)via Bj (1 ≤ j ≤ n, and i 6= j) in GD is greater than thatof (A,C) in GSD (i.e., AS(A,C) via Bi). This contradicts theassumption that AS(A,C) via Bi (1 ≤ i ≤ n) dominatesothers. Therefore, Theorem 4 is proven. �Example 9: Consider the example shown in Fig. 5 whichcontains a data graph GD9 and the corresponding compressedgraph GSD9. In GD9, we can see that C1 and C2 have the sameancestor A and same descendant D. In addition, AS(A,D) viaC1 dominates AS(A,D) via C2. Then, based on Theorem 4,GD9 is compressed as GSD9, where C1 and C2 in GD9 arecompressed as one vertex and AS(A,D) via C1C2 in GSD9

equals to the dominated one in GD9, i.e., AS(A,D) via C1.Then GSD9 is a social context preserved.Theorem 5: The social context preserved compression isgraph pattern preserved and reachability preservedProof: As illustrated in Theorem 3 and Theorem 4, the com-pression condition of social context preserved compressionis more strict than that of that of graph pattern preservedcompression. Therefore, Theorem 5 is proven. �

D. Summary

Our graph compression method can preserve importantinformation of a CSG, including reachability, graph patternand social contexts information. Thus, a graph pattern queryof MC-GPM can be answered based on the compressed datagraph without any decompression. By compressing verticesand edges in the graph of an SSC, our graph compressionmethod can greatly save the memory and query processingtime (see details in experiments). In addition, in the worstcase, our compression method needs to visit all the vertices

356

and edges in a data graph. Therefore, the time complexityof our compression method is O(NDED). As mentioned inSection IV, the structure and the social contexts of the graph inan SSC usually stay stable in a very long period of time [15].Therefore, it is not necessary to update the compressed datagraph frequently. When there are some changes of the socialcontexts and/or graph structure in an SSC, some efficientincremental graph compression methods in the literature canbe adopted [10], [33]. These methods investigate if the changesaffect the GPM in the compressed graph, and they onlyrecompress a small subgraph that is affected by the changes,instead of recompressing the whole data graph.

VI. INDEX OF STRONG SOCIAL COMPONENTS

In order to improve the efficiency of MC-GPM, we proposea novel index structure to index the reachability, graph patternand social contexts in the compressed graphs.

A. Reachability Index

This index records a list of vertices that one can researchanother in a graph, where the index of each vertex containsthe ancestors and predecessors of the vertex.Example 10: Fig. 6 is an example of our index for the SSCin domain j of the graph depicted in Fig. 3. From the figure,we can see the the indices of each vertex include three parts:they are reachability index, graph pattern index and socialcontext index. We take vertex E as an example, as it hasboth ancestors and descendants. The reachability index of Erecords its ancestor C (i.e., Anc.: C), and its descendant H(i.e., Des.: H). Similarly, we construct the reachability indexfor each of the other vertices of the graph.

Given a reachability query, e.g., (B,C) of GQ3 in Fig. 1, ifthe query vertices are included in the SSC, we can investigatethe reachability immediately, greatly saving query processingtime.

B. Graph Pattern Index

After indexing the reachability information, we furtherindex the graph pattern information to improve the efficiencyof graph pattern queries. This index records the shortest pathlength between any two vertices in the graph of an SSC.Example 11: Consider the graph pattern index shown inFig. 6. For vertex E, in addition to index the reachabilityinformation, the graph pattern index records the shortest pathlength from its ancestor C to E (i.e., Slen = 1), and from Eto its descendant H (i.e., Slen = 1). Similarly, we constructthe graph pattern index for each of the other vertices.

Given a query of a graph pattern with the bounded length,e.g., (A,D) of GQ4 in Fig. 2, based on the graph pattern index,we can investigate if the indexed path length is greater thanthe bounded length, and thus can efficiently answer a query.

C. Social Context Index

In order to improve the efficiency of MC-GPM, we constructthe social context index to record the maximal aggregatedsocial impact factor values of the mapped paths in a data graph.Below are the details of the index.

E

C

F

H

Des.: E, Slen = 1, P len = 1, AS = {0.98, 0.92, 0.94}Des.: F, Slen = 1, P len = 1, AS = {0.95, 0.9, 0.9}Des.: H,Slen = 1, P len = 2, AS = {0.96, 0.88, 0.91}

Anc.: C, Slen = 1, P len = 2, AS = {0.96, 0.88, 0.91}Anc.: E, Slen = 1, P len = 1, AS = {0.98, 0.96, 0.88}Anc.: F, Slen = 1, P len = 1, AS = {0.9, 0.9, 0.84}

Des.: H, Slen = 1, P len = 1, AS = {0.98, 0.96, 0.88}Anc.: C, Slen = 1, P len = 1, AS = {0.98, 0.92, 0.94}

Des.: H, Slen = 1, P len = 2, AS = {0.9, 0.9, 0.84}Anc.: C, Slen = 1, P len = 1, AS = {0.95, 0.9, 0.9}

reachability index

graph pattern index

social context index

TDj

CF = 0.95

TDj

FH = 0.9

TDj

CE = 0.98

TDj

EH = 0.98

rCF = 0.9

rFH = 0.9

rCE = 0.92

rEH = 0.96

ρDj

C = 0.98

ρDj

E = 0.9ρDj

F = 0.82

ρDj

H = 0.86

Fig. 6. The index of SSC in domain j

• If each of the aggregated T , r and ρ values of one of thepaths between two vertices dominates others, we indexthat path length and the corresponding aggregated socialimpact factor values.

• Otherwise, we index up to three paths that have themaximal aggregated T , r and ρ values respectively.

Example 12: Consider the social context index shown in Fig.6. Here we take the vertex C as an example, where there aretwo paths from C to its descendant H , e.g., path p1(C,E,H)

and p2(C,F,H). As AS(p1(C,E,H)) dominates AS(p2(C,F,H)).Then, we index AS(p1(C,E,H))={0.96,0.88,0.91} and its pathlength Plen(p1(C,E,H)) = 2 at C. Similarly, we construct thesocial context index for each of the other vertices.

Given a graph pattern query with multiple constraints, e.g.,(B,D) of GQ4 in Fig. 2, based on the social context index, wecan quickly investigate if there exists an edge pattern matchingin the data graph, and thus saving query processing time.

D. Summary

The above three indices record important information ofthe graph in an SSC. Given an MC-GPM query, if the twovertices of a query edge can be mapped into a path in SSCs,this indexed information can be used to quickly investigate ifthere is an edge pattern matching, and thus greatly savingquery processing time (see details in the experiments). Inaddition, in the worst case, we need to perform the Dijkstra’salgorithm four times, and thus the time complexity of theindices construction is O(NDlogND + ED). Furthermore, asmentioned in Section IV, the structure and the social contextsof the graph in an SSC usually stay stable in a very longperiod of time. Therefore, usually it is not necessary toupdate the indices frequently, which reduces the cost of indexmaintenance. Moreover, the index update methods have beenwell developed in the literature [16], [20]. Thus, we would notdiscuss it in detail.

VII. AN MULTI-CONSTRAINED GRAPH PATTERNMATCHING ALGORITHM

Based on the compressed data graph and the indices inSSCs, we propose a novel heuristic algorithm for MC-GPM, called HAMC, with our proposed novel heuristic searchstrategies. HAMC firstly performs a Multi-Constrained Edge

357

Algorithm 2: MC-EPMData: (v, v′), (v, v′) ∈ GQ, GD

Result: (v∗, v∗∗), (v∗, v∗∗) ∈ GD and (v∗, v∗∗GD) ' (v, v′, GQ)begin1

while ∃v∗, v∗.visit = false do2Compute minδ(v∗, v∗∗);3if minδ(v∗, v∗∗) > 1;4then

return (v∗, v∗∗);5

else6v∗.visit = true;7

return ∅;8

end

Algorithm 3: Exploration-based graph pattern matchingData: GD, GQ

Result: GM

begin1GM = ∅ and Gtemp = ∅;2while ∃(v, v′) ∈ GQ, (v, v

′).explore = false doSelect (sv, vj) from GQ;3if MC− EPM(sv, vj) 6= ∅ then4

(sv, vj).explore = true;5(va, vb) = MC − EPM(sv, vj);6Gtemp = (V,E, LV, LE), where V = {va, vb} and7E = {(va, vb)};GM = GM

⋃Gtemp;8

Gtemp = ∅;9if vj is not a leaf vertex then10

Select (sv, vk) from GQ;if (sv, vk).explore = false andMC − EPM(sv, vk) 6= ∅ then

(va, vc) = MC − EPM(sv, vk);11Gtemp = (V,E, LV, LE), V = {va, vc} and12E = {(va, vc)};GM = GM

⋃Gtemp;13

Gtemp = ∅;14(sv, vk).explore = true;15

elseSelect (vj , vm) from GQ;16if (vj , vm).explore = false andMC − EPM(vj , vm) 6= ∅ then

(vd, ve) = MC − EPM(vj , vm);17Gtemp = (V,E, LV, LE), V = {vd, ve} and18E = {(vd, ve)};GM = GM

⋃Gtemp;19

Gtemp = ∅;20(vj , vm).explore = true;21

return GM ;22end

Pattern Matching (MC-EPM) based on the novel objectivefunction to investigate if there is an edge pattern matching inthe data graph. If an answer is returned, HAMC then performsan exploration-based GPM to answer an MC-GPM query.

A. Multi-Constrained Edge Pattern Matching

Multi-Constrained Edge Pattern Matching (MC-EPM)method first investigates if an edge pattern query in GQ canbe mapped into a path in SSCs. If so, MC-EPM checks theindices of each vertex in an SSC to determine if there is anedge pattern matching. Otherwise, it performs the Dijkstra’salgorithm to investigate the minimal value of the objectivefunction in Eq. (1), which can help find an edge matching ifone exists in GD.

δ(p) , max{(λDi

Tp

ATDip

), (λrpArp

), (λDiρp

AρDip

), (Plen

Blen)} (1)

where ATDip , Arp and AρDi

p are the aggregated social impactfactor values of path p; λDi

Tp, λrp and λDi

ρp are the correspond-ing constraints; Blen and Plen are the bounded path lengthand the identified path length respectively.

From the objective function, we can see that if an edgepattern query can be mapped into a path p in a data graph,δ(p) 6 1. Otherwise δ(p) > 1. Based on this property, HAMCadopts the Dijkstra’s algorithm to identify the path with theminimal δ value (denoted as δmin). If δmin(p) 6 1, there isan edge pattern matching. The pseudo-code of MC-EPM isshown in Algorithm 2.Example 13: Consider the example shown in Fig. 2, wherethe multiple social impact factor constraints and the boundedlength are given to edge (B,D) in GQ4. In order to re-turn an edge pattern matching answer in GD5, MC-EPMcomputes δmin(B,D) = 1 in GD5. Therefore, there is anedge pattern matching in GD5 for (B,D) in GQ4. The pathwith edges (B,C2), (C2, C1) and (C1, D) has δmin, and thus(B,C2, C1, D,GD5) ' (B,D,GQ4).

Below Theorem 6 establishes that HAMC is an effectivealgorithm in MC-GPM.Theorem 6: HAMC can return an edge pattern matchinganswer if one exists in the data graph.Proof: Assume GQ = (V,E), and (vi, vj) ∈ E is an edgepattern query. Let p∗ be a path from vi to vj in GD with theminimal δ at vi returned by the HAMC, and p∗∗ is another pathbetween vi and vj in GD, where (vi, vj , GQ) ' (p∗∗, GD).Then, assume (vi, vj , GQ) 6' (p∗, GD), then ∃ϕ ∈ {T, r, ρ}that ASϕp∗ < λDi

ϕ(vi,vj)or Plen(p∗) > Blen(vi, vj). Hence,

δ(p∗) > 1. Since p∗∗ is an edge pattern matching, thenδ(p∗∗) ≤ 1 and δ(p∗) > δ(p∗∗). This contradicts δ(p∗) ≤δ(p∗∗). Therefore, (vi, vj , GQ) ' (p∗, GD). Theorem 6 isproven. �

Summary: MC-EPM is the first part of HAMC, which iseffective in MC-GPM as it can return an edge pattern matchinganswer if one exists in GD. In addition, in the worst case thereis no vertex included into any SSCs of GD. Then, for an edgepattern query in GQ, HAMC employs a Dijkstra’s algorithmfor M pairs of vertices in GD. Therefore, the time complexityof MC-GPM is O(MNDlogND +MED).

B. Exploration-Based Graph Pattern Matching

In the literature, there are two popular methods to answera GPM query based on the edge pattern matching. They arethe join-based method [2], [9], [10] and the exploration-basedmethod [16], [17]. The join-based method aims to find amaximal matching that contains all matching subgraphs ina data graph, while the exploration-based method aims toquickly answer a GPM query.

As MC-GPM is NP-Complete, it is computationally in-feasible to find all the matching subgraphs in GD. In or-der to quickly answer an MC-GPM query, we propose

358

an Exploration-Based Graph Pattern Matching (EB-GPM)method, presented below.Step 1: Starting from a source vertex (a vertex with indegreezero, denoted as sv), EB-GPM firstly returns an edge patternmatching result based on the above introduced MC-EPM.Step 2: Mark the matching edge as explored and investigateif the end point of the edge (denoted as ep) is a leaf vertex inthe query.• If the end point of the edge is a leaf vertex, EB-GPM

rolls back to the start point of the edge (denoted as sp)and matching another unmatched edge starting from svin GQ.

• Otherwise, EB-GPM continues to investigate anotherunexplored edge from ep.

Step 3: If sp = sv and each of the edge pattern queries in GQcorresponds to an explored edge in GD, an MC-GPM queryanswer is returned.Example 14: Consider the MC-GPM example in Fig. 2,which contains two source vertices, A and B. Then from A,MC-EPM returns an edge pattern matching result in GD5 as(A,C1, D,GD5) ' (A,D,GQ4). As D is a leaf vertex inGQ4, EB-GPM rolls back to source vertex A to investigate ifan edge from A has not been explored in GQ4. As (A,C) isunexplored, MC-EPM returns another edge pattern matchingresult in GD5 as (A,C1, GD5) ' (A,C,GQ4). Since allthe edges from A are explored, EB-GPM then completesthe same process from the other source vertex B by MC-EPM. Then EB-GPM can return an MC-GPM answer asGM = (V,E,LV, LE), where V = {A,B,C1, C2, D} andE = {(A,C1), (B,C2), (C2, C1), (C1, D)}.

If there are more than one MC-GPM results included ina data graph, we can use EB-GPM to return other resultsby replacing one of the explored matching edge with anotherunexplored matching edge in the data graph.

The pseudo-code of EP-GPM method is shown in Algo-rithm 3. EP-GPM performs EQ times of MC-EPM methods.The time complexity of HAMC is O(EQMNDlogND +MEQED). But EQ and M have an inverse relation asM = ED

EQ[4]. Namely, when EQ has a large value, e.g.,

EQ = ED, M = 1, and vice versa. Therefore, the timecomplexity of our HAMC is O(EDNDlogND + EQED).

C. Summary

Our proposed HAMC algorithm is an efficient and ef-fective method for the NP-Complete MC-GPM problem inlarge-scale contextual social graphs. Our method achieves inO(EDNDlogND+EQED) computation cost. Moreover, if thematching edges are included into the graphs of SSCs, HAMCachieves the outstanding O(EQ) computation cost.

VIII. EXPERIMENTS

We conduct experiments on five large-scale real-world so-cial graphs to evaluate (1) the performance our algorithm inanswering MC-GPM queries; (2) the effectiveness of our indexfor SSC in improving the efficiency of MC-GPM, and (3) the

TABLE ITHE SOCIAL DATASETS

Name vertices Edges DescriptionEpinions 75,879 508,837 A trust-oriented social networkDBLP 317,080 1,049,866 A co-author relationship network

Youtube 1,134,890 2,987,624 A video recommendation social networkPokec 1,632,803 30,622,564 A general online social network

LiveJournal 4,847,571 68,993,773 A general online social network

effectiveness of our graph compression approach in reducingmemory usage and improving the efficiency of MC-GPM.

A. Experiment setting

Datasets:We use five large-scale real-world social graphs available at

snap.stanford.edu, which have been widely used in the litera-ture for graph pattern matching and social network analysis.The details of these datasets are shown in Table I.Graph Pattern Query and Paramater Setting:• As we discussed in Section III, the social context impact

factor values (i.e., T , r and ρ) can be mined from theexisting social networks, which is another very challeng-ing problem, but out of the scope of this work. Moreover,in the real cases, the values of these impact factors canvary from low to high value without any fixed patterns.Without loss of generality, we randomly set the values ofthese impact factors by using the function rand() in SQL.In addition, in each of the datasets, the SSC number isset to 20, 40, 60, 80 and 100 respectively.

• We use a simple MC-GPM query that has the samegraph structure as the graph in Fig. 6. Moreover, a set ofrelative low constraints are specified as λDi

T = 0.05, λr =0.05, λDi

ρ = 0.2, and Blen = 4. These settings ensurethe high possibility of returning MC-GPM answers ina data graph. Otherwise, no or only few answers mightbe returned by all the algorithms, making it difficult toinvestigate their performance.

Implementation:As we discussed in Section II, there is no existing GPM

method in the literature for the MC-GPM problem. There-fore, in the experiments, (1) we first implement BaseLineextended from bounded simulation [2] to illustrate a baselineperformance of MC-GPM; (2) we then implement our HAMCalgorithm without graph compression (denoted as HAMC-NoCompression) to investigate the influence of indices on MC-GPM; (3) we implement HAMC algorithm with both indicesand graph compression to further investigate the influence ofgraph compression on MC-GPM; and (4) As returning all theMC-GPM answers included in a GD is NP-Complete [34],we compare the performance of three algorithms in finding acertain number of answers.

All BaseLine, HAMC-NoCompression and HAMC algo-rithms are implemented using Visual C++ running on a PCwith Intel Core i5-3470 3.20GHz CPU, 16GB RAM, Win-dows 7 operating system and MySql 5.6 database. All theexperimental results are averaged based on five independentruns.

359

Epinions DBLP Youtube Pokec LiveJournal0

50

100

150

200

250

300

350

ThenumberofG

M

HAMC−NoCompressHAMCBaseline

Fig. 7. The average number of answers returned after visiting 1000 mappedpaths

B. Experimental Results and Analysis

Exp-1: Effectiveness. This experiment is to investigate theeffectiveness of our HAMC by comparing the average numberof GPM answers returned by the three methods.

Results: Fig. 7 depicts the average number of GPM answersreturned by each of BaseLine, HAMC-NoCompression andHAMC after visiting 1,000 mapped paths on the five datasets.From the figure, we can see that (1) the number of answersreturned by BaseLine is always less than that of HAMC-NoCompressions and HAMC; and (2) HAMC returns the samenumber of answers as HAMC-NoCompression. Statistically,on average, HAMC and HAMC-NoCompression can return37.8% more answers than BaseLine. Thus, they are moreeffective than BaseLine.

Analysis: The experimental results illustrate that (1) Base-Line ineffectively visits many paths that do not correspondingto any matching edge in GPM, while based on Theorem 6,our HAMC and HAMC-NoCompression can return an edgepattern matching result if one exists in a data graph, showingthey are more effective in answering MC-GPM queries thanBaseLine; and (2) As illustrated in Theorem 5, our compres-sion method is reachability preserved, graph pattern preservedand social context preserved. Therefore, HAMC and HAMC-NoCompression can visit the same mapped path based onthe same MC-EPM, thus they can return the same numberof answers based on EP-GPM.Exp-2: Efficiency. This experiment is to investigate the ef-ficiency of our HAMC by (1) comparing the average queryprocessing time of the three methods for outputting differentnumbers of answers, and (2) comparing their average queryprocessing time under different numbers of SSCs when re-turning 100 query answers.

Result-1: Fig. 8 depicts the average query processing timeof BaseLine, HAMC-NoCompression and HAMC in returningthe answers (i.e., GM ) of the MC-GPM query on the fivedatasets. From the figure, we can see that (1) when thenumber of answers increases, the total average query pro-cessing time of the three methods increases for datasets; (2)BaseLine spends more query processing time than HAMC-NoCompression and HAMC in all the cases; and (3) HAMChas the best efficiency for the MC-GPM in all the five datasets.Statistically, on average, the query processing time of HAMCis 21.6% less than that of BaseLine, and 5.7% less than thatof HAMC-NoCompression.

Analysis-1: Result-1 illustrates (1) BaseLine needs morequery processing time than others as it spends much time in

TABLE IITHE COMPARISON OF THE AVERAGE QUERY PROCESSING TIME BETWEEN

HAMC AND OTHERS WITH DIFFERENT NUMBER OF SSC

SSCComparison with

BaseLineComparison with

HAMC-NoCompression20 13.27% less 4.67% less40 19.89% less 4.76% less60 24.06% less 5.60% less80 27.64% less 6.09% less100 32.19% less 4.27% less

TABLE IIITHE COMPARISON OF THE AVERAGE QUERY PROCESSING TIME BETWEEN

HAMC AND OTHERS ON FIVE DATASETS

DatasetComparison with

BaseLineComparison with

HAMC-NoCompressionEpinions 8.11% less 2.09% lessDBLP 16.9% less 7.82% less

Youtube 35.88% less 8.84% lessPokec 39.67% less 8.59% less

LiveJournal 34.18% less 2.81% less

each edge pattern query; (2) our novel heuristic search strategyand the proposed index structure can greatly save the queryprocessing time, as some of the edge pattern matching canbe returned via searching indexed information; and (3) theproposed graph compression reduces the index search space,and thus HAMC is more efficient than the other two methods.

Result-2: Fig. 9 depicts the average query processing timewhen returning 100 answers for the MC-GPM query withdifferent numbers of SSCs on the five datasets. From thefigure, we can see that (1) with the increase of the numberof SSC, the average query processing time of both HAMC-NoCompression and HAMC decreases in all the cases; (2)BaseLine needs more query processing time than others evenat SSC = 20 on the five datasets; and (3) under a certainnumber of SSCs, HAMC has the minimal query processingtime among all the three methods. The detailed experimentalresults are listed in Table II. Statistically, on average, the queryprocessing time of HAMC is 21.4% less than that of BaseLineand 4.2% less than that of HAMC-NoCompression.

Analysis-2: Result-2 illustrates (1) the large query process-ing time of BaseLine is consistent with the first analysis inAnalysis-1; (2) the more the SSCs, the higher the probabilityof finding the mapped edges based on the indexed information,and thus the query processing time decreases with the increaseof the number of SSC; and (3) as analysed in Analysis-1, ourgraph compression reduces the index search space, and thusHAMC has the minimal query processing time of all methods,which is consistent with Analysis-1.Exp-3: Scalability. This experiment is to investigate thescalability of our HAMC by comparing the average queryprocessing time of returning a query answer in social graphswith different scales.

Results: Fig. 10 depicts the average query processing timeof the MC-GPM query under different scales of social graphswith a certain number of SSCs. From the figure, we can seethat (1) both HAMC-NoCompression and HAMC scales betterthan BaseLine; (2) The larger the number of SSCs, the betterthe scalability of our method, and (3) HAMC has the bestscalability among all the three methods. The detailed results

360

20 40 60 80 10050

100

150

200

250

300

350

400

450

(a) Epinions

The number of GM

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

80.276.4

85.2

164.9157.4

168.7

247.4235.6

252.1

326.1314.8

339.5

401.7

386.7

421.9HAMC−NoCompress

HAMC

BaseLine

20 40 60 80 100100

200

300

400

500

600

700

800

900

1000

1100

(b) DBLP

The number of GMA

vera

ge q

uery

pro

cessin

g tim

e (

Sec.)

175.4162.3

214.0

355.8328.5

425.6

530.3486.8

641.1

731.7

670.1

872.2877.4

806.4


HAMC

BaseLine

20 40 60 80 100200

400

600

800

1000

1200

1400

1600

1800

2000

2200

(c) Youtube

The number of GM

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

283.2270.3

403.3

623.2575.3

866.4953.2

865.3

1270.2 1213.2

1105.3

1589.41455.3

1371.8


HAMC

BaseLine

20 40 60 80 1000

500

1000

1500

2000

2500

(d) Pokec

The number of GM

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

390.4373.6

511.3692.2663.3

923.41093.2

1042.4

1435.5 1525.4

1492.2

2015.31908.4

1845.8


HAMC

BaseLine

20 40 60 80 100500

1000

1500

2000

2500

3000

3500

4000

4500

5000

(e) LiveJournal

The number of GM

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

905.5825.4

1012.3

1834.5

1648.5

2028.7

2522.5

2453.2

2905.33128.5

3002.4

3721.44066.7

3801.4


HAMC

BaseLine

Fig. 8. The average query processing time based on different numbers of Gm

20 40 60 80 100

380

390

400

410

420

430

440(a) Epinions

The number of SSC

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

409.7

396.0

421.9

398.1

393.4395.1

387.2390.2

384.5386.8

377.2

HAMC−NoCompress

HAMC

BaseLine

20 40 60 80 100700

750

800

850

900

950

1000

1050

1100

1150

1200(b) DBLP

The number of SSC

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

980.6

910.2

1065.3

940.7

851.8868.7

802.2

834.0

743.4

774.2

702.6

HAMC−NoCompress

HAMC

BaseLine

20 40 60 80 100800

1000

1200

1400

1600

1800

2000

2200

2400(c) Youtube

The number of SSC

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

1875.2

1771.2

2041.1

1600.7

1411.2 1400.4

1268.5 1274.4

1145.21035.4972.4

HAMC−NoCompress

HAMC

BaseLine

20 40 60 80 100

1600

1800

2000

2200

2400

2600

(d) Pokec

The number of SSC

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

2218.5

2103.3

2472.2

2003.8

1987.31887.3

1845.41766.4

1702.91598.3

1588.3

HAMC−NoCompress

HAMC

BaseLine

20 40 60 80 1003600

3800

4000

4200

4400

4600

4800

5000(e) LiveJournal

The number of SSC

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

4237.3

4086.3

4684.3

4112.3

3915.3

4043.1

3810.4

3967.8

3755.23773.4

3604.3

HAMC−NoCompress

HAMC

BaseLine

Fig. 9. The average query processing time based on different numbers of SSC


5

10

15

20

25

30

35

40

45

50SSC=20

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

4.103.96

4.219.819.10

10.65

18.7517.71

20.41 20.46

18.84

24.72

36.29

34.99


HAMC

BaseLine


5

10

15

20

25

30

35

40

45

50SSC=40

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

3.983.94

4.229.418.51

10.65

15.93

13.86

20.41

17.30

16.22

24.72

33.12

31.83


HAMC

BaseLine


5

10

15

20

25

30

35

40

45

50SSC=60

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

3.953.87

4.228.69

8.02

10.65

14.00

12.69

20.41

16.88

14.43

24.72

32.48

30.13


HAMC

BaseLine


5

10

15

20

25

30

35

40

45

50SSC=80

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

3.903.85

4.228.347.43

10.65312.75

11.45

20.41

14.60

14.01

24.72

31.12

30.40


HAMC

BaseLine


5

10

15

20

25

30

35

40

45

50SSC=100

Avera

ge q

uery

pro

cessin

g tim

e (

Sec.)

3.873.77

4.227.747.02

10.65 10.35

9.72

20.41

12.35

11.07

24.7227.74

26.82


HAMC

BaseLine

Fig. 10. The average query processing time of returning each MC-GPM answer

0 20 40 60 80 1001.675

1.68

1.685

1.69

1.695

1.7

1.705x 10

5 (a) Epinions

The number of SSC

Mem

ory

usage (

Kb.)

167712

168464

167832

169072

168024

169672

168220

170044

168472

170360

168584

HAMC−NoCompress

HAMC

Baseline

0 20 40 60 80 1003.82

3.83

3.84

3.85

3.86

3.87

3.88x 10

5 (b) DBLP

The number of SSC

Mem

ory

usage (

Kb.)

382580

383696

382952

384524

383758

385296

384214

386312

384821

387592

385573

HAMC−NoCompress

HAMC

Baseline

0 20 40 60 80 100

1.508

1.509

1.51

1.511

1.512

1.513

1.514

1.515x 10

6 (c) Youtube

The number of SSC

Mem

ory

usage (

Kb.)

1508100

1509532

1509060

1510984

1509938

1512224

1510728

1513660

1511748

1514252

1512108

HAMC−NoCompress

HAMC

Baseline

0 20 40 60 80 1009.453

9.4535

9.454

9.4545

9.455

9.4555

9.456x 10

6 (d) Pokec

The number of SSC

Mem

ory

usage (

Kb.)

9453352

9453884

9453468

9454136

9453667

9454700

9453832

9455484

9454124

9455996

9454688

HAMC−NoCompress

HAMC

Baseline

0 20 40 60 80 1001.1733

1.1734

1.1735

1.1735

1.1736

1.1736

1.1737

1.1737

1.1738

1.1738x 10

7 (e) LiveJournal

The number of SSC

Mem

ory

usage (

Kb.)

11733940

11734628

11733821

11735524

11734201

11736004

11734500

11736836

11735011

11737568

11735423

HAMC−NoCompress

HAMC

Baseline

Fig. 11. The memory usage

are listed in Table III.Analysis: The scalability experiment illustrates that (1)

BaseLine has the worst scalability in all the three meth-ods which is consistent with the time complexity analy-sis, i.e., O(EQNNDlogND + NEQED) for BaseLine andO(EQNDlogND+EQED) for the other two methods; (2) thelarger the number of SSC, the more the vertices and edgesare indexed, hence improving the scalability; and (3) HAMChas the best scalability as graph compression can reduce thesize of the original data graph.Exp-4: Memory Usage. This experiment is to investigate thememory usage of three methods.

Results: Fig. 11 depicts the memory usage of the three

methods under different numbers of SSCs on the five datasets.From the figure, we can see that (1) with the increase ofthe number of SSC, the memory usage of both HAMC-NoCompression and HAMC increases; (2) The memory usageof HAMC is lower than that of HAMC-NoCompression inall the cases; and (3) BaseLine consumes less memory thanHAMC and HAMC-NoCompression in all the cases exceptfor SSC = 20 in LiveJournal dataset, where HAMC con-sumes less memory than BaseLine. The memory usage withdifferent number of SSC is listed in Table IV. Statistically,on average, HAMC can save 1,267 KB memory than HAMC-NoCompression for a single dataset.

Analysis: The experimental results of memory usage illus-

361

TABLE IVTHE COMPARISON OF MEMORY USAGE BETWEEN HAMC AND

HAMC-NOCOMPRESSION WITH DIFFERENT NUMBERS OF SSC

SSC HAMC HAMC-NoCompression Save20 4,649,427 KB 4,650,041 KB 614 KB40 4,649,917 KB 4,650,848 KB 931 KB60 4,650,298 KB 4,651,579 KB 1,281KB80 4,650,835 KB 4,652,467 KB 1,632 KB

100 4,651,275 KB 4,653,153 KB 1,878 KB

trate that (1) HAMC-NoCompression needs to index the graphstructures and social context in an SSC, thus it consumes morememory than BaseLine; (2) HAMC compresses the graph in anSSC, thus can reduce the memory of storing and indexing thevertices and edges that are removed by the graph compressionmethod. In addition, the larger the number of SSC, the higherthe probability of compressing more vertices and edges. ThusHAMC consumes less memory than HAMC-NoCompression;and (3) All the social impact factors in an SSC are required tohave high values. Usually, only a few vertices and edges canbe included and compressed in an SSC, and thus the savedmemory by graph compression is usually less than the usageof storing indices. So, in most of the cases of SSC, HAMCconsumes more memory than BaseLine.Summary:

The above experimental results have demonstrated that theproposed heuristic matching strategies adopted in HAMCprovide an effective means to answer MC-GPM queries. Inaddition, with our proposed index structure and graph com-pression methods, HAMC can greatly save query processingtime, which makes HAMC significantly outperform BaseLinein effectiveness, efficiency and scalability with low memoryusage. Therefore, our HAMC is a very competitive algorithmfor the new NP-Complete MC-GPM problem in social networkbased applications.

IX. CONCLUSION

In this paper, we have proposed a new Multi-ConstrainedSimulation (MCS) to support a new type of Multi-ConstrainedGraph Pattern Matching (MC-GPM) that is a corner stonefor many social network based applications. Then, we havedeveloped a novel concept, strong social component, uponwhich we have designed a novel index structure and acontext-preserved graph compression method. Finally, we haveproposed a heuristic algorithm, HAMC employs our novelheuristic matching strategies for the NP-Complete MC-GPMproblem. HAMC achieves O(EDNDlogND+EQED) in timecost, and the experiments conducted on five real-world large-scale social graphs have demonstrated the superiority of ourproposed approaches in terms of effectiveness, efficiency,scalability and memory cost.

ACKNOWLEDGEMENT

This work was partially supported by Natural Science Foun-dation of China (Grant Nos. 61232006, 61303019, 61003044),Australian Research Council DP (DP140103171), DoctoralFund of Ministry of Education of China (20133201120012),and Collaborative Innovation Center of Novel Software Tech-nology and Industrialization, Jiangsu, China.

REFERENCES

[1] J. Brynielsson, J. Hogberg, L. Kaati, C. Martenson, and P. Svenson,“Detecting social positions using simulation,” in ASONAM’10, pp. 48–55.

[2] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu, “Graph patternmatching: From intractable to polynomial time,” in VLDB’10, pp. 264–275.

[3] W. Fan, X. Wang, and Y. Wu, “Expfinder: Finding experts by graphpattern matching,” in ICDE’13, pp. 1316–1319.

[4] M. R. Henzinger, T. Henzinger, and P. Kopke, “Computing simulationson finite and infinite graphs,” in FOCS’95.

[5] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry, “3-hop: a high compressionindexing scheme for reachability query,” in SIGMOD’09, pp. 813–826.

[6] L. Zou, L. Chen, and M. T. Ozsu, “Distance-join: Pattern match queryin a large graph database,” in VLDB’09, pp. 886–897.

[7] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, “Capturing topology ingraph pattern matching,” in VLDB’11, pp. 310–321.

[8] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu, “Adding regular expressionsto graph reachability and pattern queries,” in ICDE’11, pp. 39–50.

[9] W. Fan, X. Wang, and Y. Wu, “Diversified top-k graph pattern matching,”in VLDB’14.

[10] W. Fan, X. Wang, and Y.Wu, “Answering graph pattern queries usingviews,” in ICDE’14, pp. 184–195.

[11] G. Liu, Y. Wang, and M. A. Orgun, “Optimal social trust path selectionin complex social networks,” in AAAI’10, 2010, pp. 1391–1398.

[12] R. Milano, R. Baggio, and R. Piattelli, “The effects of online socialmedia on tourism websites,” in ITTT, 2011, pp. 471–483.

[13] G. Liu, Y. Wang, and M. A. Orgun, “Social context-aware trust networkdiscovery in complex contextual social networks,” in AAAI’12, 2012.

[14] J. M. Jaffe, “Algorithms for finding paths with multiple constraints,”Networks, vol. 14, pp. 91–116, 1984.

[15] P. Berger and T. Luckmann, The Social Construction of Reality: ATreatise in the Sociology of Knowledge. Anchor Books, 1966.

[16] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraphmatching on billion node graphs,” in VLDB’12, pp. 788–799.

[17] J. Cheng, X. Zeng, and J. X. Yu, “Top-k graph pattern matching overlarge graphs,” in ICDE’13, pp. 1033–1044.

[18] X. Yan, P. S. Yu, and J. Han, “Substructure similarity search in graphdatabases,” in SIGMOD’05, pp. 766–777.

[19] H. Shang, X. Lin, Y. Zhang, J. X. Yu, and W. Wang, “Connectedsubstructure similarity search,” in SIGMOD’10, pp. 903–914.

[20] Y. Zhu, L. Qin, J. X. Xu, and H. Cheng, “Finding top-k similar graphsin graph databases,” in EDBT’12, pp. 456–467.

[21] F. N. Afrati, D. Fotakis, and J. D. Ullman, “Enumerating subgraphinstances using map-reduce,” in ICDE’13, pp. 62–73.

[22] J. Gao, C. Zhou, J. Zhou, and J. X. Yu, “Continuous pattern detectionover billion-edge graph using distributed framework,” in ICDE’14, pp.556–567.

[23] Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu, “Parallel subgraphlisting in a large-scale graph,” in SIGMOD’14, pp. 625–636.

[24] J. Huang, K. Venkatraman, and D. J. Abadi, “Query optimization ofdistributed pattern matching,” in ICDE’14, pp. 64–75.

[25] W. Fan, X. Wang, and Y. Wu, “Querying big graphs within boundedresources,” in SIGMOD’14, pp. 301–312.

[26] J. M. Jaffe, “Graph-based shape indexing,” Machine Vision and Appli-cations, 2010.

[27] Y. Tian and J. Patel, “Tale: A tool for approximate large graph matching,”in ICDE’08.

[28] F. L. S. Yoo, Y. Yang and I. Moon, “Mining social networks forpersonalized email prioritization,” in KDD’09, 2009, pp. 967–976.

[29] J. Tang, J. Zhang, L. Yan, J. Li, L. Zhang, and Z. Su, “Arnetminer:Extraction and mining of academic social networks,” in KDD’08, 2008,pp. 990–998.

[30] G. Liu, Y. Wang, and M. A. Orgun, “Trust transitivity in complex socialnetworks,” in AAAI’11, pp. 1222–1229.

[31] D. Eppstein, “Finding the k shortest paths,” SIAM Journal on Computing,vol. 28, no. 2, pp. 652–673, 1999.

[32] N. Biggs, E. Lloyd, and R. Wilson, Graph Theory. Oxford UniversityPress, 1986.

[33] W. Fan, J. Li, X. Wang, and Y. Wu, “Query preserving graph compres-sion,” in SIGMOD’12, pp. 157–168.

[34] J. M. Jaffe, “Algorithms for finding paths with multiple constraints,”Networks, vol. 14, pp. 95–116, 1984.

362

conference: multi-constrained graph pattern matching in large-scale contextual social...

Documents