Computer Science and Engineering

TreeSpan: Efficiently Computing Similarity All-Matching

Gaoping Zhu#, Xuemin Lin#, Ke Zhu#, Wenjie Zhang#, Jeffrey Xu Yu†

# The University of New South Wales
† The Chinese University of Hong Kong

Outline

• Introduction
• State-of-the-Art
• Our Approach
• Experiments
• Conclusions

Introduction — Graph Data

• Chem-informatics: chemical compounds (small size)
• Bio-informatics: PPI networks (medium size)
• Internet: World Wide Web (large size)

Introduction — Exact All-Matching (I)

Exact All-Matching
Enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.

Applications
• Query biological patterns in PPI networks.
• Detect suspicious bugs in software programs.
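As a rough illustration of exact all-matching, the sketch below enumerates label-preserving injective mappings from q to G under which every edge of q maps to an edge of G. The graph encoding (label dictionaries plus edge lists), the function name `exact_all_matching`, and the toy graphs are assumptions made for this example. The brute-force search is for illustration only and is not the algorithm used in the slides.

```python
from itertools import permutations

def exact_all_matching(q_labels, q_edges, g_labels, g_edges):
    """Enumerate mappings u -> v from q to G that keep labels and all query edges.

    q_labels / g_labels: dict vertex -> label (hypothetical encoding).
    q_edges / g_edges: iterable of undirected edges given as (u, v) pairs.
    """
    g_edge_set = {frozenset(e) for e in g_edges}
    q_vertices = sorted(q_labels)
    matches = []
    # Try every injective assignment of data vertices to query vertices.
    for image in permutations(sorted(g_labels), len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        if any(q_labels[u] != g_labels[mapping[u]] for u in q_vertices):
            continue  # a vertex label differs
        if all(frozenset((mapping[u], mapping[w])) in g_edge_set for u, w in q_edges):
            matches.append(mapping)
    return matches

# Toy example: a labeled triangle query inside a 4-vertex data graph.
q_labels = {"u1": "A", "u2": "B", "u3": "C"}
q_edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]
g_labels = {"v1": "A", "v2": "B", "v3": "C", "v4": "A"}
g_edges = [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v4", "v2")]
matches = exact_all_matching(q_labels, q_edges, g_labels, g_edges)
```

In this toy input, v4 carries the right label for u1 but lacks an edge to v3, so only the mapping through v1 survives.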

[Figure: a query graph q, a data graph G, and the exact matches of q in G; vertices are labeled A, B, C, D.]

Introduction — Exact All-Matching (II)

Dilemma of Exact All-Matching
• If q is issued by a user for exploratory purposes …
• If G is noisy due to imprecise data collection …
No exact matches can be found!

Potential Solutions
• Modify q or G and run exact all-matching again and again.
• Ask the system to return approximate results (i.e., similarity all-matching).

[Figure: data graph G and query graph q'.]

SAPPER [VLDB’10 Zhang et al.] (I)

Similarity All-Matching
Given a query graph q, a data graph G, and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G missing at most θ edges of q).

Framework
• Enumerate a set of seeds QSAPPER (i.e., all connected subgraphs q’ of q missing θ edges of q).
• Run exact all-matching on each seed q’ to obtain its exact matches.
• Induce similarity matches from the exact matches of the seeds.
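The seeding step can be sketched as follows. This is a minimal illustration, assuming that a seed keeps every vertex of q and drops exactly θ edges while staying connected; the function names and the example query are hypothetical and not taken from SAPPER itself.

```python
from itertools import combinations

def is_connected(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected."""
    vertices = set(vertices)
    if not vertices:
        return True
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adjacency[v] - seen)
    return seen == vertices

def sapper_seeds(vertices, edges, theta):
    """Connected subgraphs of q obtained by dropping exactly theta edges."""
    seeds = []
    for dropped in combinations(edges, theta):
        kept = [e for e in edges if e not in dropped]
        if is_connected(vertices, kept):
            seeds.append(kept)
    return seeds

# Hypothetical query: the 4-cycle 0-1-2-3 plus the chord (0, 2).
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
seeds = sapper_seeds(vertices, edges, theta=1)
```

Each seed then costs one exact all-matching test, which is why the number of seeds drives the total cost.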

SAPPER [VLDB’10 Zhang et al] (II)

Cost Model

$C_{total} = C_{seeding} + \sum_{q' \in Q_{SAPPER}} C_{all\text{-}matching}(q', G) + C_{inducing}$

|QSAPPER| = # of exact all-matching tests

[Figure: data graph G (vertices v1 to v5), query graph q (vertices u1 to u4, θ = 1) with its seeds q'1 to q'4, and the two matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]

Our Approach — Overview (I)

Tree-based Spanning Search Paradigm (TSpan)
Enumerate a set of seeds QT (i.e., spanning trees of q that cover all connected subgraphs q’ of q missing θ edges of q).

Primary Contribution
• Reduce the # of exact all-matching tests (i.e., the # of seeds).
• Reduce the complexity of each exact all-matching test from graph-to-graph to tree-to-graph.

[Figure: query graph q (θ = 2) with its SAPPER seeds, labeled "more SAPPER seeds".]
• 3 all-matching tests on connected subgraphs of q
• 1 all-matching test on a spanning tree of q

Our Approach — Overview (II)

Generating Similarity Maximal Matches

Generating only similarity maximal matches can reduce the # of exact all-matching tests.

[Figure: data graph G (vertices v1 to v5), query graph q (vertices u1 to u4, θ = 1), and the similarity maximal matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]

Our Approach — Problem Statement

Similarity Maximal All-Matching
Given a query graph q, a data graph G, and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G conforming to θ.

Our Approach — Seeding (I)

PRIM Order on Spanning Trees
• Similar to the basic idea of a minimum spanning tree.
• Given a total order on E(q), a spanning tree T = {T[0], T[1], …, T[|V(q)| − 1]} of q conforms to PRIM order (T[0] is the head vertex) if and only if each spanning edge T[i] has the smallest order among the edges in E(q) − {T[1], ..., T[i − 1]} that connect to {T[0], T[1], ..., T[i − 1]}.
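The PRIM order can be illustrated with a small sketch. It reads the definition as: at every step, take the first edge in the total order that links an already covered vertex to an uncovered one. The vertex names, the edge-to-endpoint layout, and the optional `excluded` parameter are assumptions for illustration; they are not read from the slide figure.

```python
def prim_order_tree(edges, head, excluded=frozenset()):
    """Build the spanning tree that conforms to PRIM order.

    edges: list of (name, u, v) sorted by the total order on E(q).
    head: the head vertex T[0].
    excluded: edge names that may not be used (hypothetical extension).
    Returns the edge names T[1], T[2], ... or None if no spanning tree exists.
    """
    vertices = {x for _, u, v in edges for x in (u, v)}
    covered, tree = {head}, []
    while covered != vertices:
        # First edge in the total order linking a covered and an uncovered vertex.
        choice = next(((name, u, v) for name, u, v in edges
                       if name not in excluded and (u in covered) != (v in covered)),
                      None)
        if choice is None:
            return None
        name, u, v = choice
        tree.append(name)
        covered |= {u, v}
    return tree

# Assumed layout of a 4-vertex, 6-edge query (hypothetical endpoints).
edges = [("e1", "a", "b"), ("e2", "b", "c"), ("e3", "c", "d"),
         ("e4", "a", "d"), ("e5", "b", "d"), ("e6", "a", "c")]
tree = prim_order_tree(edges, head="a")
```

Under this layout the unrestricted tree is e1, e2, e3, and forbidding e3 yields e1, e2, e4.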

[Figure: query graph q with edges e1 to e6, and a spanning tree T of q with edges e1, e2, e3.]

Our Approach — Seeding (II)

Avoid Duplicate Results
• Two spanning trees of q may induce duplicate similarity maximal matches.
• Associate an edge exclusion set T.R with each T in QT.
• T.R is a set of edges in E(q) − E(T) that are enforced to be mismatched in the similarity maximal matches induced by T.

[Figure: query graph q (θ = 2), data graph G, and two spanning trees T1 and T2 of q with T1.R = ∅ and T2.R = {(A,D)}.]

Our Approach — Seeding (III)

QT Enumeration Algorithm

[Figure: query graph q (θ = 2) with edges e1 to e6, and the enumeration of its spanning trees; steps are marked "go down" or "alternate-reorder", with mismatch annotations on the replaced edges.]

 #    T           T.R
 1    e1 e2 e3    { }
 2    e1 e2 e4    {e3}
 3    e1 e2 e5    {e3, e4}
 4    e1 e4 e3    {e2}
 5    e1 e4 e6    {e2, e3}
 6    e1 e5 e3    {e2, e4}
 7    e4 e3 e2    {e1}
 8    e4 e3 e5    {e1, e2}
 9    e4 e5 e2    {e1, e3}
 10   e6 e2 e3    {e1, e4}
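One way to reproduce the ten (T, T.R) pairs of the example is sketched below. It is an interpretation, not the slide's algorithm: at each step the candidate edges that link a covered vertex to an uncovered one are tried in the total order, and every smaller-order candidate that is passed over is added to the exclusion set, as long as the set stays within θ. The edge-to-endpoint layout and the head vertex are assumptions chosen to be consistent with the listed trees.

```python
def enumerate_qt(edges, head, theta):
    """List (tree, exclusion set) pairs using at most theta excluded edges.

    edges: list of (name, u, v) sorted by the total order on E(q).
    head: the head vertex T[0].
    """
    vertices = {x for _, u, v in edges for x in (u, v)}
    results = []

    def extend(covered, tree, excluded):
        if covered == vertices:
            results.append((tuple(tree), frozenset(excluded)))
            return
        # Usable edges linking a covered and an uncovered vertex, in order.
        crossing = [(name, u, v) for name, u, v in edges
                    if name not in excluded and (u in covered) != (v in covered)]
        skipped = []  # smaller-order edges passed over at this step
        for name, u, v in crossing:
            if len(excluded) + len(skipped) > theta:
                break  # cannot afford more mismatched edges
            extend(covered | {u, v}, tree + [name], excluded | set(skipped))
            skipped.append(name)

    extend({head}, [], set())
    return results

# Assumed layout of the 4-vertex, 6-edge query (hypothetical endpoints).
edges = [("e1", "a", "b"), ("e2", "b", "c"), ("e3", "c", "d"),
         ("e4", "a", "d"), ("e5", "b", "d"), ("e6", "a", "c")]
qt = enumerate_qt(edges, head="a", theta=2)
```

With θ = 2 this yields ten pairs, starting with (e1 e2 e3, { }) and ending with (e6 e2 e3, {e1, e4}).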

Our Approach — Seeding (IV)

QT Enumeration Algorithm

• Correctness: Using QT to induce similarity maximal matches neither generates duplicate results nor misses valid results.
• Minimality of QT: If any spanning tree is missing from QT, the completeness of the results is no longer guaranteed under the edge exclusion semantics.
• When |E(q)| = m and |V(q)| = n:
  (1) |QSAPPER| ≥ |QT|;
  (2) |QT| = |QSAPPER| only when θ = 0 or θ = m − n + 1.

$|Q_{SAPPER}| = \sum_{T \in Q_T} \binom{m - n + 1 - |T.R|}{\theta - |T.R|}$
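The size comparison can be checked numerically on the example query with four vertices, six edges, and θ = 2. The sketch assumes the count relation reads |QSAPPER| = Σ over T in QT of C(m − n + 1 − |T.R|, θ − |T.R|), which is an interpretation of the slide, and it assumes a SAPPER seed drops exactly θ edges while staying connected. The exclusion-set sizes are those of the ten trees listed for the example.

```python
from itertools import combinations
from math import comb

def is_connected(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected."""
    vertices = set(vertices)
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adjacency[v] - seen)
    return seen == vertices

m, n, theta = 6, 4, 2

# Sizes |T.R| of the ten exclusion sets of the example.
exclusion_sizes = [0, 1, 2, 1, 2, 2, 1, 2, 2, 2]
size_qt = len(exclusion_sizes)

# Assumed count relation: each tree accounts for C(m-n+1-|T.R|, θ-|T.R|) seeds.
seeds_from_formula = sum(comb(m - n + 1 - r, theta - r) for r in exclusion_sizes)

# Brute-force count: a simple graph with 4 vertices and 6 edges is complete.
vertices = range(4)
all_edges = list(combinations(vertices, 2))
seeds_by_enumeration = sum(
    is_connected(vertices, [e for e in all_edges if e not in dropped])
    for dropped in combinations(all_edges, theta)
)
```

Both counts give 15 seeds for SAPPER against 10 spanning trees in QT.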

Our Approach — Searching (I)

Effectively Storing QT
Use a DFS traversal tree to share computation cost.

[Figure: the spanning trees of QT (e1e2e3, e1e2e4, e1e2e5, e1e4e3, e1e4e6, e1e5e3, e4e3e2, e4e3e5, e4e5e2, e6e2e3) merged into a DFS traversal tree rooted at R.]

Our Approach — Searching (II)

Similarity Maximal All-Matching Algorithm Sketch
• Traverse the DFS traversal tree in a depth-first backtracking search.
• go-down: Beginning from the initial spanning tree, recursively drill down to extend the current partial match to the next spanning edge T[i] in the current spanning tree T.
• alternate: If T[i] cannot be extended based on the current partial match and we can still afford to mismatch T[i] while conforming to θ, switch from T to the alternative spanning tree T’ enumerated by replacing T[i] with T’[i].

Our Approach — Optimizations

Optimizations (I): EnumerateOnDemand Strategy
• Motivation: further reduce the number of seeds.
• Enumerate an alternative tree T’ based on the current tree T only when it is feasible to extend the current partial similarity maximal match conforming to θ (1) on the next spanning edge T[i] or (2) on the next spanning edge T’[i].

Optimizations (II): Effective Search Order
• Motivation: terminate each all-matching test as early as possible.
• Decide the search order of spanning edges in T based on the post-filtering candidate sets of each vertex in q.

Our Approach — Filtering & Ordering (I)

Neighborhood Aggregate N(v, g)
Given a set of labels ΣV = {L1, ..., Lm}, N(v, g) = (x1, ..., xm), where xi is the number of neighbors of v in g with label Li ∈ ΣV.

Neighborhood-based Filtering
Compute the candidate set C(u) for each u in q.

[Figure: a vertex u ∈ q with N(u, q) = (2, 1, 0, 2) and a vertex v ∈ G with N(v, G) = (1, 2, 2, 0), over the labels A, B, C, D.]
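The neighborhood aggregate and one possible filter built on it can be sketched as follows. The slides do not spell out the pruning condition, so the rule used here is an assumption: v remains a candidate for u if the labels agree and the per-label shortfall of v's neighborhood, summed over all labels, is at most θ. The star-shaped inputs and the label A for u and v are also assumptions, chosen to reproduce the two vectors N(u, q) and N(v, G) of the example.

```python
def neighborhood_aggregate(adjacency, labels, v, alphabet):
    """Return N(v, g) = (x1, ..., xm): per-label counts of v's neighbors."""
    counts = dict.fromkeys(alphabet, 0)
    for w in adjacency[v]:
        counts[labels[w]] += 1
    return tuple(counts[label] for label in alphabet)

def shortfall(n_u, n_v):
    """Number of neighbors of u, summed per label, that v cannot cover."""
    return sum(max(0, xu - xv) for xu, xv in zip(n_u, n_v))

def is_candidate(label_u, n_u, label_v, n_v, theta):
    """Assumed pruning rule: same label and a shortfall of at most theta."""
    return label_u == label_v and shortfall(n_u, n_v) <= theta

alphabet = ["A", "B", "C", "D"]
# Hypothetical neighborhoods reproducing the two example vectors.
q_adj = {"u": ["a1", "a2", "b1", "d1", "d2"]}
q_lab = {"u": "A", "a1": "A", "a2": "A", "b1": "B", "d1": "D", "d2": "D"}
g_adj = {"v": ["a1", "b1", "b2", "c1", "c2"]}
g_lab = {"v": "A", "a1": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
n_u = neighborhood_aggregate(q_adj, q_lab, "u", alphabet)
n_v = neighborhood_aggregate(g_adj, g_lab, "v", alphabet)
keep = is_candidate(q_lab["u"], n_u, g_lab["v"], n_v, theta=2)
```

Here v lacks one A-neighbor and two D-neighbors, so under the assumed rule it is pruned for θ = 2 and kept only from θ = 3 upward.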

Our Approach — Filtering & Ordering (II)

QI Search Ordering [VLDB’08 Shang et al.]
• Pick Head Vertex: the vertex u in q with minimum φ(u) (i.e., the number of occurrences of vertices in G with label l(u)).
• Pick Next Spanning Edge: the edge (u1, u2) with minimum φ(u1, u2) (i.e., the number of occurrences of edges in G with labels (l(u1), l(u2))), where u1 is a vertex incident to the previously picked spanning edges.

Filtering-based Search Ordering
• Pick Head Vertex: the vertex u in q with the minimum number of candidates (i.e., |C(u)|).
• Pick Next Spanning Edge: the edge (u1, u2) minimizing |C(u2)| × φ(u1, u2) / φ(u2), where u1 is a vertex incident to the previously picked spanning edges.
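The filtering-based ordering can be sketched in a few lines. The dictionary encodings, the function names, and all numbers below are hypothetical; only the two selection criteria, |C(u)| for the head vertex and |C(u2)| × φ(u1, u2) / φ(u2) for the next edge, come from the slide.

```python
def pick_head_vertex(candidates):
    """Head vertex: the query vertex with the fewest candidates |C(u)|."""
    return min(candidates, key=lambda u: len(candidates[u]))

def pick_next_edge(frontier_edges, candidates, phi_vertex, phi_edge, labels):
    """Next spanning edge (u1, u2) minimizing |C(u2)| * phi(u1, u2) / phi(u2).

    frontier_edges: edges (u1, u2) with u1 already reached and u2 not yet.
    phi_vertex: label -> number of vertices in G with that label.
    phi_edge: sorted label pair -> number of edges in G with those labels.
    """
    def score(edge):
        u1, u2 = edge
        pair = tuple(sorted((labels[u1], labels[u2])))
        return len(candidates[u2]) * phi_edge[pair] / phi_vertex[labels[u2]]
    return min(frontier_edges, key=score)

# Hypothetical statistics for a three-vertex query.
labels = {"u1": "A", "u2": "B", "u3": "C"}
candidates = {"u1": {1}, "u2": {2, 3, 4}, "u3": {5, 6}}
phi_vertex = {"A": 10, "B": 30, "C": 5}
phi_edge = {("A", "B"): 60, ("A", "C"): 20}

head = pick_head_vertex(candidates)
next_edge = pick_next_edge([("u1", "u2"), ("u1", "u3")],
                           candidates, phi_vertex, phi_edge, labels)
```

With these numbers the scores are 3 × 60 / 30 = 6 for (u1, u2) and 2 × 20 / 5 = 8 for (u1, u3), so (u1, u2) is picked even though u3 has fewer candidates.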

Experiments — Experimental Settings

Data Graphs
• GH: HPRD network (|V(GH)| = 9,460, |E(GH)| = 37,081).
• GS: default synthetic data graph.
• Other synthetic data graphs generated by varying the data graph settings.

Query Graphs
• Randomly selected subgraphs of the corresponding data graphs.

Parameter Settings (default settings in bold)

Parameter     Values
|V(G)|        5k, 10k, 20k, 40k, 80k
avg. deg(G)   4, 8, 12, 16, 20
|ΣV|          20, 50, 100, 200
|V(q)|        20, 40, 60, 80, 100
avg. deg(q)   3, 4, 5, 6
θ             1, 2, 3, 4

Experiments — # of exact all-matching tests

• |QSAPPER|: # of exact all-matching tests by SAPPER [VLDB’10].
• |QT|: # of exact all-matching tests by the EnumerateAll paradigm.
• TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.

Experiments — Total Processing Time

Similarity All-Matching
• SAPPER: generate all similarity matches.
• TSpan+: run TSpan first and then generate all similarity matches based on the similarity maximal matches.

Similarity Maximal All-Matching
• NaïveTSpan: similarity maximal all-matching with no computation sharing.
• TSpan: similarity maximal all-matching with computation sharing.

Experiments — Total Processing Time

Enumeration Paradigms
• PrecTSpan: similarity maximal all-matching by EnumerateAll.
• TSpan: similarity maximal all-matching by EnumerateOnDemand.

Filtering & Ordering
• TSpanQI: TSpan algorithm with QI search ordering.
• TSpanNF: TSpan algorithm with no filtering technique.

Experiments — Large-scale Data Graphs

TSpan on Large-scale Datasets

Conclusions

• Tree-based Spanning Search Paradigm
• EnumerateOnDemand Strategy
• Filtering-based Search Ordering

                          SAPPER           TSpan
# of all-matching tests   mC               significantly less
each all-matching test    graph to graph   tree to graph
computation-sharing       no               yes
similarity results        non-maximal      maximal

Thank You! Any Questions?
