Computer Science and Engineering
TreeSpan: Efficiently Computing Similarity All-Matching
Gaoping Zhu#, Xuemin Lin#, Ke Zhu#, Wenjie Zhang#, Jeffrey Xu Yu†
# The University of New South Wales
† The Chinese University of Hong Kong
Outline
• Introduction
• State-of-the-Art
• Our Approach
• Experiments
• Conclusions
Introduction — Graph Data
• Chem-informatics: Chemical Compounds (small size)
• Bio-informatics: PPI Networks (medium size)
• Internet: World Wide Web (large size)
Introduction — Exact All-Matching (I)
• Exact All-Matching: Enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.
• Applications
  • Query biological patterns in PPI networks.
  • Detect suspicious bugs in software programs.
[Figure: a query graph q and a data graph G over vertex labels A, B, C, D, together with the exact matches of q in G.]
Introduction — Exact All-Matching (II)
• Dilemma of Exact All-Matching
  • If q is issued by a user for exploratory purposes …
  • If G is noisy due to imprecise data collection …
  • No exact matches can be found!
• Potential Solutions
  • Modify q/G and run exact all-matching again and again.
  • Ask the system to return approximate results (i.e., similarity all-matching).
[Figure: a data graph G and a query graph q' over vertex labels A, B, C, D.]
SAPPER [VLDB’10 Zhang et al] (I)
• Similarity All-Matching: Given a query graph q, a data graph G and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G missing at most θ edges in q).
• Framework
  • Enumerate a set of seeds QSAPPER (i.e., all connected subgraphs q’ of q missing θ edges in q).
  • Run exact all-matching on each seed q’ to obtain exact matches.
  • Induce similarity matches based on the exact matches of the seeds.
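The seeding step of this framework can be sketched in a few lines of Python. This is a minimal illustration and not SAPPER's actual implementation: the names `is_connected` and `sapper_style_seeds` are hypothetical, and the sketch assumes that a seed keeps every vertex of q and drops exactly θ edges.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def sapper_style_seeds(vertices, edges, theta):
    """Enumerate connected subgraphs of q obtained by dropping exactly theta edges.

    Assumption (not stated in the slides): every seed spans all vertices of q.
    """
    edges = list(edges)
    seeds = []
    for removed in combinations(edges, theta):
        removed = set(removed)
        kept = [e for e in edges if e not in removed]
        if is_connected(vertices, kept):
            seeds.append(kept)
    return seeds


# Hypothetical query: the complete graph on four vertices (6 edges).
k4_vertices = ["A", "B", "C", "D"]
k4_edges = list(combinations(k4_vertices, 2))
print(len(sapper_style_seeds(k4_vertices, k4_edges, 2)))  # 15
```

Each returned seed would then cost one exact all-matching test, which is why the number of seeds matters.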
SAPPER [VLDB’10 Zhang et al] (II)
Cost Model
$C_{total} = C_{seeding} + \sum_{q' \in Q_{SAPPER}} C_{all\text{-}matching}(q', G) + C_{inducing}$
|QSAPPER| = # of exact all-matching tests
[Figure: a data graph G (vertices v1 to v5, labels A, A, B, C, D), a query graph q (θ = 1, vertices u1 to u4, labels A, A, B, C), the four seeds q'1, q'2, q'3, q'4 of q, and two matches:]
F1 = {u1→v1, u2→v2, u3→v3, u4→v4}
F2 = {u1→v2, u2→v1, u3→v3, u4→v4}
Our Approach — Overview (I)
• Tree-based Spanning Search Paradigm (TSpan): Enumerate a set of seeds QT (i.e., spanning trees of q that cover all connected subgraphs q’ of q missing θ edges in q).
• Primary Contribution
  • Reduce the # of exact all-matching tests (i.e., the # of seeds).
  • Reduce the complexity of each exact all-matching test from graph-to-graph to tree-to-graph.
[Figure: a query graph q (θ = 2) over vertex labels A, B, C, D, three of its SAPPER seeds (with more SAPPER seeds not shown), and one spanning tree of q. Captions: 3 all-matching tests on connected subgraphs of q; 1 all-matching test on a spanning tree of q.]
Our Approach — Overview (II)
Generating Similarity Maximal Matches
• Generating only similarity maximal matches can reduce the # of exact all-matching tests.
[Figure: a data graph G (vertices v1 to v5, labels A, A, B, C, D), a query graph q (θ = 1, vertices u1 to u4, labels A, A, B, C) and its seeds, and the similarity maximal matches:]
F1 = {u1→v1, u2→v2, u3→v3, u4→v4}
F2 = {u1→v2, u2→v1, u3→v3, u4→v4}
Our Approach — Problem Statement
• Similarity Maximal All-Matching: Given a query graph q, a data graph G and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G conforming to θ.
Our Approach — Seeding (I)
• PRIM Order on Spanning Trees
  • Similar to the basic idea of a minimum spanning tree.
  • Given a total order on E(q), a spanning tree T = {T[0], T[1], …, T[|V(q)| − 1]} of q conforms to PRIM order (T[0] is the head vertex) if and only if each spanning edge T[i] has the smallest order among the edges of E(q) − {T[1], ..., T[i − 1]} that connect to {T[0], T[1], ..., T[i − 1]}.
[Figure: a query graph q over vertex labels A, B, C, D with edges e1 to e6, and a spanning tree T of q consisting of e1, e2, e3.]
Our Approach — Seeding (II)
• Avoid Duplicate Results
  • Two spanning trees of q may induce duplicate similarity maximal matches.
  • Associate an edge exclusion set T.R with each T in QT.
  • T.R is a set of edges in E(q) − E(T) that are enforced to be mismatched in the similarity maximal matches induced by T.
[Figure: a query graph q (θ = 2) over vertex labels A, B, C, D, a data graph G, and two spanning trees T1 and T2 of q with T1.R = ∅ and T2.R = {(A, D)}.]
Our Approach — Seeding (III)
QT Enumeration Algorithm
[Figure: a query graph q (θ = 2) over vertex labels A, B, C, D with edges e1 to e6, and the enumeration of QT through "go down" and "alternate-reorder" steps. Edges marked X are excluded at the corresponding step: T1[1] = e1, T1[2] = e2, T1[3] = e3, T2[3] = e4, T4[3] = e3, T4[2] = e4, T7[3] = e2, T7[2] = e3, T7[1] = e4.]

 #    T           T.R
 1    e1 e2 e3    { }
 2    e1 e2 e4    {e3}
 3    e1 e2 e5    {e3, e4}
 4    e1 e4 e3    {e2}
 5    e1 e4 e6    {e2, e3}
 6    e1 e5 e3    {e2, e4}
 7    e4 e3 e2    {e1}
 8    e4 e3 e5    {e1, e2}
 9    e4 e5 e2    {e1, e3}
10    e6 e2 e3    {e1, e4}
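One plausible reading of the "go down" and "alternate" steps is the recursive sketch below. It is an illustration only, not the authors' algorithm: the function name `enumerate_seeds`, the vertex names h, x, y, z, and the assignment of e1 to e6 to endpoints are all assumptions, chosen so that the output agrees with the table of T and T.R on this slide (the original figure does not survive in the transcript).

```python
def enumerate_seeds(head, edges, n_vertices, theta):
    """Enumerate (spanning tree, exclusion set) pairs in a Prim-like order.

    edges: list of (name, u, v), already sorted by the total order on E(q).
    At each step the smallest-order usable edge is either taken ("go down")
    or added to the exclusion set ("alternate"), while at most theta edges
    are excluded in total.
    """
    results = []

    def recurse(covered, tree, excluded):
        if len(covered) == n_vertices:
            results.append((tuple(tree), frozenset(excluded)))
            return
        # Non-excluded edges linking the covered part to an uncovered vertex.
        frontier = [
            (name, u, v)
            for name, u, v in edges
            if name not in excluded and ((u in covered) != (v in covered))
        ]
        if not frontier:
            return  # the exclusions disconnected q: no spanning tree here
        name, u, v = frontier[0]
        new_vertex = v if u in covered else u
        # go down: accept the smallest-order edge as the next spanning edge
        recurse(covered | {new_vertex}, tree + [name], excluded)
        # alternate: exclude that edge if the budget theta allows it
        if len(excluded) < theta:
            recurse(covered, tree, excluded | {name})

    recurse(frozenset([head]), [], frozenset())
    return results


# Hypothetical labelling of the complete 4-vertex query graph.
q_edges = [
    ("e1", "h", "x"), ("e2", "x", "y"), ("e3", "y", "z"),
    ("e4", "h", "z"), ("e5", "x", "z"), ("e6", "h", "y"),
]
for tree, excluded in enumerate_seeds("h", q_edges, 4, 2):
    print(" ".join(tree), sorted(excluded))
```

Under these assumptions the sketch yields ten trees, in the same order and with the same exclusion sets as the table.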
Our Approach — Seeding (IV)
QT Enumeration Algorithm
• Correctness: Using QT to induce similarity maximal matches neither generates duplicate results nor misses valid results.
• Minimality of QT: If any spanning tree is removed from QT, the completeness of the results is no longer guaranteed under the edge exclusion semantics.
• When |E(q)| = m and |V(q)| = n:
  (1) |QSAPPER| ≥ |QT|;
  (2) |QT| = |QSAPPER| only when θ = 0 or θ = m − n + 1.
  $|Q_{SAPPER}| = \sum_{T \in Q_T} \binom{m - n + 1 - |T.R|}{\theta - |T.R|}$
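A small arithmetic check, reading the counting relation on this slide as |QSAPPER| = Σ over T in QT of C(m − n + 1 − |T.R|, θ − |T.R|); that reading is an interpretation of the formula, not a quotation. The sizes of the exclusion sets are taken from the ten-row table of T and T.R for the query with 4 vertices, 6 edges and θ = 2.

```python
from math import comb

m, n, theta = 6, 4, 2  # |E(q)|, |V(q)| and the threshold of the example query

# |T.R| for trees 1..10 in the table of T and T.R.
exclusion_sizes = [0, 1, 2, 1, 2, 2, 1, 2, 2, 2]

q_t = len(exclusion_sizes)
q_sapper = sum(comb(m - n + 1 - r, theta - r) for r in exclusion_sizes)

print(q_t, q_sapper)  # 10 15
```

For this query every way of dropping 2 of the 6 edges leaves a connected graph, so the 15 obtained here equals C(6, 2), and |QSAPPER| ≥ |QT| holds with 15 ≥ 10.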
Our Approach — Searching (I)
• Effectively Storing QT: Use a DFS traversal tree to share computation cost.
[Figure: the ten spanning trees of QT (e1e2e3, e1e2e4, e1e2e5, e1e4e3, e1e4e6, e1e5e3, e4e3e2, e4e3e5, e4e5e2, e6e2e3) merged into a DFS traversal tree rooted at R, with first-level nodes e1, e4, e6.]
Our Approach — Searching (II)
• Similarity Maximal All-Matching Algorithm Sketch: Traverse the DFS traversal tree in a depth-first backtracking search fashion.
  • go-down: Beginning from the initial spanning tree, recursively drill down to extend the current partial match to the next spanning edge T[i] in the current spanning tree T.
  • alternate: If T[i] cannot be extended based on the current partial match and we can still afford to mismatch T[i] while conforming to θ, switch from T to the alternative spanning tree T’ enumerated by replacing T[i] with T’[i].
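The go-down step alone, applied to a single spanning tree, amounts to a backtracking tree-to-graph match. The sketch below shows only that part, under stated assumptions: it omits the alternate step and the shared traversal tree, it matches vertex labels and spanning edges only, and the name `tree_matches` and the small graphs are hypothetical.

```python
def tree_matches(head, tree_edges, q_label, g_adj, g_label):
    """Enumerate injective label-preserving embeddings of a spanning tree.

    tree_edges: list of (parent, child) query vertices in search order; the
    parent of each edge is already matched when the edge is reached.
    g_adj: adjacency sets of the data graph G.
    """
    results = []

    def go_down(i, mapping, used):
        if i == len(tree_edges):
            results.append(dict(mapping))
            return
        parent, child = tree_edges[i]
        # Extend the partial match along spanning edge T[i].
        for w in sorted(g_adj[mapping[parent]]):
            if w not in used and g_label[w] == q_label[child]:
                mapping[child] = w
                used.add(w)
                go_down(i + 1, mapping, used)
                used.remove(w)  # backtrack
                del mapping[child]

    for v in sorted(g_adj):
        if g_label[v] == q_label[head]:
            go_down(0, {head: v}, {v})
    return results


# Hypothetical query path u1(A) - u2(B) - u3(C) and a small data graph.
q_label = {"u1": "A", "u2": "B", "u3": "C"}
tree_edges = [("u1", "u2"), ("u2", "u3")]
g_label = {"v1": "A", "v2": "B", "v3": "C", "v4": "B"}
g_adj = {"v1": {"v2", "v4"}, "v2": {"v1", "v3"}, "v3": {"v2"}, "v4": {"v1"}}
print(tree_matches("u1", tree_edges, q_label, g_adj, g_label))
```

In the example the branch u2→v4 fails at the second spanning edge; in the full algorithm that failure is the point where an alternate step could be tried if θ still allows a mismatch.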
Our Approach — Optimizations
• Optimizations (I): EnumerateOnDemand Strategy
  • Motivation: further reduce the number of seeds.
  • Enumerate an alternative tree T’ based on the current tree T only when it is feasible to extend the current partial similarity maximal match conforming to θ (1) on the next spanning edge T[i] or (2) on the alternative spanning edge T’[i].
• Optimizations (II): Effective Search Order
  • Motivation: terminate each all-matching test as early as possible.
  • Decide the search order of the spanning edges in T based on the post-filtering candidate sets of each vertex in q.
Our Approach — Filtering & Ordering (I)
• Neighborhood Aggregate N(v, g): Given a set of labels ΣV = {L1, ..., Lm}, N(v, g) = (x1, ..., xm) where xi is the number of neighbors of v in g with label Li ∈ ΣV.
• Neighborhood-based Filtering: Compute the candidate set C(u) for each u in q.
[Figure: a vertex u ∈ q labelled A with neighbors labelled A, A, B, D, D, and a vertex v ∈ G labelled A with neighbors labelled A, B, B, C, C. With the labels ordered (A, B, C, D):]
N(u, q) = (2, 1, 0, 2)
N(v, G) = (1, 2, 2, 0)
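The aggregate and a filter built on it can be sketched as follows. The slides do not spell out the filtering condition, so the rule used here is an assumption: v stays a candidate for u when the labels agree and the total shortfall of N(v, G) against N(u, q) is at most θ, since each mismatched query edge can account for at most one missing neighbor. All function names are hypothetical; the two star-shaped graphs reproduce the vectors N(u, q) and N(v, G) given above.

```python
from collections import Counter


def neighborhood_aggregate(v, adj, label, alphabet):
    """N(v, g): number of neighbors of v carrying each label of the alphabet."""
    counts = Counter(label[w] for w in adj[v])
    return tuple(counts.get(lab, 0) for lab in alphabet)


def shortfall(n_u, n_v):
    """How many neighbors required by n_u are missing in n_v, over all labels."""
    return sum(max(0, a - b) for a, b in zip(n_u, n_v))


def candidates(u, q_adj, q_label, g_adj, g_label, alphabet, theta=0):
    """Assumed filter: same label and neighborhood shortfall of at most theta."""
    n_u = neighborhood_aggregate(u, q_adj, q_label, alphabet)
    return [
        v
        for v in g_adj
        if g_label[v] == q_label[u]
        and shortfall(n_u, neighborhood_aggregate(v, g_adj, g_label, alphabet)) <= theta
    ]


alphabet = ("A", "B", "C", "D")

# Star around u in q and star around v in G (leaf names are hypothetical).
q_label = {"u": "A", "qa1": "A", "qa2": "A", "qb": "B", "qd1": "D", "qd2": "D"}
q_adj = {"u": {"qa1", "qa2", "qb", "qd1", "qd2"}}
q_adj.update({leaf: {"u"} for leaf in q_label if leaf != "u"})

g_label = {"v": "A", "ga": "A", "gb1": "B", "gb2": "B", "gc1": "C", "gc2": "C"}
g_adj = {"v": {"ga", "gb1", "gb2", "gc1", "gc2"}}
g_adj.update({leaf: {"v"} for leaf in g_label if leaf != "v"})

n_u = neighborhood_aggregate("u", q_adj, q_label, alphabet)
n_v = neighborhood_aggregate("v", g_adj, g_label, alphabet)
print(n_u, n_v, shortfall(n_u, n_v))  # (2, 1, 0, 2) (1, 2, 2, 0) 3
```

With θ ≤ 2 the vertex v is filtered out of C(u) under this rule, because u needs one more A neighbor and two more D neighbors than v can offer.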
Our Approach — Filtering & Ordering (II)
• QI Search Ordering [VLDB’08 Shang et al.]
  • Pick Head Vertex: The vertex u in q with minimum φ(u) (i.e., the number of occurrences of vertices in G with label l(u)).
  • Pick Next Spanning Edge: The edge (u1, u2) with minimum φ(u1, u2) (i.e., the number of occurrences of edges in G with labels (l(u1), l(u2))), where u1 is a vertex incident to a previously picked spanning edge.
• Filtering-based Search Ordering
  • Pick Head Vertex: The vertex u in q with the minimum number of candidates (i.e., |C(u)|).
  • Pick Next Spanning Edge: The edge (u1, u2) minimizing |C(u2)| × φ(u1, u2) / φ(u2), where u1 is a vertex incident to a previously picked spanning edge.
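The filtering-based rule can be illustrated with a short sketch. The function names, the toy graphs and the candidate sets below are hypothetical; φ is computed as plain label and label-pair frequencies in G, and the sketch assumes that every query label occurs in G so that φ(u2) is never zero.

```python
from collections import Counter


def label_frequencies(g_label, g_edges):
    """phi for vertex labels and for unordered label pairs of edges in G."""
    phi_v = Counter(g_label.values())
    phi_e = Counter(tuple(sorted((g_label[a], g_label[b]))) for a, b in g_edges)
    return phi_v, phi_e


def pick_head(cands):
    """Head vertex: the query vertex with the fewest candidates |C(u)|."""
    return min(cands, key=lambda u: len(cands[u]))


def pick_next_edge(covered, q_edges, q_label, cands, phi_v, phi_e):
    """Next spanning edge (u1, u2): u1 covered, u2 not, minimizing
    |C(u2)| * phi(u1, u2) / phi(u2)."""
    best, best_score = None, None
    for a, b in q_edges:
        if (a in covered) == (b in covered):
            continue  # both ends covered or both uncovered: not a frontier edge
        u1, u2 = (a, b) if a in covered else (b, a)
        pair = tuple(sorted((q_label[u1], q_label[u2])))
        score = len(cands[u2]) * phi_e[pair] / phi_v[q_label[u2]]
        if best_score is None or score < best_score:
            best, best_score = (u1, u2), score
    return best, best_score


# Hypothetical data graph, query and post-filtering candidate sets.
g_label = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "B", "g6": "C", "g7": "C"}
g_edges = [("g1", "g3"), ("g1", "g4"), ("g2", "g5"), ("g1", "g6"), ("g2", "g7")]
q_label = {"u1": "A", "u2": "B", "u3": "C"}
q_edges = [("u1", "u2"), ("u1", "u3")]
cands = {"u1": {"g1"}, "u2": {"g3", "g4", "g5"}, "u3": {"g6", "g7"}}

phi_v, phi_e = label_frequencies(g_label, g_edges)
head = pick_head(cands)
print(head, pick_next_edge({head}, q_edges, q_label, cands, phi_v, phi_e))
```

Here u1 becomes the head vertex, and the edge (u1, u3) scores 2 × 2 / 2 = 2 against 3 × 3 / 3 = 3 for (u1, u2), so it is searched first.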
Experiments — Experimental Settings
• Data Graphs
  • GH: HPRD network (|V(GH)| = 9,460, |E(GH)| = 37,081).
  • GS: default synthetic data graph.
  • Other synthetic data graphs generated by varying the data graph settings.
• Query Graphs: Randomly selected subgraphs of the corresponding data graphs.
• Parameter Settings (default settings in bold)

Parameter      Values
|V(G)|         5k, 10k, 20k, 40k, 80k
avg. deg(G)    4, 8, 12, 16, 20
|ΣV|           20, 50, 100, 200
|V(q)|         20, 40, 60, 80, 100
avg. deg(q)    3, 4, 5, 6
θ              1, 2, 3, 4
Experiments — # of exact all-matching tests
• |QSAPPER|: # of exact all-matching tests by SAPPER [VLDB’10].
• |QT|: # of exact all-matching tests by the EnumerateAll paradigm.
• TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.
Experiments — Total Processing Time
• Similarity All-Matching
  • SAPPER: Generate all similarity matches.
  • TSpan+: Run TSpan first and then generate all similarity matches based on the similarity maximal matches.
• Similarity Maximal All-Matching
  • NaïveTSpan: Similarity maximal all-matching with no computation sharing.
  • TSpan: Similarity maximal all-matching with computation sharing.
Experiments — Total Processing Time
• Enumeration Paradigms
  • PrecTSpan: Similarity maximal all-matching by EnumerateAll.
  • TSpan: Similarity maximal all-matching by EnumerateOnDemand.
• Filtering & Ordering
  • TSpanQI: TSpan algorithm with QI search ordering.
  • TSpanNF: TSpan algorithm with no filtering technique.
Experiments — Large-scale Data Graphs
• TSpan on Large-scale Datasets
Conclusions
• Tree-based Spanning Search Paradigm
• EnumerateOnDemand Strategy
• Filtering-based Search Ordering

                           SAPPER            TSpan
# of all-matching tests                      significantly less
each all-matching test     graph to graph    tree to graph
computation-sharing        no                yes
similarity results         non-maximal       maximal
mC
Thank You! Any Questions?