![Page 1: Computer Science and Engineering TreeSpan Efficiently Computing Similarity All-Matching Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey](https://reader035.vdocuments.mx/reader035/viewer/2022062409/5697bf9c1a28abf838c93664/html5/thumbnails/1.jpg)
Computer Science and Engineering
TreeSpan: Efficiently Computing Similarity All-Matching
Gaoping Zhu#, Xuemin Lin#, Ke Zhu#, Wenjie Zhang#, Jeffrey Xu Yu†
# The University of New South Wales, † The Chinese University of Hong Kong
Outline
• Introduction
• State-of-the-Art
• Our Approach
• Experiments
• Conclusions
Introduction — Graph Data
Chem-informatics Chemical Compounds (small size)
Bio-informatics PPI Networks (medium size)
Internet World Wide Web (large size)
Introduction — Exact All-Matching (I)
Exact All-Matching
• Enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.
Applications
• Query biological patterns in PPI networks.
• Detect suspicious bugs in software programs.
[Figure: a query graph q, a data graph G, and the exact matches of q in G; vertices are labeled A, B, C, D.]
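As a rough illustration of exact all-matching, the sketch below enumerates label-preserving subgraph-isomorphic embeddings of q in G by plain backtracking. It is not the matching engine used in the talk; the function name `exact_all_matching`, the adjacency-dictionary representation, and the toy graphs are all assumptions made for this example.

```python
from typing import Dict, Hashable, List, Set

Adj = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def exact_all_matching(q_adj: Adj, q_label: Dict, g_adj: Adj, g_label: Dict) -> List[Dict]:
    """Return every injective, label-preserving mapping of q's vertices into G
    such that each edge of q is mapped onto an edge of G."""
    order = list(q_adj)  # fixed (hypothetical) matching order over q's vertices
    matches: List[Dict] = []

    def extend(i: int, mapping: Dict, used: Set) -> None:
        if i == len(order):
            matches.append(dict(mapping))
            return
        u = order[i]
        for v in g_adj:
            if v in used or g_label[v] != q_label[u]:
                continue
            # every already-mapped neighbor of u must be adjacent to v in G
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                used.add(v)
                extend(i + 1, mapping, used)
                del mapping[u]
                used.remove(v)

    extend(0, {}, set())
    return matches


# Toy example: q is the path A - B - C; G contains two A-vertices attached to one B.
q_adj = {"u0": {"u1"}, "u1": {"u0", "u2"}, "u2": {"u1"}}
q_label = {"u0": "A", "u1": "B", "u2": "C"}
g_adj = {"v0": {"v1"}, "v1": {"v0", "v2", "v3"}, "v2": {"v1"}, "v3": {"v1"}}
g_label = {"v0": "A", "v1": "B", "v2": "C", "v3": "A"}

matches = exact_all_matching(q_adj, q_label, g_adj, g_label)
print(len(matches))  # 2
```

In the toy example, u0 can be mapped to either A-labeled vertex of G, which gives the two exact matches.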
Introduction — Exact All-Matching (II)
Dilemma of Exact All-Matching
• If q is issued by a user for exploratory purposes …
• If G is noisy due to imprecise data collection …
• No exact matches can be found!
Potential Solutions
• Modify q/G and run exact all-matching again and again.
• Ask the system to return approximate results (i.e., similarity all-matching).
[Figure: a data graph G and a modified query graph q'.]
SAPPER [VLDB’10 Zhang et al] (I)
Similarity All-Matching
• Given a query graph q, a data graph G and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G missing at most θ edges in q).
Framework
• Enumerate a set of seeds QSAPPER (i.e., all connected subgraphs q’ of q missing θ edges in q).
• Run exact all-matching on each seed q’ to obtain exact matches.
• Induce similarity matches based on the exact matches of the seeds.
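The definition of a similarity match can be made concrete with a small check: given a vertex mapping from q into G, count the query edges that have no counterpart in G and require that at most θ are missing while the matched part stays connected. This is only a sketch of the definition, not SAPPER's procedure; the helper names and the toy graphs are assumptions.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def is_connected(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    """Depth-first connectivity test on an undirected graph."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vertices


def missing_edges(q_edges: List[Edge], g_edges: List[Edge], mapping: Dict) -> List[Edge]:
    """Query edges whose images under `mapping` are not edges of G."""
    g = {frozenset(e) for e in g_edges}
    return [(a, b) for a, b in q_edges if frozenset((mapping[a], mapping[b])) not in g]


def is_similarity_match(q_vertices, q_edges, g_edges, mapping, theta: int) -> bool:
    """True if at most `theta` query edges are missing and the matched edges
    still connect all query vertices."""
    missing = set(missing_edges(q_edges, g_edges, mapping))
    matched = [e for e in q_edges if e not in missing]
    return len(missing) <= theta and is_connected(q_vertices, matched)


# Toy example: q is a 4-cycle, G is a path on four vertices.
q_vertices = ["u1", "u2", "u3", "u4"]
q_edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u1")]
g_edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
mapping = {"u1": "v1", "u2": "v2", "u3": "v3", "u4": "v4"}

print(missing_edges(q_edges, g_edges, mapping))                        # [('u4', 'u1')]
print(is_similarity_match(q_vertices, q_edges, g_edges, mapping, 1))  # True
print(is_similarity_match(q_vertices, q_edges, g_edges, mapping, 0))  # False
```

Vertex labels are omitted here to keep the check short; a full check would also compare labels as in exact matching.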
SAPPER [VLDB’10 Zhang et al] (II)
Cost Model
$C_{total} = C_{seeding} + \sum_{q' \in Q_{SAPPER}} C_{all\text{-}matching}(q', G) + C_{inducing}$
|QSAPPER| = # of exact all-matching tests
[Figure: data graph G with vertices v1–v5, query graph q (θ = 1) with vertices u1–u4, the SAPPER seeds q'1–q'4, and two matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]
Our Approach — Overview (I)
Tree-based Spanning Search Paradigm — TSpan
• Enumerate a set of seeds QT (i.e., spanning trees of q that cover all connected subgraphs q’ of q missing θ edges in q).
Primary Contribution
• Reduce the # of exact all-matching tests (i.e., the # of seeds).
• Reduce the complexity of each exact all-matching test from graph-to-graph to tree-to-graph.
[Figure: for q (θ = 2), SAPPER needs more seeds: 3 all-matching tests on connected subgraphs of q, versus 1 all-matching test on a spanning tree of q.]
Our Approach — Overview (II)
Generating Similarity Maximal Matches
Generating only similarity maximal matches can reduce the # of exact all-matching tests.
[Figure: data graph G with vertices v1–v5, query graph q (θ = 1) with vertices u1–u4, and the similarity maximal matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]
Our Approach — Problem Statement
Similarity Maximal All-Matching
• Given a query graph q, a data graph G and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G conforming to θ.
Our Approach — Seeding (I)
PRIM Order on Spanning Trees
• Similar to the basic idea of minimum spanning tree.
• Given a total order on E(q), a spanning tree T = {T[0], T[1], …, T[|V(q)| − 1]} of q conforms to PRIM order (T[0] is the head vertex) if and only if each spanning edge T[i] has the smallest order among the edges in E(q) – {T[1], ..., T[i − 1]} that connect to {T[0], T[1], ..., T[i − 1]}.
[Figure: query graph q with edges e1–e6 and its spanning tree T with spanning edges e1, e2, e3.]
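A minimal sketch of building a spanning tree in PRIM order follows: starting from the head vertex, repeatedly take the lowest-ordered edge that attaches a new vertex to the tree built so far. The edge endpoints below (a 4-vertex complete graph with head `h`) are a hypothetical labeling chosen for illustration, since the slide does not state which vertices each of e1–e6 joins; the function name is also an assumption.

```python
from typing import Dict, Hashable, List, Tuple


def prim_order_tree(edges: Dict[str, Tuple[Hashable, Hashable]], head: Hashable) -> List[str]:
    """Return the spanning edges T[1], T[2], ... in PRIM order.

    `edges` maps an edge name to its endpoints; the dictionary's insertion
    order is taken as the total order on E(q)."""
    vertices = {x for e in edges.values() for x in e}
    in_tree = {head}
    tree: List[str] = []
    while in_tree != vertices:
        for name, (a, b) in edges.items():
            # an edge qualifies if exactly one endpoint is already in the tree
            if (a in in_tree) != (b in in_tree):
                tree.append(name)
                in_tree |= {a, b}
                break
        else:
            raise ValueError("query graph is not connected")
    return tree


# Hypothetical endpoints for e1..e6 on vertices h (head), a, b, c.
edges = {
    "e1": ("h", "a"),
    "e2": ("a", "c"),
    "e3": ("b", "c"),
    "e4": ("h", "b"),
    "e5": ("a", "b"),
    "e6": ("h", "c"),
}

tree = prim_order_tree(edges, "h")
print(tree)  # ['e1', 'e2', 'e3']
```

Edges whose endpoints are both already in the tree are skipped because they would close a cycle.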
Our Approach — Seeding (II)
Avoid Duplicate Results
• Two spanning trees of q may induce duplicate similarity maximal matches.
• Associate an edge exclusion set T.R with each T in QT.
• T.R is a set of edges in E(q) – E(T) that are enforced to be mismatched in the similarity maximal matches induced by T.
[Figure: query graph q (θ = 2), data graph G, and two spanning trees T1 and T2 of q with T1.R = ∅ and T2.R = { (A,D) }.]
Our Approach — Seeding (III)
QT Enumeration Algorithm
[Figure: query graph q (θ = 2) with edges e1–e6 and the enumeration of its spanning trees through "go down" and "alternate-reorder" steps. Alternate steps are marked with the mismatched spanning edge: T1[1] = e1, T1[2] = e2, T1[3] = e3, T2[3] = e4, T4[3] = e3, T4[2] = e4, T7[3] = e2, T7[2] = e3, T7[1] = e4.]

| # | T | T.R |
| --- | --- | --- |
| 1 | e1e2e3 | { } |
| 2 | e1e2e4 | {e3} |
| 3 | e1e2e5 | {e3, e4} |
| 4 | e1e4e3 | {e2} |
| 5 | e1e4e6 | {e2, e3} |
| 6 | e1e5e3 | {e2, e4} |
| 7 | e4e3e2 | {e1} |
| 8 | e4e3e5 | {e1, e2} |
| 9 | e4e5e2 | {e1, e3} |
| 10 | e6e2e3 | {e1, e4} |
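One way to read this enumeration is as a recursive variant of the PRIM-order construction: at every step, either take the lowest-ordered edge that attaches a new vertex, or exclude the first k such edges (adding them to T.R) and take the next one, as long as |T.R| stays within θ. The sketch below follows that reading; it is an interpretation rather than the authors' algorithm, and the edge endpoints are a hypothetical labeling of e1–e6. With that labeling it yields the ten (T, T.R) pairs listed on this slide.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple


def enumerate_qt(edges: Dict[str, Tuple[Hashable, Hashable]], head: Hashable, theta: int):
    """Enumerate (spanning tree, exclusion set) pairs with |T.R| <= theta.

    `edges` maps edge names to endpoints; insertion order is the total order."""
    vertices = {x for e in edges.values() for x in e}
    results: List[Tuple[List[str], List[str]]] = []

    def grow(in_tree: Set[Hashable], tree: List[str], excluded: FrozenSet[str]) -> None:
        if in_tree == vertices:
            results.append((tree, sorted(excluded)))
            return
        # non-excluded edges that attach a new vertex, in total order
        candidates = [
            name for name, (a, b) in edges.items()
            if name not in excluded and (a in in_tree) != (b in in_tree)
        ]
        for k, name in enumerate(candidates):
            if len(excluded) + k > theta:
                break  # cannot afford to mismatch k more edges
            a, b = edges[name]
            grow(in_tree | {a, b}, tree + [name], excluded | set(candidates[:k]))

    grow({head}, [], frozenset())
    return results


# Hypothetical endpoints for e1..e6 on vertices h (head), a, b, c.
edges = {
    "e1": ("h", "a"),
    "e2": ("a", "c"),
    "e3": ("b", "c"),
    "e4": ("h", "b"),
    "e5": ("a", "b"),
    "e6": ("h", "c"),
}

qt = enumerate_qt(edges, "h", theta=2)
for tree, excl in qt:
    print("".join(tree), excl)
print(len(qt))  # 10
```

Taking the first candidate corresponds to a "go down" step, and skipping candidates corresponds to an "alternate" step that enlarges T.R.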
Our Approach — Seeding (IV)
QT Enumeration Algorithm
Correctness: Using QT to induce similarity maximal matches neither generates duplicate results nor misses valid results.
Minimality of QT: If any spanning tree in QT is omitted, the completeness of the results can no longer be guaranteed under the edge exclusion semantics.
When |E(q)| = m and |V(q)| = n:
$|Q_{SAPPER}| = \sum_{T \in Q_T} \binom{m - n + 1 - |T.R|}{\theta - |T.R|}$
(1) |QSAPPER| ≥ |QT|;
(2) |QT| = |QSAPPER| only when θ = 0 or θ = m − n + 1.
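The relation between |QSAPPER| and |QT| can be sanity-checked on the running example (m = 6, n = 4, θ = 2). The sketch assumes the count takes the form |QSAPPER| = Σ over T in QT of C(m − n + 1 − |T.R|, θ − |T.R|), which is one reading of the formula on this slide, and compares it with a brute-force count of connected subgraphs of the 4-vertex complete graph that miss exactly θ edges. The |T.R| values are taken from the ten-row table of the Seeding (III) slide; everything else is a hypothetical helper.

```python
from itertools import combinations
from math import comb


def connected(vertices, edges) -> bool:
    """Depth-first connectivity test on an undirected graph."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def count_sapper_seeds(vertices, edges, theta: int) -> int:
    """Brute force: connected spanning subgraphs missing exactly theta edges."""
    return sum(
        1
        for removed in combinations(edges, theta)
        if connected(vertices, [e for e in edges if e not in removed])
    )


vertices = ["A", "B", "C", "D"]
edges = list(combinations(vertices, 2))  # complete graph on 4 vertices
m, n, theta = len(edges), len(vertices), 2

r_sizes = [0, 1, 2, 1, 2, 2, 1, 2, 2, 2]  # |T.R| for the ten trees in QT
by_formula = sum(comb(m - n + 1 - r, theta - r) for r in r_sizes)
by_brute_force = count_sapper_seeds(vertices, edges, theta)

print(len(r_sizes), by_formula, by_brute_force)  # 10 15 15
```

On this example the two counts agree at 15 seeds for SAPPER against 10 spanning trees for QT, consistent with claim (1).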
Our Approach — Searching (I)
Effectively Storing QT
• Use a DFS Traversal Tree to share computation cost.
[Figure: the ten spanning trees of QT (e1e2e3, e1e2e4, e1e2e5, e1e4e3, e1e4e6, e1e5e3, e4e3e2, e4e3e5, e4e5e2, e6e2e3) merged into a DFS Traversal Tree rooted at R, in which trees with a common prefix of spanning edges share a path.]
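The sharing idea can be sketched as a prefix tree (trie) over the edge sequences of the spanning trees in QT: trees with the same leading spanning edges share the corresponding path, so a partial match along that path is computed once. The nested-dictionary representation and function names are assumptions for illustration.

```python
from typing import Dict, List


def build_traversal_tree(trees: List[List[str]]) -> Dict:
    """Merge edge sequences into a nested-dict prefix tree rooted at R."""
    root: Dict = {}
    for tree in trees:
        node = root
        for edge in tree:
            node = node.setdefault(edge, {})
    return root


def count_nodes(node: Dict) -> int:
    """Number of non-root nodes, i.e. edge extensions actually stored."""
    return sum(1 + count_nodes(child) for child in node.values())


qt = [
    ["e1", "e2", "e3"], ["e1", "e2", "e4"], ["e1", "e2", "e5"],
    ["e1", "e4", "e3"], ["e1", "e4", "e6"], ["e1", "e5", "e3"],
    ["e4", "e3", "e2"], ["e4", "e3", "e5"], ["e4", "e5", "e2"],
    ["e6", "e2", "e3"],
]

root = build_traversal_tree(qt)
print(sorted(root))                             # ['e1', 'e4', 'e6']
print(count_nodes(root), sum(map(len, qt)))     # 19 30
```

For the ten trees above, the merged tree stores 19 edge extensions instead of the 30 needed when each tree is processed separately.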
Our Approach — Searching (II)
Similarity Maximal All-Matching Algorithm Sketch
• Traverse the DFS Traversal Tree in a depth-first backtracking search.
• go-down: Beginning from the initial spanning tree, recursively drill down to extend the current partial match to the next spanning edge T[i] in the current spanning tree T.
• alternate: If T[i] cannot be extended based on the current partial match and we can still afford to mismatch T[i] while conforming to θ, switch from T to the alternative spanning tree T’ enumerated by replacing T[i] with T’[i].
Our Approach — Optimizations
Optimizations (I): EnumerateOnDemand Strategy
• Motivation: further reduce the number of seeds.
• Enumerate an alternative tree T’ based on the current tree T only when it is feasible to extend the current partial similarity maximal match conforming to θ (1) on the next spanning edge T[i] or (2) on the next spanning edge T’[i].
Optimizations (II): Effective Search Order
• Motivation: terminate all-matching tests as early as possible.
• Decide the search order of spanning edges in T based on the post-filtering candidate sets of each vertex in q.
Our Approach — Filtering & Ordering (I)
Neighborhood Aggregate N(v, g)
• Given a set of labels ΣV = {L1, ..., Lm}, N(v, g) = (x1, ..., xm) where xi is the number of neighbors of v in g with label Li ∈ ΣV.
Neighborhood-based Filtering
• Compute the candidate set C(u) for each u in q.
[Figure: a vertex u ∈ q and a vertex v ∈ G with their labeled neighbors.]
N(u, q) = (2, 1, 0, 2)
N(v, G) = (1, 2, 2, 0)
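The neighborhood aggregate is a per-label neighbor count, which can be sketched as below. The label order (A, B, C, D) and the neighbor labels are inferred from the example vectors, and the pruning rule shown (drop v from C(u) when the label deficit exceeds θ) is an assumption for illustration, since the slide does not spell out the filtering condition.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Set, Tuple


def neighborhood_aggregate(v: Hashable, adj: Dict[Hashable, Set[Hashable]],
                           label: Dict[Hashable, str],
                           alphabet: Sequence[str]) -> Tuple[int, ...]:
    """N(v, g): number of neighbors of v carrying each label, in alphabet order."""
    counts = Counter(label[w] for w in adj[v])
    return tuple(counts[l] for l in alphabet)


def label_deficit(n_u: Sequence[int], n_v: Sequence[int]) -> int:
    """How many labeled neighbors required by u cannot be supplied by v."""
    return sum(max(0, x - y) for x, y in zip(n_u, n_v))


alphabet = ["A", "B", "C", "D"]

# u in q with neighbors labeled A, A, B, D, D (hypothetical vertex names).
q_adj = {"u": {"a1", "a2", "b1", "d1", "d2"}}
q_label = {"a1": "A", "a2": "A", "b1": "B", "d1": "D", "d2": "D"}

# v in G with neighbors labeled A, B, B, C, C.
g_adj = {"v": {"x1", "y1", "y2", "z1", "z2"}}
g_label = {"x1": "A", "y1": "B", "y2": "B", "z1": "C", "z2": "C"}

n_u = neighborhood_aggregate("u", q_adj, q_label, alphabet)
n_v = neighborhood_aggregate("v", g_adj, g_label, alphabet)
print(n_u, n_v)                 # (2, 1, 0, 2) (1, 2, 2, 0)
print(label_deficit(n_u, n_v))  # 3

theta = 2
keep_v_as_candidate = label_deficit(n_u, n_v) <= theta  # assumed pruning rule
print(keep_v_as_candidate)      # False
```

Under the assumed rule, v would be pruned from C(u) for θ = 2 because u needs one more A-neighbor and two more D-neighbors than v can offer.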
Our Approach — Filtering & Ordering (II)
QI Search Ordering [VLDB’08 Shang et al.]
• Pick Head Vertex: the vertex u in q with minimum φ(u) (i.e., the number of occurrences of vertices in G with label l(u)).
• Pick Next Spanning Edge: the edge (u1, u2) with minimum φ(u1, u2) (i.e., the number of occurrences of edges in G with labels (l(u1), l(u2))), where u1 is a vertex incident on the previously picked spanning edges.
Filtering-based Search Ordering
• Pick Head Vertex: the vertex u in q with the minimum number of candidates (i.e., |C(u)|).
• Pick Next Spanning Edge: the edge (u1, u2) minimizing |C(u2)| × φ(u1, u2) / φ(u2), where u1 is a vertex incident on the previously picked spanning edges.
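The filtering-based ordering can be sketched as a greedy loop: start from the vertex with the fewest candidates, then repeatedly add the edge (u1, u2) that leaves the visited set with the smallest score |C(u2)| × φ(u1, u2) / φ(u2). Here φ(u2) is read as the frequency of u2's label in G, in line with the QI definition; the function name, input dictionaries, and all statistics are made-up values for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


def filtering_based_order(q_edges: List[Tuple[str, str]],
                          q_label: Dict[str, str],
                          cand_size: Dict[str, int],
                          phi_v: Dict[str, int],
                          phi_e: Dict[FrozenSet[str], int]):
    """Return (head vertex, ordered spanning edges) for a connected query q.

    cand_size[u] = |C(u)|; phi_v[label] and phi_e[{label1, label2}] are the
    label and edge-label frequencies in G."""
    head = min(cand_size, key=cand_size.get)
    visited = {head}
    order: List[Tuple[str, str]] = []
    while len(visited) < len(cand_size):
        best = None
        for a, b in q_edges:
            for u1, u2 in ((a, b), (b, a)):
                if u1 in visited and u2 not in visited:
                    key = frozenset((q_label[u1], q_label[u2]))
                    score = cand_size[u2] * phi_e[key] / phi_v[q_label[u2]]
                    if best is None or score < best[0]:
                        best = (score, (u1, u2))
        if best is None:
            raise ValueError("query graph is not connected")
        order.append(best[1])
        visited.add(best[1][1])
    return head, order


# Made-up query (a triangle) and statistics.
q_edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
q_label = {"u1": "A", "u2": "B", "u3": "C"}
cand_size = {"u1": 2, "u2": 5, "u3": 3}
phi_v = {"A": 10, "B": 20, "C": 5}
phi_e = {frozenset("AB"): 8, frozenset("AC"): 4, frozenset("BC"): 6}

head, order = filtering_based_order(q_edges, q_label, cand_size, phi_v, phi_e)
print(head, order)  # u1 [('u1', 'u2'), ('u1', 'u3')]
```

In the example, (u1, u2) scores 5 × 8 / 20 = 2.0 and (u1, u3) scores 3 × 4 / 5 = 2.4, so (u1, u2) is picked first.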
Experiments — Experimental Settings
Data Graphs
• GH: HPRD network (|V(GH)| = 9,460, |E(GH)| = 37,081).
• GS: default synthetic data graph.
• Other synthetic data graphs generated by varying data graph settings.
Query Graphs
• Randomly selected subgraphs of the corresponding data graphs.
Parameter Settings (default settings in bold)
• |V(G)|: 5k, 10k, 20k, 40k, 80k
• avg. deg(G): 4, 8, 12, 16, 20
• |ΣV|: 20, 50, 100, 200
• |V(q)|: 20, 40, 60, 80, 100
• avg. deg(q): 3, 4, 5, 6
• θ: 1, 2, 3, 4
Experiments — # of exact all-matching tests
• |QSAPPER|: # of exact all-matching tests by SAPPER [VLDB’10].
• |QT|: # of exact all-matching tests by the EnumerateAll paradigm.
• TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.
Experiments — Total Processing Time
Similarity All-Matching
• SAPPER: Generate all similarity matches.
• TSpan+: Run TSpan first and then generate all similarity matches based on similarity maximal matches.
Similarity Maximal All-Matching
• NaïveTSpan: Similarity maximal all-matching with no computation sharing.
• TSpan: Similarity maximal all-matching with computation sharing.
Experiments — Total Processing Time
Enumeration Paradigms
• PrecTSpan: Similarity maximal all-matching by EnumerateAll.
• TSpan: Similarity maximal all-matching by EnumerateOnDemand.
Filtering & Ordering
• TSpanQI: TSpan algorithm with QI search ordering.
• TSpanNF: TSpan algorithm with no filtering technique.
Experiments — Large-scale Data Graphs
TSpan on Large-scale Datasets
Conclusions
• Tree-based Spanning Search Paradigm
• EnumerateOnDemand Strategy
• Filtering-based Search Ordering

| | SAPPER | TSpan |
| --- | --- | --- |
| # of all-matching tests | $C_m^{\theta}$ | significantly less |
| each all-matching test | graph to graph | tree to graph |
| computation-sharing | no | yes |
| similarity results | non-maximal | maximal |
Thank You! Any Questions?