chapter x: graph miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... ·...
TRANSCRIPT
![Page 1: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/1.jpg)
Information Retrieval & Data MiningUniversität des Saarlandes, SaarbrückenWinter Semester 2013/14
X.1–3-
Chapter X: Graph Mining
1
![Page 2: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/2.jpg)
IR&DM ’13/14 21 January 2014 X.1–3-
Chapter X: Graph Mining1. Introduction to Graph Mining2. Centrality and Other Graph Properties3. Frequent Subgraph Mining
3.1. Graphs and Isomorphism3.2. Canonical Codes3.3. gSpan
4. Graph Clustering4.1. Clustering as Graph Cutting4.2. Spectral Clustering4.3. Markov Clustering
2
ZM Ch. 4, 11, 16
![Page 3: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/3.jpg)
IR&DM ’13/14 21 January 2014 X.1–3-
Chapter X.1: Introduction1. Why Graphs?2. What are Graphs?3. What to do with Graphs?
3
![Page 4: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/4.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Why Graphs?• Many real-world data sets are in the forms of graphs – Social networks–Hyperlinks– Protein–protein interaction–XML parse trees–…
• Many of these graphs are enormous–Humans cannot understand ⇒ task for data mining!
4
![Page 5: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/5.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
What are Graphs?• A graph is a pair (V, E ⊆ V2)–Elements in V are vertices or nodes of the graph– Pairs (v, u) in E are edges or arcs of the graph• Pairs can be either ordered or unordered for directed graphs or
undirected graphs, respectively
• The graphs can be labelled–Vertices can have labeling L(v)–Edges can have labeling L(v, u)
• A tree is a rooted, connected, and acyclic graph• Graphs can be represented using adjacency matrices– |V|×|V| matrix A with (A)ij = 1 if (vi, vj) ∈ E
5
![Page 6: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/6.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Eccentricity, Radius & Diameter• The distance d(vi, vj) between two vertices is the
(weighted) length of the shortest path between them• The eccentricity of a vertex vi, e(vi), is its maximum
distance to any other vertex, maxj{d(vi, vj)}• The radius of a connected graph, r(G), is the minimum
eccentricity of any vertex, mini{e(vi)}• The diameter of a connected graph, d(G), is the
maximum eccentricity of any vertex, maxi{e(vi)} = maxi,j{d(vi, vj)}– The effective diameter of a graph is smallest number that is
larger than the eccentricity of a large fraction of the vertices in the graph
6
![Page 7: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/7.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Clustering Coefficient• The clustering coefficient of vertex vi, C(vi), tells
how clique-like the neighbourhood of vi is–Let ni be the number of neighbours of vi and mi the number
of edges between the neighbours of vi (vi excluded)
–Well-defined only for vi with at least two neighbours• For others, let C(vi) = 0
• The clustering coefficient of the graph is the average clustering coefficient of the vertices: C(G) = n–1ΣiC(vi)
7
C(vi) = mi/
✓ni
2
◆=
2mi
ni(ni �1)
![Page 8: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/8.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
What to do with Graphs?• There are many interesting data one can mine from
graphs and sets of graphs–Cliques of friends from social networks–Hubs and authorities from link graphs–Who is the centre of the Hollywood– Subgraphs that appear frequently in a set of graphs–Areas with higher inter-connectivity than intra-connectivity–…
• Graph mining is perhaps the most popular topic in contemporary data mining research–Though not necessary called as such…
8
This week
![Page 9: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/9.jpg)
IR&DM ’13/14 21 January 2014 X.1–3-
Chapter X.2: Centrality and Other Graph Properties
1. Centrality2. Graph Properties
9
ZM Ch. 4
![Page 10: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/10.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Centrality• Six degrees of Kevin Bacon– ”Every actor is related to Kevin
Bacon by no more than 6 hops”–Kevin Bacon has acted with many,
that have acted with many others,that have acted with many others…
• That makes Kevin Bacon acentre of the co-acting graph–Although he’s not the centre: the
average distance to him is 2.998but to Harvey Keitel it is only2.848
10
http://oracleofbacon.org
![Page 11: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/11.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Degree and Eccentricity Centrality• Centrality is a function c: V → ℝ that induces a total
order in V–The higher the centrality of a vertex, the more important it
is• In degree centrality c(vi) = d(vi), the degree of the
vertex• In eccentricity centrality the least eccentric vertex is
the most central one, c(vi) = 1/e(vi)–The lest eccentric vertex is central –The most eccentric vertex is peripheral
11
![Page 12: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/12.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Closeness Centrality• In closeness centrality the vertex with least distance
to all other vertices is the centre
• In eccentricity centrality we aim to minimize the maximum distance• In closeness centrality we aim to minimize the
average distance–This is the distance used to measure the centre of
Hollywood
12
c(vi) =
Âj
d(vi,v j)
!�1
![Page 13: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/13.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Betweenness Centrality• The betweenness centrality measures the number of
shortest paths that travel through vi
–Measures the “monitoring” role of the vertex– “All roads lead to Rome”
• Let ηjk be the number of shortest paths between vj and vk and let ηjk(vi) be the number of those that include vi
–Let γjk(vi) = ηjk(vi)/ηjk
–Betweenness centrality is defined as
13
c(vi) = Âj 6=i
Âk 6=ik> j
g jk
![Page 14: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/14.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Prestige• In prestige, the vertex is more central if it has many
incoming edges from other vertices of high prestige–A is the adjacency matrix of the directed graph G – p is n-dimensional vector giving the prestige of the vertices– p = ATp– Starting from an initial prestige vector p0, we getpk = ATpk–1 = AT(ATpk–2) = (AT)2pk–2 = (AT)3pk–3 = … = (AT)kp0
• Vector p converges to the dominant eigenvector of AT
–Under some assumptions• N.B. PageRank is based on (normalized) prestige
14
![Page 15: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/15.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Graph Properties• Several real-world graphs exhibit certain
characteristics– Studying what these are and explaining why they appear is
an important area of network research• As data miners, we need to understand the
consequences of these characteristics– Finding a result that can be explained merely by one of
these characteristics is not interesting• We also want to model graphs with these
characteristics
15
![Page 16: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/16.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Small-World Property• A graph G is said to exhibit a small-world property
if its average path length scales logarithmically,µL ∝ log n–The six degrees of Kevin Bacon is based on this property–Also the Erdős number•How far a mathematician is from Hungarian combinatorist Paul
Erdős•A radius of a large, connected mathematical co-authorship
network (268K authors) is 12 and diameter 23
16
![Page 17: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/17.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Scale-Free Property• The degree distribution of a graph is the distribution
of its vertex degrees –How many vertices with degree 1, how many with degree 2,
etc.– f(k) is the number of edges with degree k
• A graph is said to exhibit scale-free property if f(k) ∝ k–γ
– So-called power-law distribution–Majority of vertices have small degrees, few have very high
degrees– Scale-free: f(ck) = α(ck)–γ = (αc–γ)k–γ ∝ k–γ
17
![Page 18: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/18.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Example: WWW Links
18
IRDM WS 2007 5-6
Web Structure: Power-Law Degrees(Scale-Free Network)
• power-law distributed degrees: P[degree=k] ~ (1/k)�with � � 2.1 for indegrees and � � 2.7 for outdegrees
Study of Web Graph (Broder et al. 2000)
Broder et al. Graph structure in the web. WWW’00
s = 2.09 s = 2.72
In-degree Out-degree
![Page 19: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/19.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Clustering Effect• A graph exhibits clustering effect if the distribution
of average clustering coefficient (per degree) follow the power law– If C(k) is the average clustering coefficient of all vertices of
degree k, then C(k) ∝ k–γ
• The vertices with small degrees are part of highly clustered areas (high clustering coefficient) while “hub vertices” have smaller clustering coefficients
19
![Page 20: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/20.jpg)
IR&DM ’13/14 21 January 2014 X.1–3-
Chapter X.3: Frequent Subgraph Mining1. Graphs and Isomorphism
1.1. Definitions1.2. Support of a subgraph
2. Canonical Codes3. gSPAN Algorithm4. Easier Problems
20
ZM Ch. 11
![Page 21: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/21.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Graphs and Isomorphism• Graph (V’, E’) is the subgraph of graph (V, E) if –V’ ⊆ V –E’ ⊆ E
• Note that subgraphs don’t have to be connected–Today we consider only connected subgraphs
• To check whether a graph is a subgraph of other is trivial–But in most real-world applications there are no direct
subgraphs–Two graphs might be similar even if their vertex sets are
disjoint
21
![Page 22: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/22.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Graph Isomorphism• Graphs G = (V, E) and G’ = (V’, E’) are isomorphic if
there exists a bijective function φ: V → V’ such that– (u, v) ∈ E if and only if (φ(u), φ(v)) ∈ E’– L(v) = L(φ(v)) for all v ∈ V– L(u, v) = L(φ(u), φ(v)) for all (u, v) ∈ E
• Graph G’ is subgraph isomorphic to G if there exists a subgraph of G which is isomorphic to G’• No polynomial-time algorithm is known for
determining if G and G’ are isomorphic• Determining if G’ is subgraph isomorphic to G is NP-
hard22
![Page 23: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/23.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Equivalence and Canonical Graphs• Isomorphism defines an equivalence class– id: V → V, id(v) = v shows G is isomorphic to itself– If G is isomorphic to G’ via φ, then G’ is isomorphic to G
via φ–1
– If G is isomorphic to H via φ and H to I via χ, then G is isomorphic to I via φ○χ
• A canonization of a graph G, canon(G) produces another graph C such that if H is a graph that is isomorphic to G, canon(G) = canon(H)–Two graphs are isomorphic if and only if their canonical
versions are the same
23
![Page 24: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/24.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
An Example of Isomorphic Graphs
24
ab
ca
b a
![Page 25: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/25.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
An Example of Isomorphic Graphs
25
a
b
c
a
ba
![Page 26: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/26.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
An Example of Isomorphic Graphs
26
a b
ca
b a
a
b
c
a
ba
![Page 27: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/27.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Frequent Subgraph Mining• Given a set D of n graphs and a minimum support
parameter minsup, find all connected graphs that are subgraph isomorphic to at least minsup graphs in D–Enormously complex problem– For graphs that have m vertices there are• subgraphs (not all are connected)
– If we have s labels for vertices and edges we have• labelings of the different graphs
–Counting the support means solving multiple NP-hard problems
27
2O(m2)
O⇣(2s)O(m2)
⌘
![Page 28: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/28.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
An Example
28
ab
c
a
b
a
a
b
a
c
a
b
a
![Page 29: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/29.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Canonical Codes• We can improve the running time of frequent
subgraph mining by either–Making the frequency check faster•Lots of efforts in faster isomorphism checking but only little
progress–Creating less candidates that need to be checked•Level-wise algorithms (like AGM) generate huge numbers of
candidates•Each must be checked with for isomorphism with others
• The gSpan (graph-based Substructure pattern mining) algorithm replaces the level-wise approach with a depth-first approach
29
Yan & Han 2002; Z&M Ch. 11
![Page 30: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/30.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Depth-First Spanning Tree• A depth-first spanning (DFS) tree of a graph G– Is a connected tree–Contains all the vertices of G– Is build in depth-first order• Selection between the siblings is e.g. based on the vertex index
• Edges of the DFS tree are forward edges • Edges not in the DFS tree are backward edges• A rightmost path in the DFS tree is the path travels
from the root to the rightmost vertex by always taking the rightmost child (last-added)
30
![Page 31: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/31.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
An Example
31
a
d
c
c
a
b
b
a
v1
v6
v4
v5
v2
v3
v7
v8
![Page 32: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/32.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
The DFS Tree
32
a
a
a
v1
v6v4
v5
v2
v3 v7
v8
dc
c
b
b
![Page 33: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/33.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Generating Candidates from DFS Tree
33
• Given graph G, we extend it only from the vertices in the rightmost path–We can add backwards edges from the rightmost vertex to
some other vertex in the rightmost path–We can add a forward edge from any vertex in the rightmost
path•This increases the number of vertices by 1
• The order of generating the candidates is– First backward extensions• First to root, then to root’s child, …
–Then forward extensions• First from the leaf, then from leaf’s father, …
![Page 34: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/34.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
An Example
34
a
a
a
v1
v6v4
v5
v2
v3 v7
v8
dc
c
b
b
![Page 35: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/35.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
DFS Codes and their Orders
35
• A DFS code is a sequence of tuples of type ⟨vi, vj, L(vi), L(vj), L(vi,vj)⟩– Tuples are given in DFS order•Backwards edges are listed before forward edges•Vertices are numbered in DFS order
• A DFS code is canonical if it is the smallest of the codes in the ordering– ⟨vi, vj, L(vi), L(vj), L(vi,vj)⟩ < ⟨vx, vy, L(vx), L(vy), L(vx,vy)⟩ if• ⟨vi, vj⟩ <e ⟨vx, vy⟩; or• ⟨vi, vj⟩=⟨vx, vy⟩ and ⟨L(vi), L(vj), L(vi, vj)⟩ <l ⟨L(vx), L(vy), L(vx, vy)⟩
– The ordering of the label tuples is the lexicographical ordering
![Page 36: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/36.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Ordering the Edges• Let eij = ⟨vi, vj⟩ and exy = ⟨vx, vy⟩
• eij <e exy if– If eij and exy are forward edges, then • j < y; or • j = y and i > x
– If eij and exy are backward edges, then • i < x; or • i = x and j < y
– If eij is forward and exy is backward, then i < y– If eij is backward and exy is forward, then j ≤ x
36
![Page 37: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/37.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Example
37
CHAPTER 11. GRAPH PATTERN MINING 286
v1 a
G1
v2 a
v3 a v4 b
q
r r
r
v1 a
G2
v2 a
v3 b v4 a
q
r r
r
v1 a
G3
v2 a v4 b
v3 a
q
r
r
r
t11 = ⟨v1, v2, a, a, q⟩t12 = ⟨v2, v3, a, a, r⟩t13 = ⟨v3, v1, a, a, r⟩t14 = ⟨v2, v4, a, b, r⟩
t21 = ⟨v1, v2, a, a, q⟩t22 = ⟨v2, v3, a, b, r⟩t23 = ⟨v2, v4, a, a, r⟩t24 = ⟨v4, v1, a, a, r⟩
t31 = ⟨v1, v2, a, a, q⟩t32 = ⟨v2, v3, a, a, r⟩t33 = ⟨v3, v1, a, a, r⟩t34 = ⟨v1, v4, a, b, r⟩
Figure 11.6: Canonical DFS Code. G1 is canonical, whereas G2 and G3 are non-canonical. Vertex label set ΣV = {a, b}, and edge label set ΣE = {q, r}.
Here <e is an ordering on the edges and <l is an ordering on the vertex and edgelabels. The label order <l is the standard lexicographic order on the vertex and edgelabels. For example the label tuple ⟨a, a, r⟩ <l ⟨a, b, q⟩ since if we compare the twotuples in an element-wise manner, we find that L(vj) = a < b = L(vy).
The edge order <e is more involved. Let eij = ⟨vi, vj⟩ and exy = ⟨vx, vy⟩. Wewrite eij <e exy if the following conditions are met:
i) If eij and exy are both forward edges, then a) j < y or, b) j = y and i > x.
ii) If eij and exy are both backward edges, then a) i < x or b) i = x and j < y.
iii) If eij is a forward and exy is a backward edge, then i < y.
iv) If eij is a backward and exy is a forward edge, then j ≤ x.
Given the DFS codes for any two graphs, we can compare them tuple by tupleto check which is smaller.
Example 11.2: For example, consider the DFS codes for the three graphs shownin Figure 11.6. Comparing G1 and G2, we find that t11 = t21, but t12 < t22, since⟨a, a, r⟩ <l ⟨a, b, r⟩. Comparing G1 with G3, we find that the first three tuplesare equal in both graphs, but t14 < t34, since ⟨v2, v4⟩ <e ⟨v1, v4⟩. Applying therules for edge ordering <e, we find that condition i) applies, and vj = v4 = vy but
DRAFT @ 2012-09-19 21:46. Please do not distribute. Feedback is Welcome.Note that this book shall be available for purchase from Cambridge University Press and otherstandard distribution channels, that no unauthorized distribution shall be allowed, and that thereader may make one copy only for personal on-screen use.
First rows are identicalIn second row, G2 is bigger in labels’ orderLast rows are forward edges and 4 = 4 but 2 > 1 ⇒ G1 is smallest
![Page 38: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/38.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
The gSPAN Algorithm
38
• The general idea:–Use the DFS codes to create candidates•Extend only canonical and frequent candidates
• There can be very, very many extensions–And we need to see them all, and all of their isomorphisms,
to count the support
![Page 39: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/39.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Building the Candidates
39
• The candidates are build in a DFS code tree–A DFS code a is an ancestor of DFS code b if a is a proper
prefix of b–The siblings in the tree follow the DFS code order
• A graph can be frequent only if all of the graphs representing its ancestors in the DFS tree are frequent• The DFS tree contains all the canonical codes for all
the subgraphs of the graphs in the data–But not all of the vertices in the code tree correspond to
canonical codes• We will (implicitly) traverse this tree
![Page 40: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/40.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
The Algorithm• gSpan:– for each frequent 1-edge graphs• call subgrm to grow all nodes in the code tree rooted in this 1-edge graph• remove this edge from the graph
• subgrm– if the code is not canonical, return–Add this graph to the set of frequent graphs– Create each super-graph with one more edge and compute its
frequency– call subgrm with each frequent super-graph’s canonical
representation
40
![Page 41: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/41.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
How to compute the frequency?• gSPAN merges extension generation and support
computation• For each graph in the data base– gSPAN computes all the isomorphisms of the current
candidate•Can mean solving NP-complete problems…
– For all isomorphisms, gSPAN computes all backward and forward extensions•These extensions are stored together with the graph they appear
in
• The support of each extension is the number of times we’ve stored it
41
![Page 42: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/42.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
How to check the canonicity?• Given a DFS code of an extension, we need to check
if the code is canonical• This can be done by re-creating the code–At every step, choose the smallest of the right-most path
extensions of the current code in the graph corresponding to the extension
• If at any step we get a code that is smaller than the suffix of the extension’s code, we can’t have a canonical code– If after k steps we arrive to the extensions code, the code
was canonical
42
![Page 43: Chapter X: Graph Miningresources.mpi-inf.mpg.de/departments/d5/teaching/ws13_14/... · 2014-01-23 · 3. gSPAN Algorithm 4. Easier Problems 20 ZM Ch. 11. IR&DM ’13/14 21 January](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f224af9f910a62bf426d3b5/html5/thumbnails/43.jpg)
IR&DM ’13/14 X.1–3-21 January 2014
Easier Problems• Much of the complexity of subgraph mining lies in
the isomorphism• But for some types of graphs isomorphism is easy–Different types of trees•Ordered and unordered•Rooted and unrooted
–Graphs where every node has a distinct label
43