efficient mining of graph-based data
DESCRIPTION
Efficient Mining of Graph-Based Data. Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook, University of Texas at Arlington, Department of Computer Science and Engineering. http://cygnus.uta.edu/subdue. Motivation: structural/relational data; ease of graph representation.
TRANSCRIPT
![Page 1: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/1.jpg)
CSE@UTA SRL Workshop 1
Efficient Mining of Graph-Based Data
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington
Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue
![Page 2: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/2.jpg)
Motivation
- Structural/relational data
- Ease of graph representation
![Page 3: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/3.jpg)
Graph-Based Discovery
[Figure: three panels. Input Database: objects R1, C1, T1-T4 and B1-B4. Substructure S1 (graph form): object vertices with "shape" edges to triangle and square, joined by an "on" edge. Compressed Database: R1 and C1 plus the instances replaced by S1 vertices.]
![Page 4: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/4.jpg)
Algorithm
1. Create substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: input graph of triangle, square, circle and rectangle vertices connected by "on" edges]
![Page 5: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/5.jpg)
Algorithm
2. Expand best substructure by an edge or by an edge plus a neighboring vertex
Substructures: triangle/square, rectangle/square, rectangle/triangle and rectangle/circle pairs, each joined by an "on" edge
[Figure: the candidate substructures shown against the input graph of triangle, square, circle and rectangle vertices connected by "on" edges]
![Page 6: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/6.jpg)
Algorithm
3. Keep only the best beam-width substructures on the queue
4. Terminate when the queue is empty or the number of discovered substructures >= limit
5. Compress the graph and repeat to generate a hierarchical description
Note: the search is polynomially constrained
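Steps 1 to 4 can be sketched as a small beam search. This is a minimal illustration under stated assumptions, not the Subdue implementation: the names `discover`, `pattern_key` and `value` are hypothetical, patterns are identified by their sorted label multisets (a shortcut that is not a true isomorphism test), and `value` is a simple stand-in for the description-length metric. Step 5 (compress and repeat) is omitted, and the toy graph is invented for the example.

```python
from collections import defaultdict

def discover(vertices, edges, beam_width=3, limit=10):
    """Beam search for repeated substructures.
    vertices: {vertex_id: label}; edges: list of (src, dst, label)."""

    def pattern_key(instance):
        # Assumption: label multisets identify a pattern (not a real isomorphism test).
        vs, es = instance
        vertex_labels = tuple(sorted(vertices[v] for v in vs))
        edge_labels = tuple(sorted(
            (vertices[edges[e][0]], edges[e][2], vertices[edges[e][1]]) for e in es))
        return vertex_labels, edge_labels

    def group(instances):
        groups = defaultdict(set)
        for inst in instances:
            groups[pattern_key(inst)].add(inst)
        return groups

    def value(pattern, instances):
        # Stand-in score: elements saved by collapsing each instance to one
        # vertex, minus the cost of storing the pattern once.
        size = len(pattern[0]) + len(pattern[1])
        return len(instances) * (size - 1) - size

    # Step 1: one substructure per unique vertex label.
    singles = ((frozenset([v]), frozenset()) for v in vertices)
    queue = list(group(singles).items())
    seen = {pattern for pattern, _ in queue}
    discovered = []
    # Step 4: stop when the queue is empty or the limit is reached.
    while queue and len(discovered) < limit:
        # Step 3: keep only the best beam_width substructures.
        queue.sort(key=lambda item: value(*item), reverse=True)
        del queue[beam_width:]
        pattern, instances = queue.pop(0)
        discovered.append((pattern, instances))
        # Step 2: extend every instance by one adjacent edge.
        extended = set()
        for vs, es in instances:
            for i, (src, dst, _) in enumerate(edges):
                if i not in es and (src in vs or dst in vs):
                    extended.add((vs | {src, dst}, es | {i}))
        for child, child_instances in group(extended).items():
            if child not in seen:
                seen.add(child)
                queue.append((child, child_instances))
    return max(discovered, key=lambda item: value(*item))

# Hypothetical toy graph: four triangles on four squares, plus a rectangle and a circle.
vertices = {"T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
            "B1": "square", "B2": "square", "B3": "square", "B4": "square",
            "R1": "rectangle", "C1": "circle"}
edges = [("T1", "B1", "on"), ("T2", "B2", "on"), ("T3", "B3", "on"),
         ("T4", "B4", "on"), ("B1", "R1", "on"), ("C1", "R1", "on")]
best_pattern, best_instances = discover(vertices, edges)
print(best_pattern, len(best_instances))
```

The `seen` set keeps a pattern from being queued twice, which is what guarantees termination on a finite graph in this sketch.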
![Page 7: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/7.jpg)
Evaluation Metric
- Substructures evaluated based on ability to compress input graph
- Compression measured using minimum description length (DL)
- Best substructure S in graph G minimizes: DL(S) + DL(G|S)
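As a rough illustration of how DL(S) + DL(G|S) ranks candidates, the sketch below uses a deliberately crude bit count. The encoding is an assumption made for this example (label bits per vertex, two endpoint indices plus a label per edge) and is not the encoding used by Subdue; the helper names and the toy numbers are likewise hypothetical.

```python
import math

def bits(choices):
    """Bits needed to pick one of `choices` options (at least one bit)."""
    if choices < 2:
        return 1.0
    return math.log2(choices)

def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    # Assumed encoding: a label per vertex; two endpoints and a label per edge.
    vertex_bits = n_vertices * bits(n_vertex_labels)
    edge_bits = n_edges * (2 * bits(n_vertices) + bits(n_edge_labels))
    return vertex_bits + edge_bits

def total_dl(graph, substructure, n_instances):
    """DL(S) + DL(G|S), assuming non-overlapping instances."""
    gv, ge, vlabels, elabels = graph
    sv, se = substructure
    dl_s = description_length(sv, se, vlabels, elabels)
    # Each instance collapses to a single vertex carrying one new label.
    compressed_v = gv - n_instances * (sv - 1)
    compressed_e = ge - n_instances * se
    dl_g_given_s = description_length(compressed_v, compressed_e, vlabels + 1, elabels)
    return dl_s + dl_g_given_s

# Hypothetical graph: 10 vertices, 6 edges, 4 vertex labels, 1 edge label.
graph = (10, 6, 4, 1)
dl_graph = description_length(*graph)
dl_frequent = total_dl(graph, (2, 1), n_instances=4)  # pattern seen four times
dl_rare = total_dl(graph, (2, 1), n_instances=1)      # pattern seen once
print(dl_graph, dl_frequent, dl_rare)
```

Under this encoding the pattern with four instances gives the smaller total, so it would be preferred.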
![Page 8: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/8.jpg)
Examples
![Page 9: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/9.jpg)
Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to isomorphism of another
- Match if cost/size < threshold
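The cost/size test can be illustrated with a brute-force matcher for very small graphs. This is a sketch under assumptions: every vertex relabeling, insertion or deletion costs 1, every edge insertion or deletion costs 1, and "size" is taken as vertices plus edges of the larger graph. The exhaustive search over vertex mappings is only practical for a handful of vertices, and the function names are hypothetical.

```python
from itertools import permutations

def match_cost(g1, g2):
    """Minimum edit cost between two small labeled graphs.
    A graph is (vertex_labels, edges) with edges as (src_index, dst_index, label)."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with None so every mapping is a permutation.
    padded1 = list(labels1) + [None] * (n - len(labels1))
    padded2 = list(labels2) + [None] * (n - len(labels2))
    target_edges = set(edges2)

    def cost(perm):
        vertex_cost = sum(padded1[i] != padded2[perm[i]] for i in range(n))
        mapped = {(perm[u], perm[v], label) for u, v, label in edges1}
        # Edges present on one side only must be inserted or deleted.
        return vertex_cost + len(mapped ^ target_edges)

    return min(cost(perm) for perm in permutations(range(n)))

def size(graph):
    labels, edges = graph
    return len(labels) + len(edges)

def is_match(g1, g2, threshold):
    return match_cost(g1, g2) / max(size(g1), size(g2)) < threshold

# Hypothetical examples: the same pattern with reordered vertices, and a variant.
tri_on_sq = (["triangle", "square"], [(0, 1, "on")])
reordered = (["square", "triangle"], [(1, 0, "on")])
circ_on_sq = (["circle", "square"], [(0, 1, "on")])
print(match_cost(tri_on_sq, reordered), match_cost(tri_on_sq, circ_on_sq))
```

Raising the threshold lets more distant variants count as instances of the same substructure; a threshold of 0 accepts only exact matches under this cost model.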
![Page 10: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/10.jpg)
Parallel/Distributed Discovery
- Divide graph into P partitions using Metis, distribute to P processors
- Each processor performs serial Subdue on local partition
- Broadcast best substructures, evaluate on other processors
- Master processor stores best global substructures
- Close to linear speedup
![Page 11: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/11.jpg)
Graph-Based Concept Learning
- One graph stores positive examples
- One graph stores negative examples
- Find substructure that compresses positive graph but not negative graph
  - Scored by (PosEgsNotCovered) + (NegEgsCovered)
- Multiple iterations implement a set-covering approach
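The set-covering loop can be sketched as follows. To keep the example short, each example is a set of relation names and a candidate "covers" an example when it is a subset of it; that subset test is an assumed stand-in for subgraph matching, and the candidates, examples and function names are all hypothetical.

```python
def error(candidate, positives, negatives):
    """(PosEgsNotCovered) + (NegEgsCovered) for one candidate substructure."""
    pos_not_covered = sum(1 for ex in positives if not candidate <= ex)
    neg_covered = sum(1 for ex in negatives if candidate <= ex)
    return pos_not_covered + neg_covered

def learn(candidates, positives, negatives):
    """Pick the lowest-error candidate, drop the positives it covers, repeat."""
    rules = []
    remaining = list(positives)
    while remaining:
        best = min(candidates, key=lambda c: error(c, remaining, negatives))
        if not any(best <= ex for ex in remaining):
            break  # no candidate covers what is left
        rules.append(best)
        remaining = [ex for ex in remaining if not best <= ex]
    return rules

# Hypothetical data: each example is the set of relations it contains.
positives = [{"tri_on_sq", "sq_on_rect"}, {"tri_on_sq"}, {"circ_on_sq", "sq_on_rect"}]
negatives = [{"sq_on_rect"}, {"sq_on_tri"}]
candidates = [frozenset({"tri_on_sq"}), frozenset({"circ_on_sq"}), frozenset({"sq_on_rect"})]
rules = learn(candidates, positives, negatives)
print(rules)
```

Each pass adds one substructure to the learned concept, so the result reads as a disjunction of substructures.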
![Page 12: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/12.jpg)
Concept-Learning Example
[Figure: example graph with three "object" vertices linked by two "on" edges; "shape" edges attach the labels triangle and square]
![Page 13: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/13.jpg)
Concept-Learning Results
- Chess endgames (19,257 examples)
  - Black King is (+) or is not (-) in check
  - 99.8% FOIL, 99.21% Subdue
![Page 14: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/14.jpg)
More Concept-Learning Results
- Tic-Tac-Toe endgames
  - + is win for X (958 examples)
  - 100% Subdue, 92.35% FOIL
- Bach chorales
  - Musical sequences (20 sequences)
  - 100% Subdue, 85.71% FOIL
![Page 15: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/15.jpg)
Graph-Based Clustering
- Iterate Subdue until the graph is compressed to a single vertex
- Each cluster (substructure) is inserted into a classification lattice
[Figure: classification lattice with node "Root"]
![Page 16: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/16.jpg)
Clustering Example: Animals

| Name | Body Cover | Heart Chamber | Body Temp. | Fertilization |
|---|---|---|---|---|
| mammal | hair | four | regulated | internal |
| bird | feathers | four | regulated | internal |
| reptile | cornified-skin | imperfect-four | unregulated | internal |
| amphibian | moist-skin | three | unregulated | external |
| fish | scales | two | unregulated | external |

[Figure: graph representation of the mammal example: an "animal" vertex with edges Name, BodyCover, HeartChamber, BodyTemp and Fertilization leading to mammal, hair, four, regulated and internal]
![Page 17: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/17.jpg)
Graph-Based Clustering Results
[Figure: classification lattice produced by Subdue]
- Animals
  - HeartChamber: four; BodyTemp: regulated; Fertilization: internal
    - Name: mammal; BodyCover: hair
    - Name: bird; BodyCover: feathers
  - BodyTemp: unregulated
    - Name: reptile; BodyCover: cornified-skin; HeartChamber: imperfect-four; Fertilization: internal
    - Fertilization: external
      - Name: fish; BodyCover: scales; HeartChamber: two
      - Name: amphibian; BodyCover: moist-skin; HeartChamber: three
![Page 18: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/18.jpg)
Cobweb Results
- Comparison of Subdue and Cobweb results
  - Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
  - Subdue lattice identifies overlap between (reptile) and (amphibian/fish)
[Figure: Cobweb hierarchy: animals at the root; children amphibian/fish, mammal/bird and reptile; leaves mammal, bird, fish and amphibian]
![Page 19: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/19.jpg)
Clustering Example: DNA
![Page 20: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/20.jpg)
Graph-Based Clustering Results
[Figure: classification lattice for the DNA example, rooted at "DNA"; the discovered substructures are molecular fragments built from C, N, O, P, OH and CH2 (including an O=P-OH group), with coverage of 61%, 68% and 71%]
![Page 21: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/21.jpg)
Evaluation of Clusterings
- Traditional evaluation:

$$\text{ClusteringQuality} = \frac{\text{InterClusterDistance}}{\text{IntraClusterDistance}}$$

- Not applicable to hierarchical domains
  - Does not make sense to compare clusters in different subtrees
- Not applicable to relational clusterings
![Page 22: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/22.jpg)
Properties of Good Clusterings
- Small number of clusters
  - Large coverage → good generality
- Big cluster descriptions
  - More features → more inferential power
- Minimal or no overlap between clusters
  - More distinct clusters → better defined concepts
![Page 23: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/23.jpg)
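New Evaluation Heuristic for Hierarchical Clusterings

$$
CQ_C = \frac{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \sum_{k=1}^{|H_i|} \sum_{l=1}^{|H_j|} \frac{\mathrm{distance}(H_{i,k}, H_{j,l})}{\max(\mathrm{size}(H_{i,k}), \mathrm{size}(H_{j,l}))}}{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} |H_i|\,|H_j|} + \frac{\displaystyle\sum_{i=1}^{c} CQ_{H_i}}{c}
$$

- Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}
- distance() measured by inexact graph match
- Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7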
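A recursive computation of such a heuristic might look like the sketch below. It assumes that the quality of a cluster is the average size-normalized distance between instances of different children plus the mean quality of the children, with childless clusters scoring 0. The dictionary layout, the function name, and the set-difference distance (standing in for inexact graph match) are assumptions made for the example.

```python
from itertools import combinations

def clustering_quality(cluster, distance, size):
    """Quality of the clustering rooted at `cluster`.
    A cluster is {"instances": [...], "children": [...]}."""
    children = cluster["children"]
    c = len(children)
    if c == 0:
        return 0.0  # assumption: a leaf contributes nothing
    total = 0.0
    pairs = 0
    # Inter-cluster term: every instance pair drawn from two different children.
    for h_i, h_j in combinations(children, 2):
        for a in h_i["instances"]:
            for b in h_j["instances"]:
                total += distance(a, b) / max(size(a), size(b))
        pairs += len(h_i["instances"]) * len(h_j["instances"])
    inter = total / pairs if pairs else 0.0
    # Recursive term: mean quality of the child clusterings.
    child_quality = sum(clustering_quality(h, distance, size) for h in children)
    return inter + child_quality / c

def leaf(instances):
    return {"instances": instances, "children": []}

# Hypothetical instances: feature sets; distance counts differing features.
h1 = leaf([{"a", "b"}, {"a", "b", "c"}])
h2 = leaf([{"x", "y"}])
root = {"instances": h1["instances"] + h2["instances"], "children": [h1, h2]}
quality = clustering_quality(root, lambda a, b: len(a ^ b), len)
print(quality)
```

Because the inter-cluster term is averaged over instance pairs and the recursive term over children, scores remain comparable across hierarchies with different branching factors.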
![Page 24: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/24.jpg)
Graph-Based Data Mining: Application Domains
- Biochemical domains
  - Protein data
  - DNA data
  - Toxicology (cancer) data
- Spatial-temporal domains
  - Earthquake data
  - Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web topology
[Figure: Web topology graph of web_page vertices connected by hyperlink edges, with the label "home"]
![Page 25: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/25.jpg)
Theoretical Analysis
- Galois lattice [Liquiere et al.]
- Conceptual graphs [Sowa et al.]
- PAC analysis [Jappy et al.]
![Page 26: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/26.jpg)
Graph-based Data Mining
- Pattern (substructure) discovery
- Hierarchical discovery
- Distributed discovery
- Concept learning
- Clustering
- Compression heuristic based on minimum description length
![Page 27: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/27.jpg)
Future Work
- Concept learning
  - Theoretical analysis
  - Comparison to ILP systems
- Clustering
  - Classification lattice
  - Hierarchical relational conceptual clustering evaluation metric
- Probabilistic substructures
- Domains: WWW, source code
![Page 28: Efficient Mining of Graph-Based Data](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56812b47550346895d8f60b6/html5/thumbnails/28.jpg)
Subdue Source Code and Data
http://cygnus.uta.edu/subdue