Technische Universität München
Large-Scale Graph Mining Using Backbone Refinement Classes
05/2009
Andreas Maunz (1), Christoph Helma (1,2), and Stefan Kramer (3)
1) FDM Universität Freiburg (D)
2) in-silico toxicology Basel (CH)
3) Technische Universität München (D)
BACKBONE REFINEMENT CLASS MINING
Efficient diverse substructure mining from a large class-labelled graph database
BBRC Rationale
Trees are the most frequent substructure type, yet they are efficiently enumerable. However:
• Excessively large result sets are obtained even for high correlation and minimum frequency constraints.
Figure: Typical substructure frequencies for databases of small molecules: Paths 5%, Real Trees 85%, Cycle-closing Graphs 10%.
BBRC Definitions
GASTON (GrAph, Sequence and Tree ExtractiON) by Nijssen and Kok [1]:
• Backbone of a tree: the longest path with the lowest sequence (assuming canonical sequence ordering).
• Since every tree has exactly one backbone, backbones partition the partial order of trees disjointly.
Backbone Refinement Class (BBRC): all tree refinements growing from a specific backbone.
• Pre-order (depth-first) traversal is used within each partition to refine structures.
[1] Nijssen S. & Kok J.N.: "A Quickstart in Frequent Structure Mining can make a Difference", KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM, 2004, pp. 647–652.
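The backbone definition above can be illustrated with a minimal Python sketch. It is not the GASTON or libfminer implementation: it assumes that the canonical sequence ordering is plain lexicographic order on node labels (edge labels are ignored), and the function names `path_between` and `backbone` as well as the example tree are hypothetical. For small trees it simply enumerates all leaf-to-leaf paths.

```python
from itertools import combinations


def path_between(adj, start, goal):
    """Return the unique path from start to goal in a tree (iterative DFS)."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            # In a tree, never stepping back to the previous node avoids revisits.
            if len(path) < 2 or nxt != path[-2]:
                stack.append((nxt, path + [nxt]))
    raise ValueError("start and goal are not connected")


def backbone(labels, edges):
    """Label sequence of the longest path; ties go to the lowest sequence."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if len(labels) == 1:
        return list(labels.values())
    # In a tree with at least two nodes, a longest path ends in two leaves.
    leaves = [v for v in adj if len(adj[v]) == 1]
    best_key, best_seq = None, None
    for a, b in combinations(leaves, 2):
        seq = [labels[v] for v in path_between(adj, a, b)]
        seq = min(seq, seq[::-1])  # a path can be read in both directions
        key = (-len(seq), seq)     # longest first, then lowest sequence
        if best_key is None or key < best_key:
            best_key, best_seq = key, seq
    return best_seq


# Hypothetical example: a seven-atom chain with one extra carbon branch.
labels = {0: "c", 1: "c", 2: "c", 3: "C", 4: "C", 5: "O", 6: "C", 7: "C"}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 7)]
print(backbone(labels, edges))
```

Because the key combines path length and label sequence, every tree yields exactly one backbone, which is the property the partitioning argument relies on.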
BBRC Example
Figure: Refinement of structures into Class 1 and Class 2 (backbones in gray).
Backbone: c:c:c-C=C-O-C
Structures shown:
C-C(-O-C)(=C-c:c:c)
C-C(=C(-O-C)(-C))(-c:c:c)
C-C(=C-O-C)(-c:c:c)
BBRC Properties (1)
Some Properties
• BBRCs partition the search space structurally (as opposed to occurrence-based methods, such as open/closed features).
• Two types of BBRCs:
I. within a backbone: not disjoint (see figure)
II. across backbones: disjoint
• A given backbone spans a maximum search tree. No node may be added without changing the backbone.
Figure: Search space for two BBRCs within the same backbone.
BBRC Properties (2)
The Number of BBRCs
Consider the special case of a rooted perfect binary tree of height h.
Figure: Perfect binary tree of height 3 (backbone with branches in gray).
→ The number of Backbone Refinement Classes is governed by the (recursive) branches on this backbone.
BBRC Properties (3)
The Number of BBRCs (unpublished)
The number of backbone refinement classes of a branch of length l is given by a recurrence for b(l). [Equation garbled in extraction; the recoverable terms are (l-1)!, (l-1), and a sum of b(i) over i = 1, ..., l-1.]
The full set of subtrees containing the root has size [1]
$$F(h) \approx q^{2^{h+1}} - 1,$$
where q ≈ 1.50284.
The set of BBRCs containing the root has size B(h). [Equation garbled in extraction; the recoverable symbols are a factorial term, b(s), b(t), and the summation indices i, j, s, t.]
[1] L. A. Székely, Hua Wang, "On subtrees of trees", Advances in Applied Mathematics, Volume 34, Issue 1, January 2005, pages 138–155.
BBRC Properties (4)
Summary of Feature Counts
h   B(h)        F(h)
1   1           4
2   8           25
3   632         676
4   6.03E+004   4.58E+005
5   1.19E+007   2.10E+011
6   4.13E+009   4.41E+022
7   2.14E+012   1.95E+045
8   1.54E+015   3.79E+090
Figure: B(h) and F(h) for h = 1 to 8 on a logarithmic axis (1.00E+000 to 1.00E+100).
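The F(h) column can be cross-checked with a short Python sketch. It does not come from the slides: it assumes the standard counting argument that each of the two children of the root is either left out or contributes one of the F(h-1) subtrees rooted at it, giving F(h) = (F(h-1) + 1)^2 with F(0) = 1. The function name `subtrees_containing_root` is hypothetical.

```python
def subtrees_containing_root(h):
    """Count subtrees of a perfect binary tree of height h that contain the root.

    Assumed recurrence: F(0) = 1 and F(h) = (F(h - 1) + 1) ** 2.
    """
    count = 1  # height 0: the root alone
    for _ in range(h):
        count = (count + 1) ** 2
    return count


# Print the values for the heights used in the table.
for h in range(1, 9):
    print(h, f"{subtrees_containing_root(h):.2E}")
```

Python integers have arbitrary precision, so the values for h = 7 and h = 8 are computed exactly before being formatted.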
BBRC Implementation
Use paths as candidate backbones. Idea: mine BBRCs and represent each BBRC by its most (χ²-) significant member.
• χ² thresholds cannot be used for anti-monotonic pruning; however, an upper bound for the χ² values of refinements of a pattern exists [1] (Statistical Metric Pruning).
Dynamic Upper Bound Pruning: the χ² threshold may be increased during depth-first traversal, since we only search for the maximum elements of the classes.
• In case of several most significant members, use the most general one.
[1] S. Morishita and J. Sese. Traversing Itemset Lattices with Statistical Metric Pruning. In Symposium on Principles of Database Systems, pages 226–236, 2000.
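The pruning idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the libfminer implementation: a pattern is summarised by its support x (graphs containing it) and y (active graphs containing it) in a database of n graphs with m actives, and the bound used is the one from Statistical Metric Pruning, max{χ²(y, y), χ²(x - y, 0)}. All function names and the numeric thresholds are hypothetical.

```python
def chi_square(x, y, n, m):
    """Chi-square of a pattern occurring in x graphs, y of them active.

    n is the database size and m the number of active graphs.
    """
    denom = x * (n - x) * m * (n - m)
    if denom == 0:
        return 0.0
    a, b = y, x - y                  # pattern present: active, inactive
    c, d = m - y, (n - m) - (x - y)  # pattern absent: active, inactive
    return n * (a * d - b * c) ** 2 / denom


def chi_square_upper_bound(x, y, n, m):
    """Bound on chi-square for any refinement of a pattern with support (x, y)."""
    return max(chi_square(y, y, n, m), chi_square(x - y, 0, n, m))


def can_prune(x, y, n, m, threshold):
    """True if no refinement of the pattern can reach the threshold."""
    return chi_square_upper_bound(x, y, n, m) < threshold


n, m = 100, 50
user_threshold = 3.84  # hypothetical static threshold chosen by the user
best_in_class = 12.0   # hypothetical best chi-square found so far in the class
dynamic_threshold = max(user_threshold, best_in_class)

# Static pruning compares against the user threshold only.
print(can_prune(10, 6, n, m, user_threshold))
# Dynamic pruning raises the threshold to the best value already found.
print(can_prune(10, 6, n, m, dynamic_threshold))
```

Raising the threshold is safe in this setting because only the most significant member of each class is kept, so a branch that cannot beat the current best has nothing to contribute.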
BBRC Experiments (1)
Investigation of BBRCs regarding time efficiency, feature set sizes and expressiveness
• BBRC Representatives: most significant representatives of the backbone refinement classes.
• Significant Trees: all trees that are frequent and significant.
• Open Trees [1]: most general significant trees with the same occurrences.
Class-Balanced CPDB datasets:
• Salmonella Mutagenicity (SM, 388 active / 810 compounds)
• Rat Carcinogenicity (RC, 459 active / 1145 compounds)
• Mouse Carcinogenicity (MoC, 428 active / 927 compounds)
• Multicell Call (MuC, 553 active / 1067 compounds)
[1] B. Bringmann, A. Zimmermann, L. de Raedt, and S. Nijssen. Don't Be Afraid of Simpler Patterns. In Proceedings 10th PKDD, pages 55–66. Springer-Verlag, 2006.
BBRC Experiments (2)
Feature Set Sizes
Minimum frequency: 6
        Sign. Trees   Open Trees   BBRC Repr.
SM      27,093        8,062        2,715
RC      94,991        4,569        5,183
MoC     22,395        1,937        3,083
MuC     29,970        5,122        3,636
BBRC Experiments (3)
Time Efficiency
Minimum frequency: 6
        No statistical pruning   Static UB pruning   Dynamic UB pruning
SM      2.63                     2.55                0.44
RC      21.23                    21.11               6.63
MoC     3.71                     2.98                2.13
MuC     5.17                     4.76                1.76
BBRC Experiments (4)
Accuracy, Sensitivity, Specificity
Figure legend: Black: Sign. Trees; Dark Gray: BBRC-R.; Light Gray: Open Trees
             Sign. Tr.   Open Tr.   BBRC-R.
SM    all    74.6        75.5       74.6
SM    AD     80.7        80.6       79.4
SM    wt.    86.8        84.5       85.4
RC    all    64.4        64.5       67.2
RC    AD     70.0        68.7       70.4
RC    wt.    81.8        80.0       82.2
MoC   all    73.3        71.5       71.7
MoC   AD     75.7        74.4       76.5
MoC   wt.    83.7        80.8       82.0
MuC   all    71.9        70.2       70.3
MuC   AD     75.6        73.5       74.1
MuC   wt.    83.5        81.3       84.9
Instance-based predictions
all: all predictions
AD: top 80% confidence predictions
wt.: predictions weighted by confidence
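The three evaluation modes (all, AD, wt.) can be sketched in Python as follows. The slides do not give exact formulas, so the definitions here are assumptions: "AD" keeps the 80% of predictions with the highest confidence, and "wt." sums confidence over correct predictions and divides by the total confidence. The function name `accuracy_variants` and the sample data are hypothetical.

```python
def accuracy_variants(results, keep_fraction=0.8):
    """Return (all, AD, wt.) accuracies.

    results is a list of (is_correct, confidence) pairs, one per prediction.
    """
    n = len(results)
    # all: plain fraction of correct predictions
    acc_all = sum(correct for correct, _ in results) / n
    # AD: accuracy on the most confident predictions only (assumed definition)
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    top = ranked[:max(1, round(keep_fraction * n))]
    acc_ad = sum(correct for correct, _ in top) / len(top)
    # wt.: each prediction counts in proportion to its confidence (assumed)
    total_weight = sum(conf for _, conf in results)
    acc_wt = sum(conf for correct, conf in results if correct) / total_weight
    return acc_all, acc_ad, acc_wt


# Hypothetical predictions: (was the prediction correct?, confidence)
sample = [(True, 0.9), (True, 0.8), (False, 0.2), (True, 0.6), (False, 0.5)]
print(accuracy_variants(sample))
```

Under these definitions, AD and wt. exceed the plain accuracy whenever errors concentrate among low-confidence predictions, which matches the ordering seen in most rows of the table.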
BBRC Experiments (5)
Euclidean embedding based on Co-Occurrences and Entropy [1]
Figure legend: Active / Inactive compounds; Activating / Deactivating features.
Differently colored features are nearly perfectly separated.
Features are well distributed with few clusters.
[1] Hannes Schulz, Kristian Kersting, Andreas Karwath: ILP, the Blind, and the Elephant: Euclidean Embedding of Co-Proven Queries. Proceedings of the 19th International Conference on Inductive Logic Programming (ILP 2009), forthcoming.
Large-Scale Analysis (1)
Large Scale Analysis
NCI Yeast Anticancer Drug Screen datasets (April 2002 release)
1. AC-One (stage 0): 87,264 compounds, 12,068 active
2. AC-All (stage 0): 87,264 compounds, 5,777 active
3. AC-All (stage 1): 10,924 compounds, 5,433 active
To the best of the authors' knowledge, datasets 1 and 2 are the largest labelled datasets that have been considered in correlated graph mining.
Large-Scale Analysis (2)
Effects of Minimum Frequency on Dataset Coverage
AC-One (stage 0), 87,264 compounds:
Min. Freq.      Coverage   Time eff.
100 (~0.12%)    47.1       36m40s
200 (~0.23%)    44.7       19m40s
Similar results were obtained for the other datasets*.
Figure: BBRC descriptors are more probable in lighter regions.
* The effects of not using aromatic perception, i.e. no special node and edge labels for aromatic bonds, were much greater. The number of descriptors per compound in this setting was > 80 for both thresholds.
Large-Scale Analysis (3)
Feature Count for Balanced datasets (downsampling)
                  AC-one (stage 0), 23,400 comp.   AC-all (stage 1), 10,548 comp.
Sign. Trees       1,190,763                        291,729
Max. Trees [1]    556,673                          148,562
Open Trees        Memory alloc. error              216,206
BBRC Repr.        31,450                           14,381
Max. Trees: the positive border as implied by minimum frequency and significance constraints [1].
[1] M. Al Hasan et al. Origami: Mining Representative Orthogonal Graph Patterns. ICDM 2007, Seventh IEEE International Conference on Data Mining, pages 153–162, Oct. 2007.
Large-Scale Analysis (4)
Time Efficiency
Time efficiency (Mining)
AC-one (st. 0), 23,400   4m52s
AC-all (st. 1), 10,548   1m13s
Open Trees: mining times of ~12h
Time efficiency (Prediction)
AC-one (st. 0)   11.1s
AC-all (st. 1)   4.7s
Open Trees: prediction times of >60s, impractical RAM demand.
Accuracy
Figure: Accuracy (axis from 62 to 80) for all, AD and wt. predictions on AC-one (st. 0) and AC-all (st. 1).
all: all predictions
AD: top 80% confidence predictions
wt.: predictions weighted by confidence
SUMMARY
Summary (1)
Backbone Refinement Class Representatives
• Structurally heterogeneous descriptors, compression by structural invariant (backbone constraint)
• Good dataset coverage, robust against increasing minimum frequencies
• Applicable to large-scale graph databases through a novel statistical pruning technique
Summary (2)
Backbone Refinement Class Representatives
• Compression of 90% compared to all trees and 31% compared to open trees
• Time efficiency improved by 85% and 83% versus no statistical pruning and static upper bound pruning, respectively.
• Discriminative potential similar to the complete set of trees, but significantly better than open trees.
Acknowledgements
The authors would like to thank Björn Bringmann for providing a binary and friendly cooperation in dataset testing, and Ulrich Rückert for providing datasets.
The research was (partially) supported by the EU seventh framework programme under contract no Health-F5-2008-200787 (OpenTox).
http://www.opentox.org
C++ implementation: http://www.maunz.de/libfminer-doc