graph mining applications to machine learning problems

Post on 30-Dec-2015

1

Graph Mining Applications to Machine Learning Problems

Max Planck Institute for Biological Cybernetics

Koji Tsuda

2

Graphs…

3

Graph Structures in Biology

DNA Sequence (e.g., A C G C)

RNA (secondary structure with paired bases such as CG and UA, and unpaired U's)

Texts in literature (e.g., "Amitriptyline inhibits adenosine uptake")

Compounds (atoms such as C, O and H as labeled nodes)

4

Substructure Representation

0/1 vector of pattern indicators

Huge dimensionality!

Need Graph Mining for selecting features

Better than paths (Marginalized graph kernels)

[Figure: patterns]

5

Overview

Quick Review on Graph Mining

EM-based Clustering algorithm: Mixture model with L1 feature selection

Graph Boosting: Supervised Regression for QSAR Analysis; Linear programming meets graph mining

6

Quick Review of Graph Mining

7

Graph Mining

Analysis of Graph Databases

Find all patterns satisfying predetermined conditions

Frequent Substructure Mining

Combinatorial, Exhaustive

Recently developed: AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)

8

Graph Mining

Frequent Substructure Mining

Enumerate all patterns that occur in at least m graphs

$x_{ik}$: Indicator of pattern k in graph i

Support(k): number of graphs in which pattern k occurs, $\sum_i x_{ik}$

9

gSpan (Yan and Han, 2002)

Efficient Frequent Substructure Mining Method

DFS Code: Efficient detection of isomorphic patterns

We extend gSpan for our work

10

Enumeration on Tree-shaped Search Space

Each node has a pattern

Generate nodes from the root: Add an edge at each step

11

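Tree Pruning

Anti-monotonicity: extending a pattern never increases its support, so Support(g') ≤ Support(g) for every supergraph g' of g

If Support(g) < m, stop exploring! The subtree below g is not generated

Support(g): number of graphs in which pattern g occurs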
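The pruning rule can be illustrated with a minimal sketch. As a simplifying assumption, item sets stand in for subgraphs here, so no graph isomorphism test or DFS code is needed; this is not gSpan itself, only the same tree-shaped enumeration with support-based pruning. The function name `frequent_patterns` and the toy database are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def frequent_patterns(db: List[Set[str]], m: int) -> Dict[FrozenSet[str], int]:
    """Enumerate every item set contained in at least m transactions."""
    items = sorted(set().union(*db)) if db else []
    found: Dict[FrozenSet[str], int] = {}

    def support(pattern: FrozenSet[str]) -> int:
        # Number of transactions that contain the whole pattern.
        return sum(1 for t in db if pattern <= t)

    def grow(pattern: FrozenSet[str], start: int) -> None:
        # Children add one item with a larger index, so each pattern is
        # generated exactly once (a toy analogue of a canonical code).
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child)
            if s < m:
                continue  # anti-monotonicity: no extension can reach m
            found[child] = s
            grow(child, idx + 1)

    grow(frozenset(), 0)
    return found


# Toy database (assumption): each set plays the role of one graph.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = frequent_patterns(db, m=2)
```

With m = 2, the pattern {a, b, c} occurs in only one transaction, so it is skipped and nothing below it would be explored.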

12

Discriminative patterns: Weighted Substructure Mining

w_i > 0: positive class

w_i < 0: negative class

Weighted Substructure Mining: Patterns with large frequency difference

Not Anti-Monotonic: Use a bound
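A minimal sketch of how a bound can replace anti-monotonicity, under two assumptions not stated on the slide: the score of pattern k is |Σ_i w_i x_ik| with 0/1 indicators x_ik, and item sets again stand in for subgraphs. Because any extension of a pattern occurs in a subset of the graphs where the pattern occurs, its score cannot exceed the larger of the positive weight mass and the negative weight mass over those graphs. The names `weighted_patterns` and `tau` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def weighted_patterns(
    db: List[Set[str]], w: List[float], tau: float
) -> Dict[FrozenSet[str], float]:
    """Return item sets whose weighted score |sum_i w_i x_ik| is >= tau."""
    items = sorted(set().union(*db)) if db else []
    found: Dict[FrozenSet[str], float] = {}

    def grow(pattern: FrozenSet[str], start: int) -> None:
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            # Weights of the transactions that contain the child pattern.
            hit = [wi for t, wi in zip(db, w) if child <= t]
            pos = sum(wi for wi in hit if wi > 0)
            neg = -sum(wi for wi in hit if wi < 0)
            # Any extension occurs in a subset of these transactions, so
            # its score cannot exceed max(pos, neg).
            if max(pos, neg) < tau:
                continue
            score = abs(pos - neg)
            if score >= tau:
                found[child] = score
            grow(child, idx + 1)

    grow(frozenset(), 0)
    return found


# Toy data (assumption): first two transactions positive, last two negative.
db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"c"}]
w = [1.0, 1.0, -1.0, -1.0]
result = weighted_patterns(db, w, tau=2.0)
```

Note that a pattern can fail the threshold itself and still be extended, since the bound, not the score, decides whether its subtree is explored.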

13

Multiclass version

Multiple weight vectors, one per class, each defined by two cases: (graph belongs to the class) and (otherwise)

Search patterns overrepresented in a class

14

EM-based clustering of graphs

Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006       

15

EM-based graph clustering

Motivation

Learning a mixture model in the feature space of patterns

Basis for more complex probabilistic inference

L1 regularization & Graph Mining

E-step -> Mining -> M-step

16

Probabilistic Model

Binomial Mixture

Each Component

Mixing weight for each cluster

Feature vector of a graph (0 or 1)

Parameter vector for each cluster

17

Function to minimize

L1-regularized log likelihood

Baseline constant: ML parameter estimate using a single binomial distribution

In the solution, most parameters are exactly equal to the baseline constants
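The equations on these slides did not survive in the transcript. One plausible way to write the model and the objective, offered as a hedged reconstruction rather than the slide's exact notation, is given below. The symbols are assumptions: $\alpha_\ell$ is the mixing weight of cluster $\ell$, $x \in \{0,1\}^d$ the feature vector of a graph, $\theta_\ell$ the parameter vector of cluster $\ell$, $\theta_{0k}$ the baseline constant for pattern $k$, $c$ the number of clusters, $n$ the number of graphs, and $\lambda$ the regularization strength. The logistic parameterization of each component is also an assumption.

```latex
\begin{aligned}
p(x \mid \Theta) &= \sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(x \mid \theta_\ell),
\qquad
p_\ell(x \mid \theta_\ell) = \prod_{k=1}^{d} \frac{\exp(\theta_{\ell k} x_k)}{1 + \exp(\theta_{\ell k})} \\
\min_{\Theta} \quad & -\sum_{i=1}^{n} \log \sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(x_i \mid \theta_\ell)
\;+\; \lambda \sum_{\ell=1}^{c} \sum_{k=1}^{d} \left| \theta_{\ell k} - \theta_{0k} \right|
\end{aligned}
```

Under this form, the L1 penalty pulls each $\theta_{\ell k}$ toward $\theta_{0k}$, which is why most parameters end up exactly at the baseline.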

18

E-step

Active pattern: a pattern whose parameter differs from its baseline constant in at least one cluster

E-step computed only with active patterns (computable!)
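Why the E-step needs only the active patterns: a pattern whose parameter equals the same baseline constant in every cluster contributes an identical factor to every cluster's likelihood, so it cancels when the posterior is normalized. The sketch below assumes a logistic parameterization of each 0/1 component (an assumption, not confirmed by the transcript); the function name `e_step` and the toy inputs are hypothetical.

```python
import math
from typing import Dict, List


def e_step(
    x_active: List[Dict[int, int]],
    alpha: List[float],
    theta: List[Dict[int, float]],
) -> List[List[float]]:
    """Posterior cluster probabilities using only the active patterns.

    x_active[i][k] is the 0/1 indicator of active pattern k in graph i,
    theta[l][k] the parameter of active pattern k in cluster l.
    """
    posteriors = []
    for x in x_active:
        log_joint = []
        for a, th in zip(alpha, theta):
            # log of a * prod_k exp(th_k * x_k) / (1 + exp(th_k))
            lj = math.log(a)
            for k, t in th.items():
                lj += t * x.get(k, 0) - math.log1p(math.exp(t))
            log_joint.append(lj)
        # Normalize in log space for numerical stability.
        top = max(log_joint)
        weights = [math.exp(v - top) for v in log_joint]
        total = sum(weights)
        posteriors.append([v / total for v in weights])
    return posteriors


# Toy input (assumption): one active pattern, two clusters, two graphs.
post = e_step([{0: 1}, {0: 0}], [0.5, 0.5], [{0: 2.0}, {0: -2.0}])
```

The graph that contains the pattern is assigned mostly to the cluster with the positive parameter, and the other graph to the cluster with the negative one.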

19

M-step

Putative cluster assignment by E-step

Each parameter is solved separately

Use graph mining to find active patterns

Then, solve it only for active patterns

20

Solution

Occurrence probability in a cluster

Overall occurrence probability

21

Important Observation

For an active pattern k, the occurrence probability in a graph cluster is significantly different from the average

22

Mining for Active Patterns F

F is rewritten in the following form

Active patterns can be found by graph mining! (multiclass)

23

Experiments: RNA graphs

Stem as a node

Secondary structure by RNAfold

0/1 Vertex label (self loop or not)

24

Clustering RNA graphs

Three Rfam families

Intron GP I (Int, 30 graphs)

SSU rRNA 5 (SSU, 50 graphs)

RNase bact a (RNase, 50 graphs)

Three bipartition problems

Results evaluated by ROC scores (Area under the ROC curve)

25

Examples of RNA Graphs

26

ROC Scores

27

No of Patterns & Time

28

Found Patterns

29

Summary (EM)

Probabilistic clustering based on substructure representation

Inference helped by graph mining

Many possible extensions: Naïve Bayes; Graph PCA, LFD, CCA; Semi-supervised learning

Applications in Biology?

30

Graph Boosting

Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006

31

Graph Regression Problem

Known as the QSAR problem in chemical informatics (Quantitative Structure-Activity Relationship analysis)

Given a graph, predict a real value

Typically, features (descriptors) are given

32

QSAR with conventional descriptors

#atoms  #bonds  #rings  …  Activity
22      25      …       …  3
20      21      …       …  1.2
23      24      …       …  0.77
11      11      …       …  -3.52
21      22      …       …  -4

33

Motivation of Graph Boosting

Descriptors are not always available

New features by obtaining informative patterns (i.e., subgraphs)

Greedy pattern discovery by Boosting + gSpan

Linear Programming (LP) Boosting for reducing the number of graph mining calls

Accurate prediction & interpretable results

34

Molecule as a labeled graph

[Figure: a molecule drawn as a labeled graph with C and O atoms as nodes]

35

QSAR with patterns

Pattern 1  Pattern 2  Pattern 3  ...  Activity
 1          1          1         ...   3
-1          1         -1         ...   1.2
-1          1         -1         ...   0.77
-1          1         -1         ...  -3.52
 1          1         -1         ...  -4

[Figure: molecular graphs with C, O and Cl atoms; the activity f(?) of a new molecule is to be predicted]

36

Sparse regression in a very high dimensional space

G: all possible patterns (intractably large)

|G|-dimensional feature vector x for a molecule

Linear Regression

$f(x) = \sum_{j=1}^{d} \alpha_j x_j$

Use L1 regularizer to have sparse α

Select a tractable number of patterns

37

Problem formulation

We introduce the ε-insensitive loss and an L1 regularizer

m: # of training graphs

d = |G|

ξ+, ξ- : slack variables

ε: parameter
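The optimization problem itself is missing from the transcript. A standard way to combine the ε-insensitive loss with an L1 regularizer as a linear program, given here as a hedged reconstruction, is shown below. The trade-off constant $C$ and the target activity $y_i$ of training graph $i$ are assumptions that do not appear in the surviving text; $x_{ij}$ denotes feature $j$ of graph $i$.

```latex
\begin{aligned}
\min_{\alpha,\, \xi^{+},\, \xi^{-}} \quad & \|\alpha\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right) \\
\text{s.t.} \quad & \sum_{j=1}^{d} \alpha_j x_{ij} - y_i \le \varepsilon + \xi_i^{+}, \\
& y_i - \sum_{j=1}^{d} \alpha_j x_{ij} \le \varepsilon + \xi_i^{-}, \\
& \xi_i^{+} \ge 0, \quad \xi_i^{-} \ge 0, \qquad i = 1, \dots, m
\end{aligned}
```

Errors smaller than ε cost nothing, and larger errors are paid for linearly through the slack variables.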

38

Dual LP

Primal: Huge number of weight variables

Dual: Huge number of constraints

LP1-Dual

39

Column Generation Algorithm for LP Boost (Demiriz et al., 2002)

Start from the dual with no constraints

Add the most violated constraint each time

Guaranteed to converge

[Figure: constraint matrix with the used part highlighted]

40

Finding the most violated constraint

Constraint for a pattern (shown again)

$-1 \le \sum_{i=1}^{m} u_i x_{ij} \le 1$

Finding the most violated one

$j^* = \operatorname{argmax}_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Searched by weighted substructure mining
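As a small illustration of this step, the sketch below scans an explicit feature matrix for the column j with the largest |Σ_i u_i x_ij| and reports it only if that value exceeds 1, i.e. only if the constraint is actually violated. The explicit matrix is an assumption made for illustration; in the method described here the pattern space is far too large to enumerate and the search is done by weighted substructure mining. The name `most_violated` and the values of `X` and `u` are hypothetical.

```python
from typing import List, Optional, Tuple


def most_violated(
    X: List[List[int]], u: List[float], tol: float = 1e-9
) -> Optional[Tuple[int, float]]:
    """Return (j, gain) for the column maximizing |sum_i u_i x_ij|,
    or None when every column already satisfies gain <= 1."""
    n_cols = len(X[0])
    gains = [
        abs(sum(ui * row[j] for ui, row in zip(u, X))) for j in range(n_cols)
    ]
    j_best = max(range(n_cols), key=gains.__getitem__)
    if gains[j_best] <= 1.0 + tol:
        return None
    return j_best, gains[j_best]


# Toy data (assumption): rows are graphs, columns are +1/-1 pattern features.
X = [[1, 1, 1], [-1, 1, -1], [-1, 1, -1], [1, 1, -1]]
u = [0.8, -0.7, -0.6, 0.1]
best = most_violated(X, u)
```

Returning `None` corresponds to the stopping condition of column generation: no pattern violates its constraint, so the current solution is optimal for the full problem.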

41

Algorithm Overview

Iteration

Find a new pattern by graph mining with weight u

If all constraints are satisfied, break

Add a new constraint

Update u by LP1-Dual

Return

Convert dual solution to obtain primal solution α

42

Speed-up by adding multiple patterns (multiple pricing)

So far, the most violated pattern is chosen

$j^* = \operatorname{argmax}_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Mining and inclusion of top k patterns at each iteration

Reduction of the number of mining calls
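Multiple pricing can be sketched as a top-k selection over the same gains |Σ_i u_i x_ij|, keeping only columns that violate the bound of 1. As before, the explicit feature matrix is an illustrative assumption standing in for the mined pattern space, and `top_k_violated`, `X` and `u` are hypothetical names and values.

```python
import heapq
from typing import List, Tuple


def top_k_violated(
    X: List[List[int]], u: List[float], k: int, tol: float = 1e-9
) -> List[Tuple[int, float]]:
    """Return up to k (column, gain) pairs with gain = |sum_i u_i x_ij| > 1,
    sorted from most to least violated."""
    n_cols = len(X[0])
    gains = [
        (j, abs(sum(ui * row[j] for ui, row in zip(u, X))))
        for j in range(n_cols)
    ]
    violated = [g for g in gains if g[1] > 1.0 + tol]
    return heapq.nlargest(k, violated, key=lambda g: g[1])


# Toy data (assumption): rows are graphs, columns are +1/-1 pattern features.
X = [[1, 1, 1], [-1, 1, -1], [-1, 1, -1], [1, 1, -1]]
u = [0.8, -0.7, -0.6, 0.1]
picked = top_k_violated(X, u, k=2)
```

Adding several violated columns per iteration makes each LP slightly larger but lets one mining call supply several constraints.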

43

Speed-up by multiple pricing

44

Clearly negative data

#atoms  #bonds  #rings  …  Activity
22      25      …       …  3
20      21      …       …  1.2
23      24      …       …  0.77
11      11      …       …  -3.52
21      22      …       …  -4
22      20      …       …  -10000
23      19      …       …  -10000

45

Inclusion of clearly negative data

LP2-Primal

l: # of clearly negative data

z: predetermined upper bound

ξ’ : slack variable

46

Experiments

Data from the Endocrine Disruptors Knowledge Base

59 compounds labeled by a real number and 61 compounds labeled by a large negative number

The label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between -1 and 1

Comparison with

Marginalized Graph Kernel + ridge regression

Marginalized Graph Kernel + kNN regression

47

Results with or without clearly negative data

[Figure: results for LP2 (with clearly negative data) and LP1 (without)]

48

Extracted patterns

Interpretable, in contrast to the features implicitly expressed by the Marginalized Graph Kernel

49

Summary (Graph Boosting)

Graph Boosting simultaneously generates patterns and learns their weights

Finite convergence by column generation

Potentially interpretable by chemists

Flexible constraints and speed-up by LP

50

Concluding Remarks

Using graph mining as a part of machine learning algorithms

Weights are essential

Please include weights when you implement your item-set/tree/graph mining algorithms

Make it available on the web! Then ML researchers can use it
