IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 77
Discovering Interesting Molecular Substructures for Molecular Classification
Winnie W. M. Lam and Keith C. C. Chan*, Member, IEEE
Abstract: Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with the discovering of interesting structural patterns in the data so that unseen molecules not originally in the dataset can be accurately classified. To tackle the problem, interesting molecular substructures have to be discovered, and this is done typically by first representing molecular structures as molecular graphs, and then using graph-mining algorithms to discover frequently occurring subgraphs in them. These subgraphs are then used to characterize different classes for molecular classification. While such an approach can be very effective, it should be noted that a substructure that occurs frequently in one class may also occur frequently in another. The discovering of frequent subgraphs for molecular classification may, therefore, not always be the most effective. In this paper, we propose a novel technique called mining interesting substructures in molecular data for classification (MISMOC) that can discover interesting frequent subgraphs not just for the characterization of a molecular class but also for distinguishing it from the others. Using a test statistic, MISMOC screens each frequent subgraph to determine if it is interesting. For those that are interesting, their degrees of interestingness are determined using an information-theoretic measure. When classifying an unseen molecule, its structure is matched against the interesting subgraphs in each class and a total interestingness measure for the unseen molecule to be classified into a particular class is determined, based on the interestingness of each matched subgraph. The performance of MISMOC is evaluated using both artificial and real datasets, and the results show that it can be an effective approach for molecular classification.
Index Terms: Frequent subgraph, graph mining, interestingness, molecular classification, molecular structures.
I. INTRODUCTION
THE SIZE and number of molecular structure databases
have grown rapidly in recent years, owing to advances in
X-ray diffraction or nuclear magnetic resonance (NMR) tech-
nologies [1]. Molecular databases of nucleotide, genome, pro-
tein and nucleic acid, etc., such as NCBI, MINT, SwissMod,
and FSSP in EMBL [2]-[5] have been made available online.
These databases continue to grow in size and diversity, and there is an increasing need for techniques to be developed to
mine these data for interesting patterns [6]. There have been,
for example, attempts to discover such patterns for molecular
classification [1], [7].
Manuscript received March 25, 2009; revised October 26, 2009. Date of current version June 3, 2010. Asterisk indicates corresponding author.
W. W. M. Lam is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]).
*K. C. C. Chan is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNB.2010.2042609
Given a set of molecular structure data preclassified into a
number of classes, the molecular classification problem is con-
cerned with the discovering of interesting structural patterns in
the data so that unseen molecules not originally in the dataset
can be accurately classified. Effective molecular classification
can uncover relationships between structures and functions, and
can have many applications in many areas, such as drug discov-
ery [8], protein folding [9], comparative genomics [10], cancer-
risk assessment [11], and gene evolution [12].
To tackle the molecular classification problem, two types of
approaches have been used. The first is the more traditional ap-
proach of using what is called the quantitative structure-activity
relationship (QSAR) or the quantitative structure-property rela-
tionship (QSPR) model [35] to derive descriptors from chemi-
cal compounds for classification. The second approach, which
is the more recent approach, is to represent molecular struc-
tures as molecular graphs [13] and to discover frequently occur-
ring subgraphs [14] in them for classification. Both approaches
aim to extract attributes that can best represent the structure
of chemical compounds. The latter approach has recently be-
come more popular as it has been shown that using frequent
subgraph analysis for molecular classification can be better
than using the QSAR/QSPR models [36], [37]. This is because
QSAR/QSPR models cannot be used to map chemical structure directly to attribute-based descriptions, such as the internal organization of chemical compounds. Besides, compared with the
use of frequent subgraph analysis, QSAR/QSPR requires much
more user intervention and domain knowledge. For this reason,
graph-mining algorithms that can discover frequently occur-
ring subgraphs in larger graphs have recently become popular
(e.g., WARMR [17], Frequent SubGraph discovery (FSG) [18],
Graph-based Substructure PAtterN mining (gSpan) [19], and
GrAph/Sequence/Tree extractiON (GASTON) [42]). These fre-
quently occurring subgraphs are those that occur frequently enough within a class. The idea of finding frequent subgraphs in different classes of molecular data has been proposed previously and has been shown to be effective [43].
However, it should be noted that, as subgraphs that occur frequently in one class may also occur frequently in another, the discovering of frequent subgraphs for molecular classification may not always be the most effective approach. It does not explicitly find discriminative subgraphs to allow one class to be
easily discriminated from another. There have been some recent
attempts to find such subgraphs between classes, and they are
defined to be subgraphs that appear more frequently in a certain
positive class than another negative class [15], [16]. However,
how much more frequently these subgraphs should appear for them to be considered discriminative is not explicitly stated.
1536-1241/$26.00 © 2010 IEEE
In this paper, we propose a novel graph-mining algorithm for
molecular classification. This algorithm, which is called min-
ing interesting substructures in molecular data for classification
(MISMOC), can discover interesting frequent subgraphs for the
characterization of a molecular class and for the discrimination
of it from one or more of the other classes. The classification problem that MISMOC can tackle is, however, not restricted to binary classification. MISMOC performs its tasks by first filtering out subgraphs that do not occur frequently enough for
the purpose of classification. By using a test statistic, it then
filters out those frequently occurring subgraphs that only appear
as frequently as expected. Those that remain are subgraphs that are interesting in the sense that they not only characterize a class of molecular graphs, but also allow it to be discriminated from the others. For each interesting subgraph, MISMOC
determines a degree of interestingness based on the use of an
information-theoretic measure. When classifying an unseen
molecule that is not in the original dataset, this molecule's struc-
ture is matched against the interesting subgraphs in each class
and a total interestingness measure for the unseen molecule to be classified into a particular class is then determined for the
purpose of classification.
The performance of MISMOC is evaluated with both artificial
and real data. The experimental results show that MISMOC can
discover interesting frequent subgraphs that can both character-
ize and distinguish molecules of one class from the others. It can
also reduce the number of subgraphs that need to be considered
for graph classification by filtering out those subgraphs that are not interesting for classification.
The rest of this paper is organized as follows. Section II
presents a review of existing graph-mining algorithms that can
be used for classifying molecular structures. Using an illustrative example, Section III describes how frequently occurring
subgraphs can be discovered. Section IV presents the details of
our proposed approach, MISMOC. For illustration, Section V
makes use of an example to demonstrate how MISMOC can ef-
fectively perform molecular classification tasks. In Section VI,
we describe how the performance of MISMOC was evaluated.
The results of the experiments that were carried out are pre-
sented. Finally, Section VII summarizes the work and discusses
possible directions for future research.
II. RELATED WORK
Many graph-mining algorithms have been developed to dis-
cover interesting subgraphs in data with complex structures.
Given such data represented in the form of graphs, these algo-
rithms can be used to mine frequent subgraphs in them. These
frequent subgraphs can then be used to tackle the classification
problem [20], [21].
The graph-mining algorithms based on inductive logic pro-
gramming (ILP), for example, have been used to discover fre-
quent subgraphs for classification [22]. An ILP-based algorithm
called WARMR [17], for example, is able to mine frequent sub-
graphs in graph data that are represented as first-order predicate
logic. ILP-based approaches to graph mining, being based on
predicate logic, have the disadvantage that they may not be
very robust to noisy data. Also, when dealing with real-world
databases that tend to be very large, the computational com-
plexity of these algorithms can be too high to handle. These
approaches have to perform a lot of tests for equivalence in or-
der to prune infrequent and semantically redundant subgraphs.
Other than the ILP-based algorithms, there are quite a number
of other graph-mining algorithms that can be used to discover frequent subgraphs. FSG [18], for example, adopts an edge-
based subgraph generation strategy for such purpose. It expands
on a subgraph in a level-by-level manner [23], first enumerating all frequent single- and double-edge subgraphs, and then generating larger subgraphs iteratively by adding one more edge to those generated in the previous iteration. For FSG to
perform its tasks, it has to rely on canonical labeling to check
whether a particular subgraph satisfies a support threshold. If
two graphs are isomorphic, their canonical labels are assumed
to be identical. This canonical labeling process for the determi-
nation of graph isomorphism is memory consuming for large
databases.
Other than FSG, gSpan [19] is also a popular graph-mining algorithm that has been used for graph classification. gSpan
searches for frequent subgraphs over graph canonical forms using
a depth-first search (DFS) strategy. It does so by starting from a
randomly chosen vertex, then visiting and marking the vertices
to which this chosen vertex is connected. This process of
visiting and marking of vertices continues repeatedly until a full
DFS tree is built. For each graph searched, more than one tree may be built with DFS, depending on the order in which the vertices are visited. By means of DFS, gSpan is able
to discover all frequent subgraphs without generating candidate
subgraphs and pruning false positives.
Another algorithm for mining frequent subgraphs is called Gaston [42]. It discovers such subgraphs by first finding fre-
quent paths, then trees, and then, cyclic graphs. It stores all
occurrences of these graphs in an embedding list so that the
frequency of occurrence of a subgraph can be determined by
scanning the embedding list, thereby improving the speed of
the graph-mining process [43].
MoFa [16] has been used to find frequent subgraphs in graph
data by maintaining parallel embeddings for both vertices and
edges. Like Gaston, each such embedding consists of a set of
references to a molecule that point to the atoms and bonds that
form a subgraph. Such embeddings can be extended so that
larger subgraphs can be formed iteratively [16]. MoFa has been
later enhanced to discover discriminative subgraphs [40], [41] with relatively higher support, and these subgraphs can make MoFa a more suitable approach for graph classification.
Subdue [15] is another graph-mining algorithm that discovers
frequent subgraphs. It makes use of the minimum description
length principle to narrow down possible outcomes when trying
to identify subgraphs that best compress the original graph [45].
The graph-mining algorithms described earlier discover fre-
quent subgraphs by building on smaller subgraphs edge by edge.
Subgraph isomorphism for graph matching is required as a part
of the kernels of these algorithms and this process is known
to be nondeterministic polynomial time (NP)-complete. The
discovering of frequent subgraphs using existing graph-mining
algorithms requires a frequency threshold to be supplied. If the
threshold is set too small, one may not be able to discover
enough frequent subgraphs to allow graph classes to be distin-
guished from each other. If the threshold is set too large, one
may discover too many frequent subgraphs that are irrelevant for
classification. As subgraphs that appear frequently in one class
of graphs may also do so in another, the discovering of frequent subgraphs may not always be useful for graph classification.
What is needed for the task is a way to discover subgraphs in a
class that can make it distinguishable from other classes.
In the following, we propose a graph-mining technique called
MISMOC for this purpose. Given a set of frequent subgraphs,
MISMOC can screen out frequent subgraphs that are not useful for classification and retain those that are useful for the
characterization of molecular classes and the discrimination of
one class from another.
III. ILLUSTRATIVE EXAMPLE
To explain why the discovering of frequent subgraphs may not always be useful for graph classification, let us consider an
example. We are given three classes of artificial molecular data
shown in Fig. 1.
Each of these three classes of data contains ten molecules and
each molecule consists of atoms connected with bonds. These
molecules are generated in such a way that the atoms are chosen
from 30 possible atoms, including such atoms as carbon (C),
oxygen (O), iridium (Ir), nobelium (No), and thorium (Th), and
bond types from three possible types, including single, double,
and triple bonds. These molecules can be represented as labeled
molecular graphs with each node used to represent an atom and
each edge as a bond.
Given the set of graph data as shown in Fig. 1, frequent
subgraphs can be discovered in each of class 1, 2, and 3 us-
ing a graph-mining algorithm, such as FSG, and the class of the unknown molecule given in Fig. 2 can then be predicted. These algorithms require a threshold to be given by the users to define how
frequent a subgraph should appear for it to be considered fre-
quent.
For the purpose of illustration, we choose FSG here, as graph-mining algorithms such as gSpan do not perform subgraph pruning. FSG, however, can discover maximal frequent subgraphs and can better avoid the problems caused by the discovering of subgraphs that are too fragmented.
By setting a support threshold of 80% (i.e., any subgraph that
occurs in at least eight out of ten graphs), a number of frequent
subgraphs can be found and they are given in Table I. It should be
noted that the same frequent subgraph, a nitrogen atom double-
bonded with an oxygen atom (i.e., N==O), appears in 80% of the graphs in each of the three classes (see Table I).
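The support bookkeeping in this walkthrough is simple enough to sketch. The following minimal Python illustration (the function name and counts are ours, mirroring the N==O example, not part of any mining tool) computes per-class support and flags a subgraph that is frequent in every class, and hence useless for discrimination:

```python
def per_class_support(occurrences, class_sizes):
    """Support of one subgraph in each class, as in the Table I example.

    occurrences: {class_name: no. of graphs in that class containing the subgraph}
    class_sizes: {class_name: total no. of graphs in that class}
    """
    return {c: occurrences[c] / class_sizes[c] for c in class_sizes}

# N==O occurs in 8 of the 10 graphs of every class, so at an 80% support
# threshold it is "frequent" everywhere -- and useless for telling the
# classes apart.
support = per_class_support({"class1": 8, "class2": 8, "class3": 8},
                            {"class1": 10, "class2": 10, "class3": 10})
non_discriminative = all(s >= 0.8 for s in support.values())
```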
Since the choice of threshold does not allow any unique fre-
quent subgraph to be discovered for each class, we lower the
support threshold by 10%. The results are shown in Table II.
More frequent subgraphs are discovered this time when the
support threshold is lowered to 70%. However, the newly dis-
covered frequent subgraphs for class 2 and 3 are still the same
and a graph with such subgraphs may be classified into either class 2 or 3. This means that the discovered frequent subgraphs cannot allow graphs in class 2 to be easily discriminated from class 3.
Fig. 1. Training molecular data.
When the support threshold is further lowered to 60%,
more frequent subgraphs are discovered and they are shown in
Table III. Unfortunately, the newly discovered frequent sub-
graphs, for each of the three classes, still overlap with each
other. A graph characterized by these subgraphs can be classified
Fig. 2. Unknown molecule.
TABLE I MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 80%)
TABLE II MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 70%)
TABLE III MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 60%)
into one or more classes. For example, if a graph G is characterized by a subgraph shared by classes 2 and 3, it can be classified into either class 2 or 3; if G is characterized by a subgraph shared by classes 1 and 2, it can be classified into either class 1 or 2. If G is characterized by both such subgraphs, then there is a chance that it can be classified into any of class 1, 2, or 3, as one of them appears six times across classes 1 and 2 and the other appears seven times across classes 2 and 3.
TABLE IV MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 50%)
To find more interesting and useful frequent subgraphs for
classification, the support threshold is further lowered to 50%.
Using FSG again, the frequent subgraphs discovered are
shown in Table IV. This time, many more frequent subgraphs
Fig. 3. Classifying the unseen molecule in Fig. 2 with FSG.
are discovered, and some of the subgraphs discovered in each of S^(1), S^(2), and S^(3) do not overlap with each other.
If we have to classify the testing sample in Fig. 2, it should be noted that this graph is characterized by three frequent subgraphs, one from each of S^(1), S^(2), and S^(3), respectively (see Fig. 3). It is, therefore, hard to decide into which class this graph should be classified based on the subgraphs that it contains. If one takes a closer look at the frequency of appearance of each of these three subgraphs in each class, one may discover that even though the class-1 subgraph is not frequent enough in classes 2 and 3, it appears in 40% of the graphs in these classes. This is the case also with the class-3 subgraph: although it only appears in 10% of the graphs in class 1, it appears in 40% of the graphs in class 2. Of these three subgraphs, the class-2 subgraph is the most interesting and unique in the sense that, while it appears in 50% of the graphs in class 2, it appears in only 10% of the graphs in both classes 1 and 3. In other words, this subgraph provides more evidence for a graph it characterizes to be classified into class 2 than the other subgraphs do. In fact, it is for this reason that the graph in Fig. 2 more likely belongs to class 2 than to any other class.
In order to discover more frequent subgraphs that may be use-
ful for classifying the unseen molecule, the support threshold is
further reduced to 40%, and new frequent subgraphs are discovered, as shown in Table V. The newly discovered subgraphs are S^(1)_8, S^(1)_9, and S^(1)_10 in class 1; S^(2)_7, S^(2)_8, S^(2)_9, and S^(2)_10 in class 2; and S^(3)_7, S^(3)_8, and S^(3)_9 in class 3. Although the support threshold is lowered to 40%, these subgraphs all appear frequently in the other classes as well; for example, S^(1)_8 was previously discovered as frequent subgraph S^(2)_2 in class 2 and S^(3)_2 in class 3. The case is the same for the others. We tried to further reduce the support threshold to 30%, but again each newly discovered subgraph had already been found at a higher threshold value.
The actual relative frequency of appearance of each frequent subgraph in each class may, therefore, provide useful information for classification. The idea that MISMOC uses to filter out uninteresting and irrelevant frequent subgraphs, so that molecular classification can be performed effectively, is therefore to take such information into consideration and to measure the interestingness of each frequent subgraph relative to the others.
TABLE V MAXIMAL FREQUENT SUBGRAPHS (SUPPORT THRESHOLD = 40%)
IV. MISMOC: A GRAPH-MINING TECHNIQUE FOR
MOLECULAR CLASSIFICATION
The molecular classification problem, which this paper addresses, can be stated more formally as follows. Given a set of molecular structure data G, containing n molecules preclassified into p classes, the molecular classification problem is concerned with the discovering of interesting patterns in the data to allow unseen graphs not originally in G to be correctly classified into one of the p classes.
The n molecules in G can be represented as n molecular graphs G_1, G_2, ..., G_n, where G_i = G_i(V_i, E_i), i ∈ {1, ..., n}, is a labeled graph with vertices representing atoms and edges representing bonds between atoms.
For applications in bioinformatics, the molecular graphs can be generalized so that the vertices can represent molecules, such as amino acids, and the edges can represent the chemical bonds that connect them. The p classes into which the n molecules and their corresponding molecular graphs are classified can be represented as C^(1), ..., C^(p), where C^(i) = {G^(i)_1, ..., G^(i)_{c_i}} ⊆ G, i = 1, ..., p.
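As a rough illustration of this problem setup, a labeled molecular graph G_i = (V_i, E_i) can be represented as follows (a minimal sketch; the class and field names are ours, not notation from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class MolecularGraph:
    """A labeled molecular graph: vertices carry atom labels and
    edges carry bond types, as in the formal statement above."""
    atoms: dict = field(default_factory=dict)   # vertex id -> atom label
    bonds: list = field(default_factory=list)   # (u, v, bond type) triples

# The N==O fragment from the running example, with one extra carbon.
g = MolecularGraph(atoms={0: "N", 1: "O", 2: "C"},
                   bonds=[(0, 1, "double"), (0, 2, "single")])
```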
In the following, we present the details of the MISMOC technique, which can be used to effectively improve the accuracy of graph classification. MISMOC performs its tasks in several stages. It first searches for frequent subgraphs using an existing algorithm, such as FSG or gSpan. Since a subgraph that appears frequently in one class may also do so in another, not all frequent subgraphs are useful and interesting for classification. To screen out the uninteresting ones, MISMOC makes use of a test statistic to distinguish interesting subgraphs from uninteresting ones.
Once the interesting frequent subgraphs are identified, the interestingness of each of these frequent subgraphs is measured based on an information-theoretic measure called the weight of evidence. These measures can be combined to form an overall total interestingness measure for the purpose of classifying an unseen graph.
A. Discovering Frequent Subgraphs
To discover frequent subgraphs in a graph database, there are several graph-mining algorithms to choose from. For MISMOC, users can choose between two commonly used graph-mining algorithms, FSG [18] and gSpan [19]. Given the dataset G = {G_1, ..., G_i, ..., G_n} as described earlier, one can use either of these algorithms to discover a set of frequent subgraphs S^(1), ..., S^(i), ..., S^(p), where S^(i) = {S^(i)_1, ..., S^(i)_{n_i}}, i = 1, ..., p, for each of the corresponding p classes C^(1), ..., C^(i), ..., C^(p).
1) FSG Algorithm: The FSG algorithm can find all frequent
subgraphs in each class of molecular graphs using the Apriori
algorithm [23]. It does so by treating edges in the graphs as
items in transactions so that the Apriori algorithm can be used to
discover frequent subgraphs, just as it is used to discover frequent itemsets: in the same way that the Apriori algorithm grows frequent itemsets by adding a single item at a time, the FSG algorithm grows frequent subgraphs by adding a single edge at a time.
Fig. 4. Algorithm of FSG.
Briefly, FSG can be described as follows. For each class C^(i), i = 1, ..., p, FSG first finds a set of frequent one-edge subgraphs and a set of frequent two-edge subgraphs. Then, based on these two sets of intermediate subgraphs, it starts to iteratively generate candidate subgraphs whose size is greater than that of the previous frequent subgraphs by one edge. FSG then counts the frequency of each of these candidates and prunes the subgraphs that do not satisfy the support threshold. The qualified subgraphs are further expanded and their frequencies are verified against the same support condition to prune the lattice of frequent subgraphs. The final set of frequent subgraphs S^(1), ..., S^(i), ..., S^(p), where S^(i) contains all frequent k-subgraphs, is generated for each class. Letting g^k denote a k-subgraph with k edges, C^k a set of candidate subgraphs with k edges, and F^k(i) a set of frequent k-subgraphs for class C^(i), the FSG algorithm can be summarized in Fig. 4 [18].
2) gSpan Algorithm: The gSpan algorithm [19] discovers a set of frequent subgraphs for each graph class by mapping each graph in the class to a unique minimum DFS code as its canonical label. First, gSpan sorts all vertices and edges in the set of graph transactions in each class according to their frequency of occurrence and removes the infrequent vertices and edges from the graphs in C^(i). The remaining vertices and edges are relabeled and sorted in descending order of frequency. The set of all frequent one-edge subgraphs is then formed, and it acts as the seed for generating more children. A subprocedure called SubgraphMiner expands each one-edge frequent subgraph by adding one edge at a time. In SubgraphMiner, if s is the minimum DFS code of the graph it represents, s is added to the frequent subgraph set S^(i). All potential children are then generated with a one-edge growth, and SubgraphMiner is run recursively for each child. After this, the edge is removed from each graph in C^(i) when all the descendants of this one-edge graph have been searched. When all frequent k-subgraphs and their descendants have been generated, the final set of frequent subgraphs S^(i), i = 1, ..., p, is generated for each class. The gSpan algorithm is summarized in Fig. 5 [19].
Fig. 5. Algorithm of gSpan.
B. Discovering Interesting Frequent Subgraphs Using MISMOC
FSG and gSpan aim at discovering frequent subgraphs S^(i) = {S^(i)_1, ..., S^(i)_j, ..., S^(i)_{n_i}}, i = 1, ..., p, in each of the corresponding graph classes C^(1), ..., C^(i), ..., C^(p). These algorithms were not originally developed for graph classification. Hence, while the discovered frequent subgraphs can characterize each graph class, they may not be very useful in discriminating one class from another. This is because a frequent subgraph that appears frequently in one class may also do so in another, and such frequent subgraphs are not interesting for classification. In this section, we present the methodology that MISMOC uses to identify subgraphs that are interesting and useful for classification. This methodology is based on the use of a test statistic [24]-[26] and its details are given in Fig. 6.
Fig. 6. Algorithm of MISMOC.
Once the set of frequent subgraphs S^(i), i = 1, ..., p, is discovered for each of C^(i), i = 1, ..., p, respectively, the probability that a graph G is in C^(i), i ∈ {1, ..., p}, given that G is characterized by a frequent subgraph S^(i)_j ∈ S^(i), j ∈ {1, ..., n_i}, can be estimated as follows:

  Pr(G ∈ C^(i) | G is characterized by S^(i)_j)
    = (total no. of graphs in C^(i) that are characterized by S^(i)_j) / (total no. of graphs in the entire dataset that are characterized by S^(i)_j).   (1)
If Pr(G ∈ C^(i) | G is characterized by S^(i)_j) is not much different from Pr(G ∈ C^(i)), i.e., whether or not G is characterized by S^(i)_j makes very little difference, then S^(i)_j should not be considered very interesting in determining if G should be classified into C^(i). Otherwise, S^(i)_j can be very interesting.
To objectively determine if the two probabilities are different,
we make use of a test statistic [24]-[26], d_ji, which is defined as follows:

  d_ji = z_ji / sqrt(γ_ji)   (2)

where z_ji is the standardized residual

  z_ji = [n Pr(G is characterized by S^(i)_j) Pr(G ∈ C^(i) | G is characterized by S^(i)_j) − n Pr(G ∈ C^(i)) Pr(G is characterized by S^(i)_j)] / sqrt(n Pr(G ∈ C^(i)) Pr(G is characterized by S^(i)_j))   (3)

and γ_ji is the maximum likelihood estimate of the variance of z_ji, given by

  γ_ji = (1 − Pr(G ∈ C^(i))) (1 − Pr(G is characterized by S^(i)_j)).   (4)
Based on [24], if |d_ji| > 1.96, we can conclude that Pr(G ∈ C^(i) | G is characterized by S^(i)_j) is significantly different from Pr(G ∈ C^(i)), and therefore, that the subgraph S^(i)_j is interesting and useful for classification. If d_ji > +1.96, the presence of S^(i)_j in a graph G provides evidence supporting G being classified into C^(i); if d_ji < −1.96, the presence of the frequent subgraph S^(i)_j provides negative evidence against G being classified into C^(i). In either case, S^(i)_j can be considered an interesting frequent subgraph.
With the use of this test statistic, MISMOC screens each set of frequent subgraphs S^(i) = {S^(i)_1, ..., S^(i)_j, ..., S^(i)_{n_i}}, i = 1, ..., p, to retain only those that are interesting. The set of interesting frequent subgraphs discovered for each of C^(1), ..., C^(i), ..., C^(p) is denoted Ŝ^(i) = {Ŝ^(i)_1, ..., Ŝ^(i)_j, ..., Ŝ^(i)_{ñ_i}}, i = 1, ..., p, with ñ_i < n_i.
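The screening statistic of eqs. (2)-(4) can be computed from four counts. A sketch, under our reading of the statistic as an adjusted residual comparing the observed count of class-i graphs containing a subgraph with the count expected if class membership and subgraph occurrence were independent (the counts below are illustrative, not from the paper):

```python
import math

def adjusted_residual(n_class_and_sub, n_sub, n_class, n):
    """Test statistic d_ji of eqs. (2)-(4).

    n_class_and_sub: graphs in class i that contain subgraph S_j
    n_sub:           graphs in the whole dataset that contain S_j
    n_class:         graphs in class i
    n:               graphs in the whole dataset
    """
    p_class = n_class / n                      # Pr(G in C^(i))
    p_sub = n_sub / n                          # Pr(G contains S_j)
    expected = n * p_class * p_sub             # count expected under independence
    z = (n_class_and_sub - expected) / math.sqrt(expected)   # eq. (3)
    gamma = (1 - p_class) * (1 - p_sub)        # eq. (4)
    return z / math.sqrt(gamma)                # eq. (2)

# A subgraph seen in 5 of 10 class-2 graphs but only 1 of 10 graphs in
# each of the two other classes (7 occurrences among 30 graphs in total).
d = adjusted_residual(n_class_and_sub=5, n_sub=7, n_class=10, n=30)
keep = abs(d) > 1.96   # retained as an interesting frequent subgraph
```

Here d ≈ 2.44 > 1.96, so the subgraph passes the screen for class 2, matching the intuition from the Section III walkthrough.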
C. Interestingness Measure as a Function of the
Weight of Evidence
The interesting frequent subgraphs provide positive or nega-
tive evidence supporting or refuting the classification of a graph
into a particular class. MISMOC measures how interesting these
frequent subgraphs are with the use of an interestingness mea-
sure defined in terms of an information-theoretic weight of evi-
dence measure.
The more interesting a frequent subgraph is for a class, the greater the difference between the two probabilities Pr(G ∈ C^(i) | G is characterized by S^(i)_j) and Pr(G ∈ C^(i)). Hence, the interestingness measure is again defined as a function of these two probabilities. Specifically, the more interesting S^(i)_j is, the greater the ratio between Pr(G ∈ C^(i) | G is characterized by S^(i)_j) and Pr(G ∈ C^(i)). This ratio can be measured with a mutual information measure I(G ∈ C^(i) : G is characterized by S^(i)_j) between the events G ∈ C^(i) and G is characterized by S^(i)_j:

  I(G ∈ C^(i) : G is characterized by S^(i)_j) = log [ Pr(G ∈ C^(i) | G is characterized by S^(i)_j) / Pr(G ∈ C^(i)) ].   (5)
Based on the mutual information measure, the weight of evidence provided by S^(i)_j for or against the classification of G into C^(i) can be defined as follows:

  W^(i)(G | S^(i)_j) = W(G ∈ C^(i) / G ∉ C^(i) | G is characterized by S^(i)_j)
    = I(G ∈ C^(i) : G is characterized by S^(i)_j) − I(G ∉ C^(i) : G is characterized by S^(i)_j).   (6)

W^(i)(G | S^(i)_j) can be interpreted as a measure of the difference in the gain in information when a graph G that contains S^(i)_j is classified into C^(i), as opposed to the other classes. W^(i)(G | S^(i)_j) is positive if S^(i)_j provides positive evidence supporting the classification of G into C^(i); otherwise, it is negative.
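The weight of evidence in eqs. (5) and (6) reduces to two log-ratios. A minimal sketch (natural logarithm assumed, since the paper does not fix the log base; the probabilities below are illustrative):

```python
import math

def weight_of_evidence(p_class_given_sub, p_class):
    """Eq. (6): I(G in class : G contains S) minus
    I(G not in class : G contains S), each computed via eq. (5)."""
    i_for = math.log(p_class_given_sub / p_class)              # eq. (5)
    i_against = math.log((1 - p_class_given_sub) / (1 - p_class))
    return i_for - i_against                                   # eq. (6)

# A subgraph seen mostly in class 2: Pr(class | sub) = 5/7 vs. prior 1/3.
w_pos = weight_of_evidence(5 / 7, 1 / 3)   # positive evidence for class 2
w_neg = weight_of_evidence(0.1, 1 / 3)     # negative evidence (rarer than prior)
```

A conditional probability above the class prior yields a positive weight, and one below the prior yields a negative weight, as the interpretation following eq. (6) requires.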
D. Classification Using a Total Interestingness Measure
Given the interesting frequent subgraphs Ŝ^(1), ..., Ŝ^(i), ..., Ŝ^(p) discovered for the corresponding p classes C^(1), ..., C^(i), ..., C^(p), an unseen graph G not originally in the dataset can be classified by matching it against the subgraphs in each Ŝ^(i), i = 1, ..., p.
For every subgraph Ŝ^(i)_j ∈ Ŝ^(i) that G matches, there is some evidence W^(i)(G | Ŝ^(i)_j) provided by it for or against the classification of G into C^(i). Assuming that G matches m_i ≤ ñ_i interesting frequent subgraphs s^(i)_1, ..., s^(i)_j, ..., s^(i)_{m_i} in Ŝ^(i), MISMOC then computes a total interestingness measure for G to be classified into C^(i). This total interestingness measure is defined as the sum of the weights of evidence provided by each individual interesting frequent subgraph s^(i)_j for or against G being classified into C^(i):

  W^(i)(G) = W(G ∈ C^(i) / G ∉ C^(i) | G is characterized by s^(i)_1, ..., s^(i)_j, ..., s^(i)_{m_i})
           = Σ_{j=1}^{m_i} W(G ∈ C^(i) / G ∉ C^(i) | G is characterized by s^(i)_j).   (7)
The value of W(i) (G| S(i)j ) increases with the number andstrength of the matched subgraphs in s
(i)1 , . . . , s
(i)j , . . . , s
(i)m i
that provide positive evidence supporting G to be classified
into (i) , whereas the value of W(i) (G|S(i)j ) decreases if somematched subgraphs provide negative evidence refuting the clas-
sification of G into (i) . The total interestingness measure for
G to be classified into each of (1 ), . . . , (i) , . . . , (p) is de-termined and MISMOC assigns G to the class, which gives the
greatest total interestingness measure.
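A minimal sketch of the classification rule in (7): sum the weights of evidence of the matched subgraphs per class and pick the class with the greatest total. The subgraph names and weight values below are hypothetical:

```python
def classify(graph_matches, weights):
    """Assign a graph to the class with the greatest total interestingness, per (7).
    graph_matches[c] lists the interesting subgraphs of class c that the graph matches;
    weights[c][s] is the weight of evidence W provided by subgraph s for class c."""
    totals = {
        c: sum(weights[c][s] for s in matched)
        for c, matched in graph_matches.items()
    }
    # The predicted class is the one with the maximum total weight of evidence
    return max(totals, key=totals.get), totals

# Hypothetical matched subgraphs and weights for a three-class problem
weights = {1: {"s4": 0.8, "s5": -0.5}, 2: {"s6": 1.1, "s7": 0.9}, 3: {"s8": -0.2}}
matches = {1: ["s4", "s5"], 2: ["s6", "s7"], 3: ["s8"]}
label, totals = classify(matches, weights)
print(label)  # class 2 has the largest total (2.0)
```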
Compared to algorithms that classify graphs by considering only frequent subgraphs, MISMOC has the advantage that it discovers frequent subgraphs that are considered interesting according to an objective statistical evidence measure. Instead of relying solely on the appearance of frequent subgraphs during classification, MISMOC takes into consideration only those that are useful and interesting. These frequent subgraphs are unique and can have biological meaning. Other frequent graph-mining algorithms such as FSG and gSpan can handle only a single class of data; if there are two or more classes, the comparative effect of a subgraph across all classes is ignored. There is always a chance that two or more classes share the same frequent subgraph. With an interestingness measure, we can distinguish interesting frequent subgraphs from uninteresting ones across multiple classes.
LAM AND CHAN: DISCOVERING INTERESTING MOLECULAR SUBSTRUCTURES FOR MOLECULAR CLASSIFICATION 85
V. ILLUSTRATIVE EXAMPLE CONTINUED
To illustrate how MISMOC works, let us consider the example in Section III again. Given the frequent subgraphs discovered using FSG at a support threshold of 50%, MISMOC obtains for each of the 15 frequent subgraphs its frequency of occurrence in each class (see Table V). It then screens for all frequent subgraphs that are interesting using the test statistic given by (2). The values of the test statistic for each frequent subgraph in each class are also given in Table VI.
As described in the last section, subgraphs with |d_ji| < 1.96 will be filtered out, and the remaining subgraphs will form the set of interesting subgraphs for graph classification. Since d_41, d_51, d_62, d_72, d_83, and d_93 are greater than 1.96, we conclude that of all 15 frequent subgraphs discovered, only S_4^(1) and S_5^(1), S_6^(2) and S_7^(2), and S_8^(3) and S_9^(3) are interesting frequent subgraphs for classes 1, 2, and 3, respectively.
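The screening step can be sketched as a simple filter; the subgraph names and d-values below are stand-ins, not those of Table VI:

```python
def screen_interesting(d_values, z=1.96):
    """Keep only the subgraphs whose test statistic exceeds the critical
    value z = 1.96 (95% confidence); those with |d| < z are filtered out."""
    return {name: d for name, d in d_values.items() if abs(d) > z}

# Stand-in test statistics for the frequent subgraphs of one class
d = {"S1": 0.42, "S4": 2.31, "S5": 2.05}
print(sorted(screen_interesting(d)))  # ['S4', 'S5']
```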
Given these interesting frequent subgraphs, we can classify the test graph shown in Fig. 2 by computing the total interestingness measure for it to be classified into each class. Using (1) to (7),

W^(1)(G) = W(Class = 1 / Class ≠ 1 | S_7^(1), S_6^(2), S_5^(3))
         = W(Class = 1 / Class ≠ 1 | S_6^(2))
         = −1.5018.

Similarly, W^(2)(G) = 2.2288 and W^(3)(G) = −1.5732. As the value of W^(2)(G) for class 2 is the largest of all, we can conclude that the new sample belongs to class 2. Moreover, there is negative evidence against the test graph being classified into classes 1 and 3; therefore, the new sample is not likely to belong to class 1 or 3.
VI. EXPERIMENTS AND RESULTS
To evaluate the effectiveness of MISMOC, we tested it using both artificial and real data. We compared its performance with that of two graph-classification algorithms based on FSG and gSpan. For experimentation, we used the executable files of these algorithms available from [27] and [28], respectively. The classification results were obtained using tenfold cross-validation with an implementation of the support vector machine (SVM) available at [29].
A. Performance Evaluation
The performance of a classifier is usually evaluated using the average classification accuracy, and the results are typically presented in a confusion matrix (see Table VII), which has four entries: the number of true positive cases (TP), true negative cases (TN), false positive cases (FP), and false negative cases (FN). The average accuracy is calculated as follows [30]:

Average Accuracy = (TP + TN) / (TP + FN + FP + TN).   (8)
While evaluation based on the use of the classification accuracy
measure may be popular, it may not always be very appro-
priate for classification problems involving imbalanced class
TABLE VI: INTERESTINGNESS MEASURE OF FREQUENT SUBGRAPHS (support = 50%)
TABLE VII: CONFUSION MATRIX
distributions. When TN is much greater than TP, (FP + TN) is also much greater than (TP + FN). In such a case, the successfully predicted cases in the minority positive class play a role that is too insignificant when the average accuracy rate is determined, and the minority cases will be treated as noise even if they are supposed to be important. In order to overcome this problem, the true positive and false positive rates need to be monitored separately, using (9) and (10), when test data are being classified:

True positive rate = TP / (TP + FN)   (9)

False positive rate = FP / (FP + TN).   (10)
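Equations (8) to (10) can be computed directly from the four confusion-matrix entries. A small sketch (the example counts are hypothetical) also shows how the average accuracy can look high while the minority class fares poorly:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Average accuracy (8), true positive rate (9), and false positive rate (10)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn)   # recall / sensitivity
    fpr = fp / (fp + tn)   # false alarm rate
    return accuracy, tpr, fpr

# Imbalanced example: 8 positive vs. 92 negative cases
acc, tpr, fpr = confusion_metrics(tp=5, tn=90, fp=2, fn=3)
print(acc, tpr, fpr)  # accuracy 0.95 despite a true positive rate of only 0.625
```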
These rates measure the performance of a classifier for each class, and the objective is to keep the true positive rate as high as possible and the false positive rate as low as possible. The true positive rate is sometimes called recall or sensitivity, and the false positive rate is called the false alarm rate. In order to transform this multiobjective problem into a single-objective equivalent, receiver operating characteristic (ROC) analysis [31] has been proposed and is becoming more and more popular when the training data sizes for the different classes are very different. With ROC analysis, the true positive rate is plotted along the y-axis against the false positive rate along the x-axis to form a ROC curve, and the objective is to maximize the AUC, the area under the ROC curve. The value of the AUC is always between 0.0 and 1.0. An area of 1 represents a perfect classification, whereas an area of 0.5 represents a worthless classification equivalent to a random guess in a two-class classification problem. The AUC reflects the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. In this paper, as the datasets that we use differ significantly in class sizes, we use the AUC to evaluate the performance of the different classifiers on the different datasets.
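Because of this probabilistic interpretation, the AUC can be computed without plotting the curve, by comparing every positive-negative score pair (ties counting one half). A sketch, with made-up scores:

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance is
    ranked above a randomly chosen negative one; ties contribute one half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# One positive (0.4) is outranked by one negative (0.7): 8 of 9 pairs are correct
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9, about 0.889
```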
B. Datasets
The first dataset is a set of binary-class artificial data generated with GraphGen [32]. The artificial datasets are generated with a set of parameters: 1) the total number of transactions (-ngraphs); 2) the average size of each graph (-size); 3) the number of unique node labels (-nnodel); 4) the number of unique edge labels (-nedgel); 5) the average density of each graph (-density); 6) the number of unique edges in the whole dataset (-nedges); and 7) the average edge ratio of each graph (-edger). Parameters 1, 4, 5, 6, and 7 are fixed to 5000, 10, 0.3, 100, and 0.2, respectively, and we vary the remaining parameters to generate four datasets, as given in Table VIII, with properties given next.
The second dataset is collected from the Predictive Toxicology Challenge (PTC) [33] and contains the carcinogenicity of 417 chemical compounds on four types of rodents: male rats (MR), female rats (FR), male mice (MM), and female mice (FM). Each of these datasets can be considered as consisting of two classes of data [39]: those with positive evidence of cancerous growth and those with negative evidence.

TABLE VIII: ARTIFICIAL DATASETS WITH DIFFERENT PARAMETERS
TABLE IX: PROPERTIES OF THE EXPERIMENTAL DATASETS
TABLE X: CLASSIFICATION PERFORMANCE FOR FSG AND MISMOC
The third dataset is collected from the Estrogen Receptor Binding (NCTR ER) Database in the Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network of the National Center for Toxicological Research [34]. The database covers most known estrogenic classes and is a structurally diverse set of estrogens. The NCTR ER database consists of 224 chemical compounds, each classified as active or inactive with respect to the attribute ActivityOutcome_NCTRER. A compound is active if its measured activity is strong, medium, or weak; it is inactive if there is no activity for that compound. The properties of the datasets used in our experiments are listed in Table IX.
C. Performance Analysis
For performance comparison, we first tested all datasets using the two algorithms FSG and gSpan, and then compared their performance when MISMOC is used. Tables X and XI show the performance of the different algorithms on the different datasets. For easier comparison, we use a single misclassification cost value of 3.0, as suggested in [38], for the SVM classifier.
TABLE XI: CLASSIFICATION PERFORMANCE FOR gSPAN AND MISMOC
For our experiments, as a high threshold may result in too few and a low threshold in too many of the frequently occurring subgraphs being discovered, and as the support threshold affects the runtime and memory consumption [43], [44], we tried different support thresholds ranging from 90% to 2% and settled at 3% for the artificial datasets, 5% for the PTC dataset, and 10% for the NCTR ER dataset for both the experiments with FSG and gSpan. These settings allow us to obtain a good number of subgraphs (i.e., 50 ≤ n ≤ 500) for the identification of the interesting ones.
Given these settings of the support thresholds, the average AUC for each algorithm is determined and shown in the tables. From these results, we can see that the classification performance (average AUC) of FSG and gSpan is similar: the average AUC values are 0.683 and 0.691, respectively. After applying MISMOC to these frequent-subgraph discovery algorithms, their average AUC improved by 14.44% and 14.05%, respectively.
These results show that the performance of FSG and gSpan can be improved with the two-phase approach that MISMOC adopts. The subgraphs discovered by many graph-mining algorithms may appear frequently in a class, but they may not uniquely represent that class, while subgraphs that do not appear very frequently can play an important role in discriminating one class from another. With MISMOC, the relative frequency of each subgraph is considered, and how useful it is for classification is determined with a measure. The measure is then used when a graph is classified. This makes MISMOC more effective as a graph-classification algorithm.
The datasets D1 to D4 are the artificial datasets with varied graph sample sizes and numbers of unique node labels. When the number of unique node labels is increased from 5 to 10, we can see that the classification performance is higher for D2, with more unique node labels, than for D1, with fewer; the case is the same for D3 and D4. The reason is that there are fewer possible combinations among the discovered frequent subgraphs when the number of unique node labels is small. For example, if there are only two node labels v1 and v2 in the dataset, there are only three combinations (v1-v1, v1-v2, and v2-v2) for a graph with two vertices and one edge; if there are five node labels, there are 15 combinations. With fewer unique node labels, many frequent subgraphs will be the same for both the positive and negative classes. These frequent subgraphs are uninteresting and not useful in discriminating graph samples into different classes. With MISMOC, we can filter out these uninteresting frequent subgraphs to increase the classification performance. Hence, we can observe from the results that the average AUC of D1 is lower than that of D2, and that the AUC increases more significantly in D1 than in D2 after applying MISMOC. When the graph sample size is increased from 10 to 30, we can see that the classification performance is lower for D4, with a larger graph size, than for D2, with a smaller one; the case is the same for D1 and D3. The reason is that a large graph will contain more noise than a small graph, as the interesting subgraph(s) usually form only a small part of a graph. From the results, we can see that the average AUC of D4 is lower than that of D2, and that MISMOC helps to remove these noisy frequent subgraphs and increases the AUC more significantly in D4 than in D2, as the graph size in D4 is larger.
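The combination counts quoted above (3 for two labels, 15 for five) are combinations with repetition of the two endpoint labels of an edge, n(n + 1)/2; a quick check:

```python
def edge_label_combinations(n_labels):
    """Number of distinct unordered endpoint-label pairs for a single edge:
    combinations with repetition of n labels taken 2 at a time, n(n + 1)/2."""
    return n_labels * (n_labels + 1) // 2

print(edge_label_combinations(2), edge_label_combinations(5))  # 3 15
```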
The PTC dataset contains four subsets: MR, FR, MM, and FM. The average AUC of FM is the highest and that of FR is the lowest. This may be due to the percentage of positive samples in FM (38.1%) being higher than that in FR (31.1%). The overall AUC for the PTC dataset is 0.58 when applying FSG and gSpan, and this value increases to 0.63 with MISMOC. The overall AUC is still relatively low even when MISMOC is used, and this may be due to some structural features in the test set not being present in the training set. This is the main reason that the classification performance is quite low. This phenomenon is also mentioned in the evaluation report of [33].
The NCTR ER dataset has the highest AUC among the experimental datasets. The average AUC for FSG and gSpan is 0.844, and this is increased to 0.939 with MISMOC. This means that the ER compounds contain distinguishing structures for the active and inactive classes. The discovered interesting frequent subgraphs can be used to characterize a class of estrogens as well as to discriminate it from other classes. From the percentage improvement in AUC, we can observe that the noisy and uninteresting frequent subgraphs are effectively ignored by MISMOC, and that the AUC is maximized when it is used with FSG and gSpan.
VII. CONCLUSION
In this paper, we introduced a new graph-mining technique called MISMOC to discover interesting frequent subgraphs from graph databases. It was evaluated with both artificial and real datasets, and the experimental results show that MISMOC can work very well with large and complex datasets and can improve the classification performance of existing graph-mining algorithms.
The frequent subgraphs of real biological datasets usually
contain many common vertices [e.g., carbon (C) and oxygen
(O)] and edges (e.g., single hydrogen bond). For this reason,
both positive and negative samples may contain the same set
of frequent subgraphs. The frequent subgraphs discovered by
existing graph-mining algorithms may, therefore, not be very
useful for molecular classification. MISMOC is able to achieve
a higher accuracy as it aims to discover interesting subgraphs that do not just occur more frequently but also allow graph classes to be better discriminated from one another. MISMOC can also better handle the problem of having too many frequent subgraphs when support thresholds are lowered. Like other graph-mining algorithms, the size and number of graphs that MISMOC can handle can be very large, limited mainly by the computing hardware.
The next version of MISMOC will include an algorithm that can discover interesting subgraphs that may not occur frequently enough; it will not rely on a frequent-subgraph-mining algorithm in the first phase. To facilitate understanding, it will also try to better identify graphs that are maximal and less fragmented. In addition, it will represent graphs in a more flexible structure so that graphs that are similar can be represented by the same subgraph. The next release of MISMOC is also expected to take into consideration topological indexes of the discovered structures so as to allow graph classes to be distinguished more easily from each other.
REFERENCES
[1] D. Conklin, S. Fortier, and J. Glasgow, "Knowledge discovery in molecular databases," IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 985–987, Dec. 1993.
[2] T. Barrett, T. O. Suzek, D. B. Troup, S. E. Wilhite, W. C. Ngau, P. Ledoux, D. Rudnev, A. E. Lash, W. Fujibuchi, and R. Edgar, "NCBI GEO: Mining millions of expression profiles – Database and tools," Nucleic Acids Res., vol. 33, pp. D562–D566, 2005.
[3] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni, "MINT: A molecular INTeraction database," FEBS Lett., vol. 513, no. 1, pp. 135–140, 2002.
[4] K. Arnold, L. Bordoli, J. Kopp, and T. Schwede, "The SWISS-MODEL Workspace: A web-based environment for protein structure homology modeling," Bioinformatics, vol. 22, pp. 195–201, 2006.
[5] L. Holm, C. Ouzounis, C. Sander, G. Tuparev, and G. Vriend, "A database of protein structure families with common folding motifs," Protein Sci., vol. 1, pp. 1691–1698, 1992.
[6] M. Ebeling and S. Suhai, "Molecular databases on the internet," J. Mol. Med., vol. 75, pp. 620–623, 1997.
[7] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 714–735, May 1997.
[8] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney, "Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings," Adv. Drug Del. Rev., vol. 46, pp. 3–26, 2001.
[9] L. A. Mirny and E. I. Shakhnovich, "Universally conserved positions in protein folds: Reading evolutionary signals about stability, folding kinetics and function," J. Mol. Biol., vol. 291, no. 1, pp. 177–196, 1999.
[10] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and D. Pinkel, "Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors," Science, vol. 258, no. 5083, pp. 818–821, 1992.
[11] M. G. Dunlop, S. M. Farrington, A. D. Carothers, A. H. Wyllie, L. Sharp, J. Burn, B. Liu, K. W. Kinzler, and B. Vogelstein, "Cancer risk associated with germline DNA mismatch repair gene mutations," Hum. Mol. Genet., vol. 6, pp. 105–110, 1997.
[12] L. Nakhleh, T. Warnow, C. R. Linder, and K. St. John, "Reconstructing reticulate evolution in species – Theory and practice," J. Comput. Biol., vol. 12, no. 6, pp. 796–811, 2005.
[13] J. A. Bondy, Graph Theory With Applications. New York: Elsevier, 1976.
[14] Y. Yoshida, Y. Ohta, K. Kobayashi, and N. Yugami, "Mining interesting patterns using estimated frequencies from subpatterns and superpatterns," Lecture Notes in Computer Science, vol. 2843, pp. 494–501, 2003.
[15] L. B. Holder, D. J. Cook, and S. Djoko, "Substructure discovery in the SUBDUE system," in Proc. AAAI Workshop Knowl. Discov. Databases, 1994, pp. 169–180.
[16] C. Borgelt and M. R. Berthold, "Mining molecular fragments: Finding relevant substructures of molecules," in Proc. 2nd IEEE Int. Conf. Data Mining (ICDM), 2002, pp. 51–58.
[17] R. D. King, A. Srinivasan, and L. Dehaspe, "Warmr: A data mining tool for chemical data," J. Comput.-Aided Mol. Des., vol. 15, no. 2, pp. 173–181, 2001.
[18] M. Kuramochi and G. Karypis, "Frequent sub-graph discovery," in Proc. 1st IEEE Int. Conf. Data Mining (ICDM), 2001, pp. 313–320.
[19] X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining," in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 721–724.
[20] I. Fischer and T. Meinl, "Graph-based molecular data mining – An overview," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2004, vol. 5, pp. 4578–4582.
[21] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, "Frequent substructure-based approaches for classifying chemical compounds," IEEE Trans. Knowl. Data Eng., vol. 17, no. 8, pp. 1036–1050, Aug. 2005.
[22] S. H. Muggleton, "Inductive logic programming," N. Gen. Comput., vol. 8, no. 4, pp. 295–318, 1991.
[23] A. Inokuchi, T. Washio, and H. Motoda, "An apriori-based algorithm for mining frequent substructures from graph data," in Proc. 4th Eur. Conf. Principles Pract. Knowl. Discov. Databases (PKDD), 2000, pp. 13–23.
[24] K. C. C. Chan, A. K. C. Wong, and D. K. Y. Chiu, "Learning sequential patterns for probabilistic inductive prediction," IEEE Trans. Syst., Man Cybern., vol. 24, no. 10, pp. 1532–1547, Oct. 1994.
[25] K. C. C. Chan and A. K. C. Wong, "APACS: A system for automated pattern analysis and classification," Comput. Intell.: Int. J., vol. 6, pp. 119–131, 1990.
[26] P. C. H. Ma and K. C. C. Chan, "UPSEC: An algorithm for classifying unaligned protein sequences into functional families," J. Comput. Biol., vol. 15, no. 4, pp. 431–443, 2008.
[27] FSG, Karypis Lab, version 1.0.1. (2003). [Online]. Available: http://www-users.cs.umn.edu/karypis/pafi
[28] gSpan, Illimine, version 1.1.1. (2006). [Online]. Available: http://illimine.cs.uiuc.edu/download/index.php
[29] C. C. Chang and C. J. Lin. (2001). LIBSVM: A library for support vector machines. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvm
[30] S. Daskalaki, I. Kopanas, and N. Avouris, "Evaluation of classifiers for an uneven class distribution problem," Appl. Artif. Intell., vol. 20, no. 5, pp. 381–417, 2006.
[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recogn. Lett., vol. 27, pp. 861–874, 2006.
[32] J. Cheng, Y. Ke, and W. Ng. (2006). GraphGen: A graph synthetic generator. [Online]. Available: http://www.cse.ust.hk/graphgen/
[33] A. Srinivasan, R. D. King, S. H. Muggleton, and M. Sternberg, "The predictive toxicology evaluation challenge," presented at the 15th IJCAI, Los Angeles, CA, 1997.
[34] W. Tong, H. Fang, C. R. Williams, J. M. Burch, and A. M. Richard. (2008). DSSTox FDA National Center for Toxicological Research Estrogen Receptor Binding Database (NCTRER): SDF files and website documentation, NCTRER_v4b_232_15Feb2008. [Online]. Available: www.epa.gov/ncct/dsstox/sdf_nctrer.html
[35] J. Devillers and A. T. Balaban, Topological Indices and Related Descriptors in QSAR and QSPR. Boca Raton, FL: CRC Press, 1999.
[36] R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg, "Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming," Proc. Nat. Acad. Sci., vol. 93, pp. 438–442, 1996.
[37] A. Srinivasan and R. King, "Feature construction with inductive logic programming: A study of quantitative predictions of biological activity aided by structural attributes," J. Knowl. Discov. Data Mining, vol. 3, pp. 37–57, 1999.
[38] M. Deshpande and G. Karypis, "Automated approaches for classifying structure," in Proc. 2nd ACM SIGKDD Workshop Data Mining Bioinf., 2002, pp. 11–18.
[39] S. Menchetti, F. Costa, and P. Frasconi, "Weighted decomposition kernels," in Proc. 22nd Int. Conf. Mach. Learning, Bonn, Germany, 2005, pp. 585–592.
[40] T. Meinl, C. Borgelt, and M. R. Berthold, "Discriminative closed fragment mining and perfect extensions in MoFa," in Proc. 2nd Starting AI Res. Symp. (STAIRS), Valencia, Spain, 2004, pp. 3–14.
[41] C. Borgelt, H. Hofer, and M. Berthold, "Finding discriminative molecular fragments," presented at the Workshop Inf. Mining Navigat. Large Heterogen. Spaces Multimedia Inf., German Conf. Artif. Intell., Hamburg, Germany, 2003.
[42] S. Nijssen and J. N. Kok, "Frequent graph mining and its application to molecular databases," in Proc. IEEE Conf. Syst., Man Cybern. (SMC), W. Thissen, P. Wieringa, M. Pantic, and M. Ludema, Eds. Den Haag, The Netherlands, 2004, pp. 4571–4577.
[43] M. Worlein, T. Meinl, I. Fischer, and M. Philippsen, "A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston," in Proc. 9th Eur. Conf. Principles Pract. Knowl. Discov. Databases (PKDD), Porto, Portugal (Lecture Notes in Computer Science), A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, Eds. Berlin, Germany: Springer-Verlag, 2005, pp. 392–403.
[44] S. Nijssen and J. N. Kok, "A quickstart in frequent structure mining can make a difference," in Proc. Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 647–652.
[45] R. Chittimoori, L. B. Holder, and D. J. Cook, "Applying the subdue substructure discovery system to the chemical toxicity domain," presented at the AAAI Spring Symp. Predictive Toxicol. Chem.: Exp. Impact AI Tools, Menlo Park, CA, 1999.
Winnie W. M. Lam received the B.Sc. (Hons.) degree in information technology from Hong Kong Polytechnic University, Hung Hom, Hong Kong. She is currently working toward the Ph.D. degree in the Department of Computing, Hong Kong Polytechnic University.
She has been involved in several large-scale commercial projects, including the ESDlife electronic system of the Government of the Hong Kong Special Administrative Region (HKSAR), the system migration project in the Hong Kong Exchanges and Clearing Limited, the data mining development in the Kowloon-Canton Railway Corporation and Immigration Department, and the consultancy project in SPSS Inc. Her research interests include data mining, bioinformatics, and artificial intelligence.
Keith C. C. Chan (M'94) received the B.Math. degree in computer science and statistics, and the M.A.Sc. and Ph.D. degrees in systems design engineering from the University of Waterloo, ON, Canada, in 1984, 1985, and 1989, respectively.
He joined the IBM Canada Laboratory as a Senior Analyst and was involved in the design and development of image and multimedia, and software engineering tools. In 1993, he joined as an Associate Professor in the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON. In 1994, he joined the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, where he is currently a Professor. He has been a consultant for various companies in other parts of Asia and Europe. His research interests include data mining, bioinformatics, software engineering, and pervasive computing.