
Post on 20-Dec-2015


Page 1: Discovering Substructures in Chemical Toxicity Domain Masters Project Defense by Ravindra Nath Chittimoori Committee: DR. Lawrence B. Holder, DR. Diane

Discovering Substructures in Chemical Toxicity Domain

Masters Project Defense

by

Ravindra Nath Chittimoori

Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson

Department of Computer Science and Engineering

University of Texas at Arlington

Page 2

Outline

Chemical Toxicity Database
Motivation and Goal
Knowledge Discovery in Databases (KDD)
SUBDUE Knowledge Discovery System
Experiments with Unsupervised SUBDUE
Experiments with Supervised SUBDUE
Discussion of Results
Conclusions
Future Work

Page 3

Chemical Toxicity Database

Carcinogenesis Prediction Problem

Toxicology Evaluation Challenge

Domain:

Compounds            +     -   Total
Training set       162   136     298
Experimental set    27    25      69

Page 4

Motivation and Goal

Ever-increasing number of chemical compounds

Analysis is needed to obtain the structure-activity relationships of a compound

Determine SUBDUE’s applicability to the chemical toxicity domain

Page 5

Knowledge Discovery in Databases (KDD)

Process of identifying valid, novel, potentially useful and understandable patterns in data

Goal of Knowledge Discovery:
-- Verification
-- Discovery

Data mining methods

Model Representation, Evaluation and Search

Page 6

Steps in KDD

Identify the goal of the process
Collect, create and prepare the dataset
Select the data mining method
Select the data mining algorithm
Transform the data
Execute the algorithm
Interpret/evaluate the discovered patterns
Consolidate the knowledge discovered

Page 7

SUBDUE Knowledge Discovery System

SUBDUE discovers patterns [substructures] in structural data sets

[Figure: example graph with "object" vertices that have "shape" edges to "triangle" and "square" and are joined by an "on" edge; the figure marks 4 instances of the substructure]

Vertices: objects or attributes
Edges: relationships

Page 8

SUBDUE - Input Representation

Each atom is represented as a vertex with directed edges to the name, type and the partial charge of the atom

Bonds are represented as undirected edges

Each group is represented as a vertex having a string label specifying the group name with directed edges to all participating atom vertices
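
As a rough illustration of the representation described above, the following Python sketch builds a small labeled graph for two bonded carbon atoms that belong to a methyl group. The `Graph` class, the helper names, and the edge labels `n`, `t`, `p`, `gr` and `1` are assumptions made for this example (the short labels follow the legends on the example slides); this is not SUBDUE's actual input file format.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Minimal labeled graph: labeled vertices, labeled edges with a direction flag."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (src, dst, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")
    g.add_edge(atom, g.add_vertex(atom_type), "t")
    g.add_edge(atom, g.add_vertex(charge), "p")
    return atom


def add_group(g, group_name, atoms):
    """Group vertex with directed edges to every participating atom vertex."""
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


g = Graph()
c1 = add_atom(g, "c", 10, 0.062)
c2 = add_atom(g, "c", 10, 0.063)
g.add_edge(c1, c2, "1", directed=False)  # bond: undirected edge (label assumed)
methyl = add_group(g, "methyl", [c1, c2])

print(len(g.vertices), "vertices,", len(g.edges), "edges")
```

Keeping the name, type and charge as separate vertices, rather than as fields on the atom, is what lets a substructure mention only some of them.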

Page 9

SUBDUE - Input Representation

Representation used in Unsupervised SUBDUE
-- A vertex having a string label specifying the alert with directed edges to all the atoms in the compound

Representation used in Supervised SUBDUE
-- A vertex for all the compounds with string label compound
-- The compound vertex has directed edges to all the vertices representing the activity of an alert on a compound
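
The difference between the two representations can be sketched with plain edge lists. This is an illustration under assumptions: the function names, the tuple layout `(source, target, edge label)`, and the choice to put the outcome on the edge label in the unsupervised case and the alert name on the edge label in the supervised case are guesses based on the example slides, not a confirmed encoding.

```python
def unsupervised_edges(alert, outcome, atom_ids):
    """Alert vertex pointing at every atom of the compound (outcome on the edge label)."""
    return [(alert, atom, outcome) for atom in atom_ids]


def supervised_edges(alert_outcomes, atom_ids):
    """One 'compound' vertex pointing at its atoms and at one outcome vertex per alert."""
    edges = [("compound", atom, "contains") for atom in atom_ids]
    edges += [("compound", outcome, alert) for alert, outcome in alert_outcomes.items()]
    return edges


atoms = ["atom1", "atom2"]  # hypothetical atom vertex ids
print(unsupervised_edges("ames", "po", atoms))
print(supervised_edges({"ames": "positive"}, atoms))
```

In the supervised form the number of alert edges no longer grows with the size of the compound, since each alert is attached once to the compound vertex.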

Page 10

Unsupervised SUBDUE Input Representation Example

[Figure: graph example. Two Atom vertices (n: C, t: 10, p: 0.062 and 0.063), a Methyl group vertex with gr edges, an Ames vertex with po edges, and an edge labeled 1]

n - Name
t - Type
p - Partial charge
po - Positive
gr - group

Page 11

Supervised SUBDUE Input Representation Example

[Figure: graph example. Two Atom vertices (n: C, t: 10, p: 0.062 and 0.063), a Methyl group vertex with gr edges, an edge labeled 1, and a Com vertex with contains edges, plus the labels Ames and Positive]

n - Name
t - Type
p - Partial charge
gr - group
Com - Compound

Page 12

SUBDUE - Model Evaluation

Minimum Description Length Principle
-- Best theory to describe any graph
-- Minimize I(S) + I(G|S)

Graph Compression
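
To make the minimized sum concrete, here is a minimal sketch in which I(S) is the number of bits needed to describe a substructure S and the second term is the number of bits needed to describe the graph G after every instance of S has been collapsed into a single vertex. The bit-counting formula is a deliberately crude assumption chosen for readability; it is not SUBDUE's actual encoding, and the graph sizes in the example are made up.

```python
from math import log2


def description_length(n_vertices, n_edges, n_labels):
    """Crude bit count: one label per vertex, two endpoints plus one label per edge."""
    vertex_bits = n_vertices * log2(n_labels)
    edge_bits = n_edges * (2 * log2(n_vertices) + log2(n_labels))
    return vertex_bits + edge_bits


def compressed_length(g_vertices, g_edges, n_labels, s_vertices, s_edges, instances):
    """I(S) + I(G|S) when each non-overlapping instance of S becomes one new vertex."""
    i_s = description_length(s_vertices, s_edges, n_labels)
    rest_vertices = g_vertices - instances * (s_vertices - 1)
    rest_edges = g_edges - instances * s_edges
    # One extra label is needed for the vertex that stands in for S.
    i_g_given_s = description_length(rest_vertices, rest_edges, n_labels + 1)
    return i_s + i_g_given_s


# Made-up sizes: 40 vertices, 50 edges, 6 labels; S has 4 vertices, 3 edges, 8 instances.
original = description_length(40, 50, 6)
with_s = compressed_length(40, 50, 6, s_vertices=4, s_edges=3, instances=8)
print(f"I(G) = {original:.1f} bits, I(S) + I(G|S) = {with_s:.1f} bits")
```

A substructure that occurs often and is reasonably large shortens the total description, which is why compression can serve as the evaluation measure.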

Page 13

Other important Concepts of SUBDUE

Inexact Graph Match Approach

Concept Learning

Predefined Substructures

Page 14

Unsupervised SUBDUE - Methodology

Training set further divided

3 approaches to determine carcinogenicity of compounds in experimental set
-- Apply SUBDUE individually to the compounds
-- Inclusion of pre-defined substructures
-- Check for matching of substructure in the compound to be classified
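
The third approach, checking whether a discovered substructure occurs in the compound to be classified, amounts to a subgraph match. The sketch below is a brute-force exact matcher for small directed labeled graphs; the function name and the data layout are assumptions for illustration, and SUBDUE itself relies on an inexact graph match rather than this exhaustive search.

```python
from itertools import permutations


def contains_substructure(graph_labels, graph_edges, sub_labels, sub_edges):
    """Try every assignment of substructure vertices to distinct graph vertices."""
    g_ids = list(graph_labels)
    s_ids = list(sub_labels)
    edge_set = set(graph_edges)
    for candidate in permutations(g_ids, len(s_ids)):
        mapping = dict(zip(s_ids, candidate))
        # Vertex labels must agree under the mapping.
        if any(sub_labels[s] != graph_labels[mapping[s]] for s in s_ids):
            continue
        # Every substructure edge must exist in the graph with the same label.
        if all((mapping[a], mapping[b], lab) in edge_set for a, b, lab in sub_edges):
            return True
    return False


# Hypothetical compound fragment: a carbon atom of type 10 and a bromine atom.
compound_labels = {1: "atom", 2: "c", 3: "10", 4: "atom", 5: "br"}
compound_edges = [(1, 2, "n"), (1, 3, "t"), (4, 5, "n")]

bromine = ({"a": "atom", "b": "br"}, [("a", "b", "n")])
chlorine = ({"a": "atom", "b": "cl"}, [("a", "b", "n")])

print(contains_substructure(compound_labels, compound_edges, *bromine))
print(contains_substructure(compound_labels, compound_edges, *chlorine))
```

The search is factorial in the number of vertices, so it is only usable on toy inputs like this one.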

Page 15

Unsupervised SUBDUE - Results

[Figure: discovered substructure with two atom vertices, one with n: c, t: 10, p: 0.062 and one with n: br, p: 0.057; the labels 1 and 3 also appear]

Third approach used to classify compounds in experimental set

Accuracy Level -> 0.322

Cyanate & ether groups are also discovered to be indicators of carcinogenic activity

Page 16

Supervised SUBDUE - Methodology

Create set of indicators of carcinogenic activity

Create set of indicators of noncarcinogenic activity

Calculate value of substructures discovered in carcinogenic and noncarcinogenic set

Select a set of substructures to be used in classifying compounds in experimental set

Page 17

Supervised SUBDUE - Methodology

Check for the existence of these substructures in the compound to be classified

Calculate the Carcinogenic Activity Value of the compound

Calculate the NonCarcinogenic Activity Value of the compound

Determine the activity of the compound
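
One possible reading of these steps is sketched below. The slides do not say how the two activity values are computed or how ties are handled, so summing the values of the matched substructures and comparing the two sums is purely an assumption; the substructure names and values are placeholders, not results from the project.

```python
def classify(present, carcinogenic_values, noncarcinogenic_values):
    """Compare summed values of the matched substructures from each indicator set."""
    pos = sum(v for name, v in carcinogenic_values.items() if name in present)
    neg = sum(v for name, v in noncarcinogenic_values.items() if name in present)
    if pos == neg:
        return "undetermined", pos, neg
    return ("carcinogenic" if pos > neg else "noncarcinogenic"), pos, neg


# Placeholder substructure values (hypothetical, for illustration only).
carcinogenic_values = {"halide10": 1.2, "ames_positive": 1.5}
noncarcinogenic_values = {"alkyl_halide": 1.1}

# Substructures found in the compound to be classified.
present = {"ames_positive", "alkyl_halide"}
print(classify(present, carcinogenic_values, noncarcinogenic_values))
```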

Page 18

Supervised SUBDUE - Results

A set of 12 substructures discovered by SUBDUE is used to classify compounds in the experimental set

6 substructures from the carcinogenic set include substructures which form part of groups like amino, di10, methyl, ether and halide10, and a substructure which indicates a compound testing positive on AMES, Salmonella, etc.

6 substructures from the noncarcinogenic set include substructures which form part of groups like methoxy, Ar_Halide, di64, nitro and alkyl_halide, and a substructure which indicates a compound testing negative on AMES, Salmonella, etc.

Page 19

Supervised SUBDUE - Substructure Example - Carcinogenic Set

[Figure: Compound vertex connected to three positive vertices, with the labels Ames, Salmonella and Salmonella_n]

Page 20

Supervised SUBDUE - Substructure Example - Carcinogenic Set

[Figure: two Atom vertices named Cl and C with n, t and p edges (values shown: -0.024, -0.123, 93 and 10), connected by gr edges to a Halide10 group vertex]

n - Name
t - Type
p - Partial charge
gr - group

Page 21

Supervised SUBDUE - Substructure Example - NonCarcinogenic Set

[Figure: Compound vertex connected to three negative vertices, with the labels Ames, Salmonella and Cytogen_ca]

Page 22

Supervised SUBDUE - Substructure Example - NonCarcinogenic Set

[Figure: two Atom vertices named Cl and C with n, t and p edges (values shown: 0.477, -0.124, 93 and 10), connected by gr edges to an A-H group vertex]

n - Name
t - Type
p - Partial charge
gr - group
A-H - Alkyl Halide

Page 23

Supervised SUBDUE - Results

PTE-1 Results:

Compounds               +     -   Total
PTE-1                  20    19      39
Correct Prediction     12     6      18
Incorrect Prediction    8    13      21

Accuracy: 0.6 (+), 0.315 (-), 0.462 (total)

Page 24

Supervised SUBDUE - Results

PTE-2 Results:

Compounds               +     -   Total
PTE-2                   7     6      13 *
Correct Prediction      4     3       7
Incorrect Prediction    3     3       6

* : # of compounds whose activity is known

Accuracy: 0.572 (+), 0.5 (-), 0.538 (total)
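
The accuracy figures for PTE-1 and PTE-2 follow from the counts in the two tables as correct predictions divided by the number of compounds in each class. A short check, with the function name chosen for this example; the computed ratios agree with the values on the slides up to rounding in the third decimal place.

```python
def accuracies(correct_pos, total_pos, correct_neg, total_neg):
    """Per-class and overall accuracy as correct / total."""
    return (
        correct_pos / total_pos,
        correct_neg / total_neg,
        (correct_pos + correct_neg) / (total_pos + total_neg),
    )


pte1 = accuracies(12, 20, 6, 19)
pte2 = accuracies(4, 7, 3, 6)
print("PTE-1:", [round(x, 3) for x in pte1])  # [0.6, 0.316, 0.462]
print("PTE-2:", [round(x, 3) for x in pte2])  # [0.571, 0.5, 0.538]
```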

Page 25

Results - Discussion

Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity

Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity

ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)

Ashby, TOPKAT are other toxicity prediction methods

Page 26

Conclusions

Consistent with results obtained by logic based systems like PROGOL

Prefer to use Concept Learner when positive and negative examples of target concept available

SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in chemical toxicity domain.

Page 27

Future Work

PTE-3 Evaluation Challenge

Trimmed Data Sets (Partial Charge)

Newer Version of Concept Learning SUBDUE being developed

Page 28

Reference

http://cygnus.uta.edu/subdue