learning issues in drug discovery joe verducci ohio state university snowbird, june 2003
![Page 1: Learning Issues in Drug Discovery Joe Verducci Ohio State University Snowbird, June 2003](https://reader035.vdocuments.mx/reader035/viewer/2022081506/56649ee85503460f94bf9183/html5/thumbnails/1.jpg)
Learning Issues in
Drug Discovery
Joe Verducci Ohio State University
Snowbird, June 2003
The Basic Learning Problem
• Given a training set of biologically active and inactive chemical compounds, develop a classification rule based on the structural features of the compounds.
• Activity is determined from bioassays; for example, it might be the ability of a compound to inhibit the growth of a specific type of cancer cell.
• Structural features are coded as long binary strings (up to 30K bits); each bit indicates the presence of a basic molecular descriptor.
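The binary coding described above can be sketched in a few lines of Python. The descriptor vocabulary and the `encode` helper are hypothetical names chosen for illustration; a real vocabulary would be far longer.

```python
from typing import Iterable, List

# Hypothetical descriptor vocabulary (assumption); real ones run to ~30K entries.
DESCRIPTORS = ["benzene", "pyridine", "carbonyl", "hdonor", "alcohol"]


def encode(features: Iterable[str], vocabulary: List[str] = DESCRIPTORS) -> List[int]:
    """Return one bit per descriptor: 1 if the compound has it, else 0."""
    present = set(features)
    return [1 if name in present else 0 for name in vocabulary]


# A toy compound that contains a benzene ring and a carbonyl group.
fingerprint = encode({"benzene", "carbonyl"})
```

Each compound then becomes one row of a compounds-by-descriptors 0/1 matrix, which is the input assumed by the later sketches.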
Examples of Molecular Descriptors
[Figure: example structure drawings for five descriptor types: Benzenes, Heterocycles, Functional Groups, Pharmacophores, and Spacer groups.]
Outline of Issues
• How to choose an appropriate kernel?
  – Biological heuristics
  – Localization: use class membership in constructing kernels
• Identifying groups of similarly structured active compounds
  – Recursive Partitioning
  – Simulated Annealing
• Clustering chemical classes
  – COSA
  – Jaccard/Tanimoto metric
  – Relationships between features
    • Over different types of activity
    • Information from relational databases
    • Feature assembly
• How to choose molecules for the training set?
Biological Heuristics
• “Key” to receptors comprises up to 3 features.
• There may be several receptors.
• Features around a “key” may prevent its use.
• Physical properties of a compound may inhibit its approach to the receptor.
• Suggests weighted polynomial kernel.
• Suggests non-zero weights over several groupings of features.
• Gives an interpretation to negative weights (features that block a “key” or hinder approach to the receptor).
• Suggests that simple weightings apply only to similar types (“local” classes) of compounds.
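A minimal sketch of one possible weighted polynomial kernel on binary fingerprints follows. The slides do not give a formula, so the form `(1 + sum_k w_k x_k y_k) ** degree`, the function name, and the toy weights are all assumptions. Degree 3 is chosen because its expansion contains products of up to three features, matching the "key of up to 3 features" heuristic.

```python
from typing import Sequence


def weighted_poly_kernel(x: Sequence[int], y: Sequence[int],
                         weights: Sequence[float], degree: int = 3) -> float:
    """Assumed form: (1 + sum_k w_k * x_k * y_k) ** degree."""
    shared = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    return (1.0 + shared) ** degree


# Toy fingerprints and per-feature weights (illustrative values only).
x = [1, 0, 1, 1]
y = [1, 1, 1, 0]
w = [0.5, 1.0, 2.0, 1.0]
k_xy = weighted_poly_kernel(x, y, w)  # shared features 0 and 2: (1 + 2.5) ** 3
```

Only features present in both compounds contribute, so a per-feature weight controls how much a shared descriptor counts toward similarity.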
Discovery Goals beyond Classification
• Weightings should be interpretable (concentrated on only a few feature-combinations).
• If we know which features make members of a class of compounds active against one type of cell (cancer) and which features make members of this class inactive against another type (normal), it may be possible to design a new drug in that class with both sets of features.
• Understand how kernels adapt to classes.
Localization
• Structure-Activity Relationship (SAR)
  – about a 50-year history in chemistry
  – all analyses done using a small group of similar compounds
  – most analyses done with continuous variables (e.g. lipophilicity, BCUTS)
  – SVM methods now enable analyses with many binary variables
• How to identify relevant “small groups” from a large database?
  – Concentrate on pockets of active compounds
  – Concentrate on “natural” chemical classes
Clustering active groups
• Recursive Partitioning (RP)
  – Split database sequentially according to the feature that maximizes difference in mean activity and/or proportion of actives
• RP + Simulated Annealing (RPSA)
  – Stochastic search for combinations of features that approximately optimize split
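The single-feature split step of recursive partitioning can be sketched as follows. This is a simplified illustration, not the talk's implementation: it scores a split only by the absolute difference in mean activity and omits any significance test. The function name and toy data are assumptions.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def best_split(fingerprints: Sequence[Sequence[int]], activity: Sequence[float],
               min_size: int = 1) -> Optional[Tuple[int, float]]:
    """Return (feature index, gap in mean activity) for the best single split."""
    best = None
    for j in range(len(fingerprints[0])):
        with_j = [a for fp, a in zip(fingerprints, activity) if fp[j] == 1]
        without_j = [a for fp, a in zip(fingerprints, activity) if fp[j] == 0]
        if len(with_j) < min_size or len(without_j) < min_size:
            continue  # skip splits that leave a node too small
        gap = abs(mean(with_j) - mean(without_j))
        if best is None or gap > best[1]:
            best = (j, gap)
    return best


# Toy data: four compounds, three features, pGI50-like activities.
fps = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
act = [7.0, 6.5, 4.5, 4.0]
split = best_split(fps, act)  # feature 0 separates the two most active compounds
```

Applying `best_split` again to each resulting subset would grow the tree one level at a time.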
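Recursive Partitioning (RP) applied to LNS-H23 activity in NCI database
[Figure: RP tree; each split is labeled with a structural feature drawn as a chemical fragment (O, Ak, and Ak(cyc) groups).]
Node statistics, indented by tree depth:
Ave pGI50 = 4.47, Freq = 28,297
  Ave pGI50 = 4.44, Freq = 27,521
    Ave pGI50 = 4.92, Freq = 2,113
    Ave pGI50 = 4.4, Freq = 25,408
  Ave pGI50 = 5.36, Freq = 776
    Ave pGI50 = 7.08, Freq = 76
    Ave pGI50 = 5.17, Freq = 700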
RP Tree
RP parameters: max p-value = 0.01, min set size = 50
[Figure: full RP tree with numbered nodes, keyed by average activity. Legend (Ave. pGI50): > 7, 6 – 7, 5 – 6, < 5.]
Recursive Partitioning (RP)
Advantages
Useful for explaining complex, nonlinear response
Handles very large descriptor sets with continuous, discrete, or categorical variables
Handles very large data sets
Disadvantages
Optimizes only one variable at a time
Looks at few combinations of descriptors
Most terminal nodes involve many negative descriptors
Stochastic Tree Search
At each node, simulated annealing is used to find a combination of structural features
Control parameters:
• Number of features (descriptors)
• Minimum node size
• Maximum negative features
• Number of tree levels
Want to find local optima
Modification -- drop certain features in the process
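The node-level search can be illustrated with a small simulated-annealing sketch. It looks for a combination of `k` features whose joint presence defines a node of at least `min_size` compounds with high mean activity. The scoring rule, the linear cooling schedule, the penalty score of 0.0 for undersized nodes, and all names are assumptions made for illustration; negative features and multiple tree levels are not handled.

```python
import math
import random
from statistics import mean


def node_score(combo, fingerprints, activity, min_size):
    """Mean activity of compounds holding every feature in combo (0.0 if too few)."""
    members = [a for fp, a in zip(fingerprints, activity)
               if all(fp[j] == 1 for j in combo)]
    return mean(members) if len(members) >= min_size else 0.0


def anneal_features(fingerprints, activity, k=2, min_size=2,
                    steps=500, t0=1.0, seed=0):
    """Stochastic search over k-feature combinations; returns (best combo, score)."""
    rng = random.Random(seed)
    n_features = len(fingerprints[0])
    current = rng.sample(range(n_features), k)
    current_score = node_score(current, fingerprints, activity, min_size)
    best, best_score = sorted(current), current_score
    for step in range(steps):
        temperature = t0 * (1.0 - step / steps) + 1e-9  # linear cooling
        # Propose a neighbour: swap one chosen feature for an unused one.
        candidate = list(current)
        unused = [j for j in range(n_features) if j not in current]
        candidate[rng.randrange(k)] = rng.choice(unused)
        candidate_score = node_score(candidate, fingerprints, activity, min_size)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = sorted(current), current_score
    return best, best_score


# Toy data: compounds holding both feature 0 and feature 1 are the active ones.
FPS = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0],
       [0, 0, 1, 1], [1, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
ACT = [7.5, 7.2, 4.5, 4.8, 4.2, 7.4, 4.4, 4.6]
best_combo, best_mean = anneal_features(FPS, ACT)
```

Running the search from several seeds and keeping distinct results is one way to collect several local optima rather than a single global one.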
Stochastic Tree
RP/SA parameters: min set size = 50, number of features in combination = 2.
[Figure: stochastic tree with nodes 0 to 10, keyed by average activity. Legend (Ave. pGI50): > 7, 6 – 7, 5 – 6, < 5.]

| Node | Ave. pGI50 | Count | Features |
| --- | --- | --- | --- |
| 1 | 7.35 | 51 | oxetane, 3-oxy-; hdonor-path8-hdonor |
| 2 | 7.49 | 54 | benzene, 1-carbonyl, 4-(2-oxyethyl); hdonor-path8-pcharge |
| 3 | 7.11 | 53 | carbonyl, oxymethyl-; pyridine, 2-(alkenyl, cyc)- |
| 4 | 6.66 | 52 | oxepin, 3-oxymethyl-; alcohol, s-alkyl- |
| 5 | 7.6 | 60 | benzene, 1,3-dimethoxy-; cycloheptatriene, 1,3,5- |
Compound Classes
[Figure: chemical structures of Adriamycin (anthracyclines), Camptothecin, Actinomycin D (portion), Podophyllotoxin, Taxol, Colchicine, and Verrucarin.]
Clustering Active Compounds
[Figure: cluster tree of active compounds on a 0.0 to 1.0 scale, with representative structures for anthracyclines, acridines, Camptothecin, Taxol, Colchicine, and Verrucarin.]
Active Outliers
[Figure: structures of active outlier compounds, including (n-Bu)3PbCl, on a 0.0 to 1.0 scale.]
Clustering Easily Identified Chemical Classes
• Jaccard/Tanimoto metric
  – Most related to activity (Near Neighbor rules comparing metrics -- Peter Willett)
  – Discounts similarity based on common absence of structures
  – Previous clustering just used active compounds. Now use all compounds. This is needed to see if test compound is close to an inactive class.
• COSA
  – Friedman and Meulman (2002)
  – Weighs different features by (estimated) class to determine distances between objects in the same (estimated) class
  – Results not yet ready.
Tanimoto Coefficient
a = # bits on in A
b = # bits on in B
c = # bits on in both A and B
d = # bits off in both A and B
Tanimoto Coefficient (measures similarity using on bits):
$$T_1 = \frac{c}{a + b - c}$$
Tanimoto Coefficient Complement (measures similarity using off bits):
$$T_0 = \frac{d}{a + b - 2c + d}$$
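A short Python sketch of the two coefficients, using the counts a, b, c, d defined on this slide: the on-bit coefficient is c / (a + b - c) and the off-bit complement is d / (a + b - 2c + d). The function names are illustrative, and returning 1.0 when a denominator is zero (two all-off or two all-on strings) is an assumed convention.

```python
from typing import Sequence, Tuple


def bit_counts(A: Sequence[int], B: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): on in A, on in B, on in both, off in both."""
    a = sum(A)
    b = sum(B)
    c = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)
    d = sum(1 for x, y in zip(A, B) if x == 0 and y == 0)
    return a, b, c, d


def tanimoto_on(A: Sequence[int], B: Sequence[int]) -> float:
    """T1 = c / (a + b - c): shared on bits over bits on in either string."""
    a, b, c, _ = bit_counts(A, B)
    denom = a + b - c
    return c / denom if denom else 1.0  # assumed convention for all-off strings


def tanimoto_off(A: Sequence[int], B: Sequence[int]) -> float:
    """T0 = d / (a + b - 2c + d): shared off bits over bits off in either string."""
    a, b, c, d = bit_counts(A, B)
    denom = a + b - 2 * c + d
    return d / denom if denom else 1.0  # assumed convention for all-on strings


A = [1, 1, 0, 0, 1, 0]
B = [1, 0, 0, 1, 1, 0]
t1 = tanimoto_on(A, B)   # a=3, b=3, c=2: 2 / 4
t0 = tanimoto_off(A, B)  # d=2: 2 / (3 + 3 - 4 + 2)
```

Note that T0 is simply T1 computed on the bitwise complements of A and B.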
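R-Group Analysis of Colchicine Class
[Figure: four substituted variants of the colchicine scaffold, each drawn with its R-groups.]

| Substituents shown on scaffold | Compounds | Ave pGI50 |
| --- | --- | --- |
| OMe, MeO, R2O, SMe, ZR1 | 38 | 7.74 |
| OMe, MeO, R2O, OMe, NR1 | 23 | 6.94 |
| OH, HO, R2O, SMe, NR1 | 17 | 5.05 |
| OMe, OMe, MeO, R, SMe | 9 | 6.96 |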
Alternatives to R-Group Analysis
• Search all triplets of features present in the class
  – Get 7 categories for each triplet
  – Compute average activity in each category
  – Use ensemble prediction based on the best k triplets (with at most one feature in common).
• Preferred Explanatory Features
  – Assemble the basic structures into new features that could behave as R-groups
  – Do SVM using only these new features
Relationships Between Features
• Information from relational databases
  – Similar correlations with GI50 for several types of cancer cells
  – Similar correlations with expression levels for several (co-expressed) genes
• Feature assembly
  – Check if associated features are connected
  – If so, assemble (may be several ways)
  – Check if assembly can be connected to common scaffold
Conceptual Framework
[Figure: four linked data matrices.]
• Database S (Molecular Structure Features): 27,000 features by 4,463 compounds
• Database A (Activity Patterns): 4,463 compounds by 60 cell lines
• Database T (Molecular Targets): 60 cell lines by 3,748 genes
• SAT (Feature-Gene Correlation): 27,000 features by 3,748 genes
NCI Gene Expression Dataset
• Microarrays spotted with 9703 cDNA elements
  – mRNA isolated from NCI 60 cancer cell lines: Leukemia (6), Melanoma (7), Breast (8), Ovarian (6), CNS (6), Lung (9), Prostate (2), Colon (7), Kidney (8)
  – 12 cell lines used for reference pool
  – Fluorescence tagged during hybridization
• DNA elements are from Washington Univ. Merck IMAGE
  – ~3700 named genes
  – ~1,900 human homologues
  – 4104 ESTs
* Source: http://discover.nci.nih.gov; U. Scherf et al., Nature Genet., 2000, 24, 236–44.
Compounds Used in Study
• NCI 4,463 compounds tested 2 or more times
• Each compound tested at 5 concentrations, usually $10^{-4}$ M to $10^{-8}$ M
• Used growth inhibition (GI50) of compounds over NCI60 cell lines
Standardized Compound-activity vs Gene-expression*
[Figure: standardized profiles of Gene 486676 and Compound 661223 (vertical scale −2 to 3) plotted across cell lines, grouped by panel: Breast, CNS, Colon, Leukemia, Lung, Melanoma, Ovarian, Renal.]
* across NCI60 cell lines
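A compound-gene comparison of this kind can be sketched as a correlation of two standardized profiles across the same cell lines. The code below computes a plain Pearson correlation as the mean product of z-scores; the function names and the four-cell-line toy profiles are assumptions, and the talk's exact standardization may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Convert a profile to z-scores (population standard deviation)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def profile_correlation(activity: Sequence[float], expression: Sequence[float]) -> float:
    """Pearson correlation: mean product of the two standardized profiles."""
    za, ze = standardize(activity), standardize(expression)
    return sum(p * q for p, q in zip(za, ze)) / len(za)


# Toy profiles over four hypothetical cell lines.
compound_activity = [4.2, 5.0, 6.1, 7.3]
gene_expression = [0.1, 0.4, 0.9, 1.5]
r = profile_correlation(compound_activity, gene_expression)
```

A value near +1 means the compound is most active in the cell lines where the gene is most highly expressed; a value near −1 means the opposite.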
Compound-Gene Correlations
[Figure: structures of benzothiophenedione and indolonaphthoquinone.]
• benzothiophenedione: compound class correlated with melanoma gene Rab7
• indolonaphthoquinone: compound class correlated with leukemia gene CARS-cyp
Quinone-Gene Correlations*

| Class | Count | CARS-cyp | Rab7 |
| --- | --- | --- | --- |
| actinomycin | 12 | -1.36 | 1.69 |
| anthraquinone | 65 | 2.11 | -6.62 |
| aziridinylquinone | 11 | -3.76 | 0.44 |
| benzothiophenedione | 23 | -7.25 | 10.50 |
| indolonaphthoquinone | 20 | 5.75 | -2.03 |
| quinoneimine | 46 | -2.88 | 5.47 |

* values are z-scores of compound class-gene correlation
CARS-cyp: human Clk associated RS cyclophilin
Rab7: human small GTP binding protein
Additional Databases
• Chemical Compounds
  – Atoms
  – Structures
    • 2 dimensional
    • 3 dimensional
  – Physical Properties
• BioAssays
  – In vitro
  – In vivo
• Clinical Trials
  – Phase I
  – Phase II
  – Phase III
• Target Information
• Known Drugs
  – Responsive subpopulations
  – Adverse side effects
Uses of Macrostructures
• Discriminate for biological activity in a local neighborhood
• Cluster signatures: discriminate for membership in the cluster
• Provide scaffolds for R-group analysis
Macrostructure Assembly
Selected building blocks
[Figure: small structural fragments bearing MeO, MeS, S, N, and O groups.]
Assembling Macrostructures
[Figure: the selected building blocks (fragments bearing MeO, MeS, S, N, and O groups) joined into larger macrostructures.]
Higher Level Assembly
[Figure: macrostructures combined into higher-level assemblies containing MeO, OMe, S, N, O, and Me groups.]
R-Group Analysis
Designing a Training Set
• Edge Designs
• Coverage Designs
• Spread Designs
Spread Design
Select a subset S of fixed size m so as to maximize the minimum distance between points in S.
Higgs’ Algorithm:
-- Choose points sequentially: at each step, maximize minimum distance to already selected points.
-- Leads to “near optimal” solution
Choice of distance greatly affects resulting design.
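The sequential rule described above can be sketched as a greedy maximin selection. The function names, the choice of Hamming distance, the first-wins tie-breaking, and the toy data are assumptions; any distance function on bit strings could be passed in instead.

```python
from typing import Callable, List, Sequence


def hamming(A: Sequence[int], B: Sequence[int]) -> int:
    """Number of positions where the two bit strings differ."""
    return sum(1 for x, y in zip(A, B) if x != y)


def greedy_spread(points: Sequence[Sequence[int]], m: int,
                  distance: Callable = hamming, seed_index: int = 0) -> List[int]:
    """Pick m indices; each new pick maximizes its minimum distance to prior picks."""
    selected = [seed_index]
    while len(selected) < m:
        best_i, best_d = None, -1
        for i in range(len(points)):
            if i in selected:
                continue
            d = min(distance(points[i], points[j]) for j in selected)
            if d > best_d:  # ties go to the lowest index
                best_i, best_d = i, d
        selected.append(best_i)
    return selected


def min_pairwise(points, selected, distance: Callable = hamming) -> int:
    """Maximin criterion value: smallest distance between any two chosen points."""
    return min(distance(points[i], points[j])
               for k, i in enumerate(selected) for j in selected[k + 1:])


compounds = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
design = greedy_spread(compounds, m=3)
criterion = min_pairwise(compounds, design)
```

Because the result depends on the starting point, repeating the selection from different `seed_index` values and keeping the design with the largest criterion is a natural extension.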
XOR (Hamming Distance)
XOR (Hamming): Only accounts for bits that don’t match
$$d_{XOR}^2 = \sum_k \mathrm{XOR}(A_k, B_k)$$
A: 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 … 0 1 0 0 0
B: 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 … 0 0 0 1 1
Larger structures have more bits that don’t match each other
Diversity Result: Tends to favor larger structures with a lot of features
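The size bias of the XOR (Hamming) distance can be seen in a small example. Both pairs below share no features at all, yet the pair with more on bits is rated four times farther apart. The function name and bit strings are illustrative.

```python
from typing import Sequence


def xor_distance_sq(A: Sequence[int], B: Sequence[int]) -> int:
    """Squared XOR distance: the count of mismatched bits."""
    return sum(x ^ y for x, y in zip(A, B))


# Two small structures with one feature each, none shared.
small_a = [1, 0, 0, 0, 0, 0, 0, 0]
small_b = [0, 1, 0, 0, 0, 0, 0, 0]

# Two large structures with four features each, none shared.
large_a = [1, 1, 1, 1, 0, 0, 0, 0]
large_b = [0, 0, 0, 0, 1, 1, 1, 1]

d_small = xor_distance_sq(small_a, small_b)
d_large = xor_distance_sq(large_a, large_b)
```

A maximin design built on this distance would therefore tend to pick feature-rich compounds.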
Modified Tanimoto
$$T_M = \alpha\, T_1 + (1 - \alpha)\, T_0$$
where $\alpha = \frac{2 - p}{3}$ and $p = \frac{a + b}{2n}$ ($n$ = number of bits in each string).
Measures similarity based on both the presence (on bits) and absence (off bits) of features.
When there are fewer on bits: T1 is weighted more heavily.
When there are fewer off bits: T0 is weighted more heavily.
As a variation, p may be fixed by external considerations. The result is called the P-Modified Tanimoto distance.
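A sketch of the modified Tanimoto similarity as reconstructed from this slide: a weighted average of the on-bit coefficient T1 = c / (a + b - c) and the off-bit coefficient T0 = d / (a + b - 2c + d), with weight (2 - p) / 3 on T1 and p = (a + b) / (2n). Taking n as the string length, the function name, and the zero-denominator convention are assumptions. Passing an explicit `p` gives the P-modified variant.

```python
from typing import Optional, Sequence


def modified_tanimoto(A: Sequence[int], B: Sequence[int],
                      p: Optional[float] = None) -> float:
    """Weighted mix of on-bit (T1) and off-bit (T0) Tanimoto similarity."""
    n = len(A)  # assumed: n is the length of each bit string
    a, b = sum(A), sum(B)
    c = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)
    d = n - a - b + c  # bits off in both strings
    t1_den = a + b - c
    t0_den = a + b - 2 * c + d
    t1 = c / t1_den if t1_den else 1.0  # assumed convention for all-off strings
    t0 = d / t0_den if t0_den else 1.0  # assumed convention for all-on strings
    if p is None:
        p = (a + b) / (2 * n)  # proportion of on bits across both strings
    alpha = (2 - p) / 3
    return alpha * t1 + (1 - alpha) * t0


A = [1, 0, 0, 0, 0, 0, 0, 0]
B = [1, 1, 0, 0, 0, 0, 0, 0]
mt = modified_tanimoto(A, B)            # p = 3/16, so T1 gets weight 29/48
mt_fixed = modified_tanimoto(A, B, 0.5)  # P-modified variant with P = .5
```

With sparse strings like these, p is small, so the on-bit coefficient carries close to two thirds of the weight.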
Implementing Spread Designs
• Maximin vs Average Distance
• Higgs’ Algorithm
• Stochastic Searches
• Near Optimal Solutions
Medicinal Drug Database
• 186 Leadscope Features
  – Prevalence Range: 0.001-0.956
  – Median: 0.090
  – Mean: 0.142
• 1089 Drugs now in market
  – Range: 5-70 distinct features per compound
  – Median: 24 (12.8%) features per compound
  – Mean: 26.4 (14.2%) features per compound
Procedure
• Use Higgs algorithm
• Apply with 4 different metrics
• Use each of 1089 compounds as initial seed
• Pick best (maximin distance) 150 designs for each metric
• Evaluate balance criterion for all designs
• Summarize
Average Number of Distinct Features of Sampled Compounds
(Population Median 24 features/cmpd)

| Sample Size | Hamming | Tanimoto | Mod. Tan. | P-Mod. Tan. (P = .5) |
| --- | --- | --- | --- | --- |
| 10 | 45.7 | 14.8 | 20.1 | 21.1 |
| 20 | 44.8 | 16.0 | 20.2 | 21.0 |
| 40 | 43.7 | 16.9 | 21.2 | 21.3 |
Balances of Best Spread Design (of size 20) for Each Distance
[Figure: P1 balance criterion for four distances: Tanimoto, modified Tanimoto, p-modified Tanimoto, and Hamming. Axis tick ranges: 0.05 to 0.25 and 20 to 100.]
Acknowledgements
Ohio State University
  Statistics: Michael Fligner, Joseph Verducci
  Medicinal Chemistry: Robert Brueggemeier, Jeanette Richardson
NCI: John Weinstein, MD, PhD
LeadScope, Inc.
  Computational Chem.: Paul Blower, Kevin Cross, Glenn Myatt, Chihae Yang
Funding: NCI SBIR 1R43CA96083; TAF ODOD