Chemometrics and Intelligent Laboratory Systems 38 (1997) 127-137
Predicting maximum bioactivity by effective inversion of neural networks using genetic algorithms
Frank R. Burden a, Brendan S. Rosewarne a, David A. Winkler b,*
a Chemistry Department, Monash University, Clayton, Victoria 3168, Australia
b CSIRO Division of Molecular Science, Private Bag 10, Clayton, Victoria 3168, Australia
Received 10 November 1996; accepted 28 April 1997
Abstract
Recently neural networks have been applied with some success to the study of quantitative structure-activity relationships. One limitation of their use is that, while they are able to predict the biological activity of a new molecule from its physicochemical properties, it is difficult to get them to solve the more interesting problem of predicting the required molecular properties of a more active molecule. This paper proposes a method for solving this inverse problem using genetic algorithms and explores their potential for the task. Suggestions for more potent dihydrofolate reductase inhibitors are made. © 1997 Elsevier Science B.V.
Keywords: Neural network; Genetic algorithm; QSAR; DHFR inhibition; Drug design; Activity prediction
1. Introduction
A relatively recent and interesting development in the field of QSAR has been the application of artificial neural networks (ANNs) to the discovery of structure-activity mappings. Recent work has shown that ANNs do indeed show promise in this application [1-3].
ANNs provide several benefits when compared with conventional multivariate QSAR methods. Firstly, they do not require any assumptions to be made about the nature of the functional relationships between molecular descriptors and biological activity. Secondly, they are generally able to function well without the use of indicator variables often required for conventional regression analyses [4]. Finally, they are often capable of producing better generalizations, since they are able to model a complicated activity surface better than conventional methods.

* Corresponding author. E-mail: [email protected].
ANNs do have drawbacks: they require relatively large data sets to train effectively, they can be overtrained (leading to loss of generalization), they may require long training times, and they are essentially a black box, with much of the detail of how they function hidden from the user. This last drawback makes it difficult to study the surfaces they create or estimate the limits of their accuracy. Thus, while it is possible to use neural networks to predict the activity of a molecule given its physicochemical properties, the inverse problem of finding the values of the independent variables which give the best possible activity is much more difficult. This paper explores the use of genetic algorithms [5] as a means for finding such maxima on structure-activity surfaces and draws some conclusions about their usefulness and limitations.

0169-7439/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0169-7439(97)00052-X

Table 1
Structure, experimentally determined DHFR inhibitory activity, and physicochemical properties of diaminodihydrotriazines
Columns: #, R, log 1/C, π2, π3, π4, MR2, MR3, MR4, Σσ3,4, I1, I2, I3, I4, I5, I6, Set
2. Experimental
2.1. Biological data
Our data set of 256 5-phenyl-3,4-diamino-6,6-dimethyldihydrotriazine (1) dihydrofolate reductase (DHFR) inhibitors was obtained from the paper by Andrea and Kalayeh [1]. It comprises the 132-compound subset relating to DHFR inhibition in Walker 256 carcinoma cells and the 113-compound subset relating to the L1210 tumor line (11 compounds had a non-hydrogen R2 and were omitted from the analysis [1]). This DHFR inhibitor data set has been used as a de facto standard in several previous QSAR papers (e.g., [6-8]). In the Walker 256 data set, 100 compounds were assigned to the training set and 32 to the test set on the basis of cluster analysis [1]. In the L1210 data set, 57 compounds were assigned to the training set and 56 to the test set, also on the basis of cluster analysis [1].
2.2. Molecular representation
The input parameters related to the physicochemical properties of substituents at the 3 and 4 positions on the phenyl ring (compounds with non-hydrogen substituents at R2 were omitted from the analyses). These were: the hydrophobicities (π3, π4); the molar refractivities (MR3 and MR4); and the sum of the electronic parameters at positions 3 and 4 (Σσ3,4). The full data set showing structures, biological activities, substituent constants, and indicator variables is given in Tables 1 and 2.
2.3. Neural network
In order to effect a direct comparison with earlier work [1], a 5:8:1 (i.e., 5 nodes in the input layer, 8 nodes in the hidden layer and one node in the output layer) fully connected feed-forward/back-propagation neural net was used. Previously, eight hidden nodes were found to be the optimum number of nodes to minimize the variance for this data set [2]. The network was trained using Propagator [9] on an i486 desktop computer. Transfer functions were linear for the input layer and sigmoidal for the hidden and output layers [1]. The training rate was 0.001 and the momentum term 0.6.
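The forward pass of such a 5:8:1 network can be sketched as follows. This is a minimal illustration only: the function names and the weight layout are assumptions made here, and the weights are placeholders rather than the trained values from Propagator.

```python
import math
import random


def sigmoid(x):
    """Logistic transfer function used for the hidden and output layers."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Propagate 5 descriptor values through 8 sigmoid hidden nodes to 1 sigmoid output.

    The input layer is linear, so the inputs are passed on unchanged.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def random_network(n_in=5, n_hidden=8, seed=0):
    """Hypothetical placeholder weights; a real study would load the trained weights."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b_out = rng.uniform(-1.0, 1.0)
    return w_hidden, b_hidden, w_out, b_out
```

Because the output node is sigmoidal, the prediction always lies strictly between 0 and 1, which is why the activities have to be scaled before training.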
Since Propagator is not programmed to carry out a full cross-validation study where each data point is, in turn, abstracted from the full set at each cycle, a partial validation was carried out. A randomly chosen subset of 15 data points was taken from the test set of Ref. [1] to serve as a validation set; this was repeated seven times to check for outliers in the training, but none were found. The final training, test and validation sets, used to train the network to be searched by the genetic algorithm, are denoted in Tables 1 and 2. The number of training cycles was dictated by the validation curve, i.e., when the validation error started to rise, training was stopped and the weights saved.
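The partial validation and the stopping rule described above can be sketched as below. The helper names are hypothetical, and the rule shown (stop at the last cycle before the validation error first rises) is one simple reading of the text, not a description of Propagator's internals.

```python
import random


def validation_subsets(test_indices, subset_size=15, repeats=7, seed=0):
    """Draw `repeats` random validation subsets of `subset_size` points from the test set."""
    rng = random.Random(seed)
    return [sorted(rng.sample(test_indices, subset_size)) for _ in range(repeats)]


def stopping_cycle(validation_errors):
    """Return the index of the last training cycle before the validation error first rises.

    If the error never rises, the final cycle is returned.
    """
    for cycle in range(1, len(validation_errors)):
        if validation_errors[cycle] > validation_errors[cycle - 1]:
            return cycle - 1
    return len(validation_errors) - 1
```

In practice the weights saved at the returned cycle would be the ones used for the subsequent search.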
In most neural net studies the biological data are scaled to the range 0.01-0.99 to accommodate the sigmoid transfer function in the output layer. However, in order to give more latitude for extrapolation and avoid saturation of the sigmoid transfer function in the neural network, the log 1/C values were scaled to the range 0.2-0.8 in our study; a narrower range over-restricts the squashing functionality of the sigmoid function.
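A short sketch of this scaling, assuming a simple linear min-max mapping (the paper does not state the exact formula). The activity limits used in the usage note are illustrative values, not the limits of the data set.

```python
def scale_activity(y, y_min, y_max, lo=0.2, hi=0.8):
    """Map log 1/C linearly from [y_min, y_max] onto [lo, hi]."""
    return lo + (hi - lo) * (y - y_min) / (y_max - y_min)


def unscale_activity(s, y_min, y_max, lo=0.2, hi=0.8):
    """Invert scale_activity: map a network output back to log 1/C."""
    return y_min + (s - lo) * (y_max - y_min) / (hi - lo)
```

With illustrative limits of 3.0 and 9.0, a network output of 0.8 maps back to 9.0, while outputs between 0.8 and 1.0 map to activities above 9.0. That headroom is the latitude for extrapolation that the 0.01-0.99 scaling largely removes.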
2.4. Genetic algorithm
After training the neural network, the weights were used to derive a forward propagation formula. A commercial genetic algorithm program, Evolver [10], was then used to search the activity surface for maxima. Evolver uses 16 bit precision for real numbers which gives a precision of 6.5 decimal digits. In this genetic algorithm search a gene pool of 50 was used, with a crossover rate (probability of mating) of 0.5 or 50%. The mutation rate (probability of random bit mutation) was 0.06 or 6%. The stopping criterion was that the change in the last 100 generations was less than 0.0001. The GA program, Evolver, returns only the optimum solution, so sub-optimal solutions are not available for this report.
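A generic genetic algorithm using the settings quoted above (population 50, crossover rate 0.5, mutation rate 0.06, stop when the best value changes by less than 0.0001 over 100 generations) might look like the sketch below. This is not Evolver's algorithm, which is proprietary: the real-valued encoding, tournament selection, uniform crossover, Gaussian mutation and elitism are all assumptions chosen for brevity.

```python
import random


def genetic_maximize(fitness, bounds, pop_size=50, crossover_rate=0.5,
                     mutation_rate=0.06, stall_generations=100,
                     tolerance=1e-4, max_generations=5000, seed=0):
    """Maximize `fitness` over the box `bounds` (a list of (low, high) pairs)."""
    rng = random.Random(seed)

    def tournament(scored):
        # Pick two distinct individuals at random and keep the fitter one.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []
    for _ in range(max_generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        best_score, best = scored[0]
        history.append(best_score)
        # Stop when the best value improved by less than `tolerance`
        # over the last `stall_generations` generations.
        if (len(history) > stall_generations
                and history[-1] - history[-1 - stall_generations] < tolerance):
            break
        children = [best[:]]  # elitism: the current best always survives
        while len(children) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            if rng.random() < crossover_rate:
                # Uniform crossover: each gene comes from either parent.
                child = [x if rng.random() < 0.5 else y for x, y in zip(p1, p2)]
            else:
                child = p1[:]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    # Gaussian mutation, clipped to the search bounds.
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            children.append(child)
        population = children
    return best, best_score
```

For the inversion problem, `fitness` would be the trained network's forward propagation formula and `bounds` the allowed range of each descriptor. Like Evolver, this sketch returns only the best solution found.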
Table 2 (continued)
Entries for compounds 172-256 (log 1/C from 7.85 to 8.76), with substituent constants, indicator variables and set assignment.
Table 3
Results from training a 5:8:1 network with log 1/C scaled in the range 0.2-0.8 for the 132-compound subset of the Walker 256 leukemia strain

| π3 | π4 | MR3 | MR4 | Σσ3,4 | log 1/C | GA search bounds |
|---|---|---|---|---|---|---|
| -0.53 | -1.38 | 5.5 | 75.5 | -0.20 | 10.06 | range of training set |
| -1.44 | -0.27 | 25.2 | 89.0 | 0.34 | 10.09 | training set + 10% |
| 0.95 | 5.47 | 13.5 | 96.3 | -0.72 | 10.12 | chemical bounds |
We investigated constraining the search in three separate ways. Firstly, the parameter space was constrained to the parameter range of the training set. Secondly, the search region was limited to the parameter range of the training set plus 10%. Finally, the region was constrained to cover only chemically reasonable values (-1 < σ < 1.5, -2 < π < 6, 0 < MR < 100). It should be noted that the MR values quoted in Andrea and Kalayeh [1] are 0.1 times the true values in order to balance the range of the input parameters to the neural network.
The combination of scaling log 1/C to the range 0.2-0.8 and expanding the search range to chemically accessible values was considered to be optimal for this problem. Narrowing the scaling more than this runs the risk of entering the linear region of the transfer function. Going too far outside the range of the training set runs the risk of extrapolation into oscillatory behavior of the overall net function as well as into chemically unrealistic regions.
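The three kinds of search bounds can be sketched as follows. The function names are hypothetical, and "plus 10%" is interpreted here as widening each range by 10% of its width on both sides, which is an assumption; the paper does not spell out the rule.

```python
# Chemically reasonable limits quoted in the text, in the order
# pi3, pi4, MR3, MR4, sum of sigma(3,4). MR is given on the true scale.
CHEMICAL_LIMITS = {"pi": (-2.0, 6.0), "MR": (0.0, 100.0), "sigma": (-1.0, 1.5)}


def training_range(columns):
    """Bounds equal to the minimum and maximum of each descriptor in the training set."""
    return [(min(col), max(col)) for col in columns]


def expanded_range(bounds, fraction=0.10):
    """Widen each (low, high) pair by `fraction` of its width on both sides (assumed rule)."""
    return [(lo - fraction * (hi - lo), hi + fraction * (hi - lo)) for lo, hi in bounds]


def chemical_bounds():
    """Bounds for the five network inputs based on the chemically reasonable limits."""
    pi, mr, sigma = (CHEMICAL_LIMITS[k] for k in ("pi", "MR", "sigma"))
    return [pi, pi, mr, mr, sigma]
```

Any of the three lists can be passed as the box constraints of the genetic algorithm search. If the network inputs use MR values multiplied by 0.1, as in Andrea and Kalayeh's tables, the MR limits would need the same factor.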
3. Results and discussion
3.1. Comparison with previous multiple linear regressions
Silipo and Hansch [6] carried out a multiple linear regression analysis of the DHFR data for both tumor lines and obtained the following QSAR:
$$\log(1/C) = 0.680\,\pi_3 - 0.118\,\pi_3^2 + 0.230\,\mathrm{MR}_4 - 0.0243\,\mathrm{MR}_4^2 + 0.238\,I_1 - 2.530\,I_2 - 1.991\,I_3 + 0.877\,I_4 + 0.686\,I_5 + 0.704\,I_6 + 6.489$$

N = 244, s = 0.377, R = 0.923
where the π and MR values have the previous meanings and the six indicator variables I1-I6 represent the following biological or structural features: I1 = 1 (Walker 256 cell line), = 0 (L1210 cell line); I2 = 1 for a non-hydrogen R2 substituent; I3 = 1 for rigid groups attached to the N-phenyl ring; I4 = 1 for congeners containing the highly active leaving group C6H4SO2OC6H4X; I5 = 1 for conformationally flexible bridges between the N-phenyl ring and a second phenyl ring; I6 = 1 for bridges of the type CH2NHCONHC6H4X, CH2CH2CON(R)C6H4X, and CH2CH2CH2CON(R)C6H4X (R = H, Me) when these groups are attached at the 3 or 4 position of the N-phenyl ring. Note that Silipo and Hansch [6] excluded 12 compounds as outliers.
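Because the regression is quadratic in π3 and MR4, its maximum can be located analytically: for a term b·x - a·x², the optimum lies at x = b/(2a). The sketch below evaluates this for the Silipo and Hansch coefficients; the function and variable names are illustrative only.

```python
def mlr_log_inv_c(pi3, mr4, i1=0, i2=0, i3=0, i4=0, i5=0, i6=0):
    """Silipo and Hansch regression for log 1/C with six indicator variables."""
    return (0.680 * pi3 - 0.118 * pi3 ** 2
            + 0.230 * mr4 - 0.0243 * mr4 ** 2
            + 0.238 * i1 - 2.530 * i2 - 1.991 * i3
            + 0.877 * i4 + 0.686 * i5 + 0.704 * i6
            + 6.489)


# Vertex of each parabola: x_opt = b / (2a)
PI3_OPT = 0.680 / (2 * 0.118)    # about 2.88
MR4_OPT = 0.230 / (2 * 0.0243)   # about 4.7

# Largest predicted activity for each cell line with I4 = 1 and I2 = I3 = I5 = I6 = 0
walker_max = mlr_log_inv_c(PI3_OPT, MR4_OPT, i1=1, i4=1)   # about 9.13
l1210_max = mlr_log_inv_c(PI3_OPT, MR4_OPT, i1=0, i4=1)    # about 8.89
```

Of the indicator terms, only I1 and I4 raise the maximum in this setting; the negative coefficients of I2 and I3 lower it, and I5 and I6 are set to zero here.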
It is difficult to make direct comparisons of the maxima found by the MLRI method and the work presented here, since the use of the six indicator variables splits the data set into 2^6 = 64 subsets, of which the indicator set giving the largest log 1/C has been selected below. The work here only makes use of one indicator variable, I1, and the two resulting subsets, known as the Walker 256 cell line and the L1210 cell line, were trained and tested separately.

The regression analysis gives the optimum values of π3 and MR4 as 2.88 and 4.7 respectively. The maximum log 1/C is therefore 9.13 (Walker 256 cell line) and 8.89 (L1210 cell line), assuming I4 = 1, I5 = I6 = 0. While the correlation is statistically satisfactory, interpretation of the results is difficult, with a significant percentage of the variance being explained by the indicator variables rather than the more chemically relevant substituent constants.

Notes to Table 2:
a: I = 1; compounds with a non-hydrogen substituent at R2, not considered here.
b: I1 = 0; 57 compounds used in the training set, taken from the 113 compounds in the L1210 set.
c: I1 = 0; 21 compounds used in the validation set, taken from the 113 compounds in the L1210 set.
d: I1 = 0; 35 compounds used in the testing set, taken from the 113 compounds in the L1210 set.
e: I1 = 1; 100 compounds used in the training set, taken from the 132 compounds in the Walker set.
f: I1 = 1; 14 compounds used in the validation set, taken from the 132 compounds in the Walker set.
g: I1 = 1; 18 compounds used in the testing set, taken from the 132 compounds in the Walker set.

3.2. Consistency check

The results from the genetic algorithm searches are shown in Tables 3 and 4. An internal consistency check using the neural network trained with log 1/C scaled between 0.01 and 0.99 was run initially using the 113 values from the Walker 256 cell line. The neural net training resulted in a root-mean-square error (RMSE) in the test set of 0.06 and an RMSE for the validation set of 0.075. The latter corresponds to around 13.1% of the mean scaled value of 0.5727. The genetic algorithm search of this surface (which is essentially an interpolation) located maxima close to the highest value in the input data, 8.76 (corresponding to 0.99 in the scaled data).

3.3. Extrapolation runs

In order to minimize any saturation effects that the sigmoidal transfer function might have on the ability of the network to extrapolate, the neural network was trained with the log 1/C scaled to between 0.2 and 0.8. The resultant neural network had an RMSE of 0.05, which corresponds to 8.9% of the expected mean scaled value of 0.5727. The genetic algorithm searches were repeated and the results are summarized in Tables 3 and 4. The tables show that there is considerable variability in π4 and Σσ3,4, which is consistent with the flat response surfaces shown by Andrea and Kalayeh [1] in their analysis of the parametric sensitivities of the biological response to the individual parameters. Silipo and Hansch also found that these parameters were not statistically significant in their multiple linear regression analysis [6].

Table 4
Results from training a 5:8:1 network with log 1/C scaled in the range 0.2-0.8 for the 113-compound subset of the L1210 cell line

| π3 | π4 | MR3 | MR4 | Σσ3,4 | log 1/C | GA search bounds |
|---|---|---|---|---|---|---|
| 4.32 | 3.51 | 7.46 | 10.8 | -0.24 | 9.66 | range of training set |
| 4.03 | 4.51 | 5.70 | 4.2 | -0.21 | 9.70 | training set + 10% |
| 5.34 | -1.88 | 32.2 | 15.3 | -0.91 | 9.77 | chemical bounds |

3.4. Interpretation of extrapolated runs

As noted by Andrea and Kalayeh [1], the slopes of the response surfaces are sensitive to the values of all five independent variables, suggesting substantial inter-variable couplings which neural networks are able to take into account. Clearly the curvature of the individual response surfaces depends on the values of the other variables, but near the response maximum π4 and Σσ3,4 are flat and are therefore uninformative.

Using the values of the parameters predicted for the search over the chemically accessible region, the data set of Andrea and Kalayeh [1] and the tables of Hansch and Leo [11] were searched for appropriate substituents. Given the flatness of the response surface to π3 and Σσ3,4, we have given most weight to the other parameters in the selection of substituents. Table 5 summarizes the results, suggesting substituents having parameters close to the required optimum. The predicted value of optimum log 1/C for the Walker 256 leukemia tumors is 10.1. This is considerably higher than the maximum of the training data set of 8.74.

Similar calculations were carried out for the 113-compound subset of Andrea and Kalayeh [1] relating to the L1210 cell line. The results, shown in Table 5, indicate that a range of compounds having optimum substituents would exhibit log 1/C values from 8.89 to 9.35, which are also higher than the training set maximum of 8.37.

Table 5
Compounds predicted to have high DHFR inhibitory properties

| Cell line | R3 | π3 | MR3 | R4 | π4 | MR4 | Σσ3,4 | Predicted log 1/C |
|---|---|---|---|---|---|---|---|---|
| Walker 256 | GA values | 0.95 | 13.5 | GA values | 5.47 | 96.3 | -0.72 | 10.12 |
| Walker 256 | -C2H5 | 1.02 | 10.3 | -OCH2C6H4-4'-OC6H3-3'',4''-Cl2 | 4.63 | 77.9 | -0.34 | 10.12 |
| L1210 | GA values | 5.34 | 32.2 | GA values | -1.88 | 15.3 | -0.91 | 9.79 |
| L1210 | CH2Si(C2H5)3 | 3.26 | 43.5 | NHSO2CH3 | -1.18 | 18.2 | -0.42 | 9.35 |
| | | | | -OCH2CON(CH2CH2)2O | -1.39 | 34.9 | -0.43 | 9.27 |
| | | | | CH2CH2COOH | -0.29 | 16.5 | -0.28 | 9.14 |
| | | | | cyclopropyl | 1.14 | 13.5 | -0.42 | 9.10 |
| | | | | n-propyl | 1.55 | 15.0 | -0.34 | 8.89 |

4. Conclusions

Our study has shown, as has previous work on QSAR using neural nets, that indicator variables may be dispensed with in neural net structure-activity studies. As the majority of the variance in Silipo and Hansch's [6] QSAR analysis is accounted for by the indicator variables, neural networks offer considerable advantages over MLRI in this respect.

The genetic algorithm was able to predict values of the biological activity that were higher than those in the training set for both leukemia subsets. It is difficult to assess whether the genetic algorithm has found the real maxima since the neural network is the only satisfactory model of the data set. It is interesting to note that the regions in which the genetic algorithm found a maximum vary quite markedly according to the region searched. Unless sufficient scaling latitude is given, it is possible that the region containing the maximum will be compressed by the neural network to fit within the boundaries of the sigmoid function, thus obscuring the height and location of the maximum.

Genetic algorithms show some promise in solving the neural network inversion problem for QSAR and may indeed be useful in finding the general position of maxima of many QSAR surfaces, thereby helping to predict which substituents are likely to give rise to higher biological activity. It is important to limit the search to a chemically reasonable parameter space. Several questions remain for further study: how the shape of the activity surface affects the genetic algorithm's performance; how well the genetic algorithm is able to determine the global maximum on the activity surface; and what effect linear dependence has on the capacity of genetic algorithms to find maxima. There is some evidence to suggest genetic algorithms may perform better on linearly independent data, even though linear dependence is of little consequence to a neural network. Work in this area is continuing, with a focus on inversion of the neural nets by backpropagation of errors to modify the molecular parameters of an existing active compound.

References

[1] T.A. Andrea, H.J. Kalayeh, J. Med. Chem. 34 (1991) 2824-2836.
[2] D.T. Manallack, D.J. Livingstone, Med. Chem. Res. 2 (1992) 181-190.
[3] S.-S. So, W.G. Richards, J. Med. Chem. 35 (1992) 3201-3207.
[4] D.W. Salt, N. Yildiz, D.J. Livingstone, C.J. Tinsley, Pestic. Sci. 36 (1992) 161-170.
[5] S. Forrest, Science 261 (1993) 872-878.
[6] C. Silipo, C. Hansch, J. Am. Chem. Soc. 97 (1975) 6849.
[7] R.T. Kroemer, P. Hecht, J. Comput.-Aided Mol. Des. 9 (1995) 396-406.
[8] J.D. Hirst, R.D. King, M.J.E. Sternberg, J. Comput.-Aided Mol. Des. 8 (1994) 421-432.
[9] Propagator, ARD Corporation, 1993, Columbia, U.S.A.
[10] Evolver Ver. 2.1, Axcelis Inc., 1994, Seattle, U.S.A.
[11] C. Hansch, A. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biology, J. Wiley and Sons, Brisbane, 1979.