Chemometrics and Intelligent Laboratory Systems 38 (1997) 127-137

Predicting maximum bioactivity by effective inversion of neural networks using genetic algorithms

Frank R. Burden a, Brendan S. Rosewarne a, David A. Winkler b,*

a Chemistry Department, Monash University, Clayton, Victoria 3168, Australia

b CSIRO Division of Molecular Science, Private Bag 10, Clayton, Victoria 3168, Australia

Received 10 November 1996; accepted 28 April 1997

Abstract

Recently, neural networks have been applied with some success to the study of quantitative structure-activity relationships. One limitation of their use is that, while they are able to predict the biological activity of a new molecule from its physicochemical properties, it is difficult to get them to solve the more interesting problem of predicting the molecular properties required of a more active molecule. This paper proposes a method for solving this inverse problem using genetic algorithms and explores its potential. Suggestions for more potent dihydrofolate reductase inhibitors are made. © 1997 Elsevier Science B.V.

Keywords: Neural network; Genetic algorithm; QSAR; DHFR inhibition; Drug design; Activity prediction

1. Introduction

A relatively recent and interesting development in the field of QSAR has been the application of artificial neural networks (ANNs) to the discovery of structure-activity mappings. Recent work has shown that ANNs do indeed show promise in this application [1-3].

ANNs provide several benefits when compared with conventional multivariate QSAR methods. Firstly, they do not require any assumptions to be made about the nature of the functional relationships between molecular descriptors and biological activity. Secondly, they are generally able to function well without the use of indicator variables often required for conventional regression analyses [4]. Finally, they are often capable of producing better generalizations, since they are able to model a complicated activity surface better than conventional methods.

* Corresponding author. E-mail: [email protected].

ANNs do have drawbacks: they require relatively large data sets to train effectively, they can be overtrained (leading to loss of generalization), they may require long training times, and they are essentially a black box, with many of the details of how they function hidden from the user. This last drawback makes it difficult to study the surfaces they create or to estimate the limits of their accuracy. Thus, while it is possible to use neural networks to predict the activity of a molecule given its physicochemical properties, the inverse problem of finding the values of the independent variables that give the best possible activity is much more difficult. This paper explores the use of genetic algorithms [5] as a means of finding such maxima on structure-activity surfaces and draws some conclusions about their usefulness and limitations.

0169-7439/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0169-7439(97)00052-X

Table 1
Structure, experimentally determined DHFR inhibitory activity, and physicochemical properties of diaminodihydrotriazines
Columns: #, R, log 1/C, π2, π3, π4, MR2, MR3, MR4, Σσ3,4, I1, I2, I3, I4, I5, I6, Set
[Table body omitted: the legible rows cover compounds 1-42, with log 1/C from 3.43 to 6.66.]

2. Experimental

2.1. Biological data

Our data set of 256 5-phenyl-3,4-diamino-6,6-dimethyldihydrotriazine (1) dihydrofolate reductase (DHFR) inhibitors was obtained from the paper by Andrea and Kalayeh [1]. It comprises the 132-compound sub-set relating to DHFR inhibition in Walker 256 carcinoma cells and the 113-compound sub-set relating to the L1210 tumor line (11 compounds had a non-hydrogen R2 and were omitted from the analysis [1]). This DHFR inhibitor data set has been used as a de facto standard in several previous QSAR papers (e.g., [6-8]). In the Walker 256 data set, 100 compounds were assigned to the training set and 32 to the test set on the basis of cluster analysis [1]. In the L1210 data set, 57 compounds were assigned to the training set and 56 to the test set, also on the basis of cluster analysis [1].

2.2. Molecular representation

The input parameters related to the physicochemical properties of substituents at the 3 and 4 positions on the phenyl ring (compounds with non-hydrogen substituents at R2 were omitted from the analyses). These were: the hydrophobicities (π3, π4); the molar refractivities (MR3 and MR4); and the sum of the electronic parameters at positions 3 and 4 (Σσ3,4). The full data set, showing structures, biological activities, substituent constants, and indicator variables, is given in Tables 1 and 2.

2.3. Neural network

In order to effect a direct comparison with earlier work [1], a 5:8:1 (i.e., 5 nodes in the input layer, 8 nodes in the hidden layer and one node in the output layer) fully connected feed-forward/back-propagation neural net was used. Eight hidden nodes had previously been found to be the optimum number to minimize the variance for this data set [2]. The network was trained using Propagator [9] on an i486 desktop computer. Transfer functions were linear for the input layer and sigmoidal for the hidden and output layers [1]. The training rate was 0.001 and the momentum term 0.6.
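As a minimal sketch of the forward pass through such a 5:8:1 network, the following Python fragment evaluates a linear input layer followed by sigmoidal hidden and output nodes. The function names and the random weights are illustrative assumptions; they are not the trained weights of this study, and Propagator's internals are not reproduced here.

```python
import math
import random


def sigmoid(x):
    """Logistic squashing function used for the hidden and output nodes."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Propagate 5 inputs through 8 sigmoidal hidden nodes to 1 sigmoidal output."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical weights for illustration only (not the trained network).
rng = random.Random(0)
W_HIDDEN = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
B_HIDDEN = [rng.uniform(-1, 1) for _ in range(8)]
W_OUT = [rng.uniform(-1, 1) for _ in range(8)]
B_OUT = rng.uniform(-1, 1)

# Inputs ordered as pi3, pi4, MR3, MR4, sum-sigma (arbitrary example values).
y = forward([0.5, 1.0, 0.3, 0.8, -0.1], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
```

Because the output node is sigmoidal, the returned value always lies strictly between 0 and 1, which is why the activities have to be scaled before training.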

Since Propagator is not programmed to carry out a full cross-validation study, in which each data point is in turn removed from the full set at each cycle, a partial validation was carried out instead. A randomly chosen sub-set of 15 data points was taken from the test set of Ref. [1] to serve as a validation set; this was done seven times to check for outliers in the training, but none were found. The final training, test and validation sets, used to train the network to be searched by the genetic algorithm, are denoted in Tables 1 and 2. The number of training cycles was dictated by the validation curve, i.e., when the validation error started to rise, training was stopped and the weights saved.
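The stopping rule described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure built into Propagator: the function name and the `patience` parameter are hypothetical, and the error values are made up.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the index of the epoch whose weights would be saved.

    Training is considered finished once the validation error has failed to
    improve for `patience` consecutive epochs (patience is an assumption).
    """
    best_epoch, best_error, bad_epochs = 0, float("inf"), 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error, bad_epochs = epoch, error, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Made-up validation curve: the error falls, then starts to rise at epoch 4.
print(early_stopping_epoch([0.30, 0.20, 0.12, 0.10, 0.11, 0.15]))  # 3
```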

In most neural net studies the biological data are scaled to the range 0.01-0.99 to accommodate the sigmoid transfer function in the output layer. However, in order to give more latitude for extrapolation and avoid saturation of the sigmoid transfer function in the neural network, the log 1/C values were scaled to the range 0.2-0.8 in our study; a narrower range over-restricts the squashing functionality of the sigmoid function.
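The effect of the narrower scaling range can be illustrated with a short sketch. The linear mapping below is an assumption about how the scaling was carried out, and the limits 4.0 and 8.0 are hypothetical training-set extremes chosen only for the example.

```python
def scale(value, lo, hi, a=0.2, b=0.8):
    """Map an activity in [lo, hi] linearly onto [a, b]."""
    return a + (b - a) * (value - lo) / (hi - lo)


def unscale(scaled, lo, hi, a=0.2, b=0.8):
    """Invert scale(): map a network output back to a log 1/C value."""
    return lo + (scaled - a) * (hi - lo) / (b - a)


LO, HI = 4.0, 8.0  # hypothetical training-set minimum and maximum of log 1/C

# A network output above 0.8 corresponds to an activity above the training maximum.
extrapolated = unscale(0.95, LO, HI)  # approximately 9.0
```

With the 0.01-0.99 scaling the sigmoid output has almost no headroom above the training maximum, whereas the 0.2-0.8 scaling leaves the interval 0.8-1.0 available for extrapolated activities.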

2.4. Genetic algorithm

After training the neural network, the weights were used to derive a forward propagation formula. A commercial genetic algorithm program, Evolver [10], was then used to search the activity surface for maxima. Evolver uses 16-bit precision for real numbers, which gives a precision of 6.5 decimal digits. In this genetic algorithm search a gene pool of 50 was used, with a crossover rate (probability of mating) of 0.5 (50%) and a mutation rate (probability of random bit mutation) of 0.06 (6%). The stopping criterion was that the change over the last 100 generations was less than 0.0001. Evolver returns only the optimum solution, so sub-optimal solutions are not available for this report.
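The search can be sketched as a small real-coded genetic algorithm. Evolver's internal operators are not described in the text, so the truncation selection, uniform crossover, uniform-reset mutation and elitism used below are assumptions chosen for brevity; only the pool size (50), crossover rate (0.5), mutation rate (0.06) and stopping rule (change below 0.0001 over 100 generations) are taken from the description above. The toy fitness function is a hypothetical stand-in for the network's forward propagation formula.

```python
import random


def genetic_maximize(fitness, bounds, pop_size=50, crossover_rate=0.5,
                     mutation_rate=0.06, window=100, tol=1e-4,
                     max_generations=5000, seed=0):
    """Maximize `fitness` over box `bounds`; returns (best point, best value)."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection (assumed)
        children = [ranked[0][:]]              # elitism keeps the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:  # uniform crossover (assumed)
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = a[:]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[i] = rng.uniform(lo, hi)  # reset mutation (assumed)
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
        # Stop when the best value changed by less than tol over the window.
        if len(history) > window and history[-1] - history[-1 - window] < tol:
            break
    return best, history[-1]


def toy_fitness(v):
    """Hypothetical smooth surface with its maximum (value 0) at (1, 1)."""
    return -sum((x - 1.0) ** 2 for x in v)


best, value = genetic_maximize(toy_fitness, [(-2.0, 6.0), (-2.0, 6.0)])
```

In the setting of this paper, `fitness` would be the forward propagation formula of the trained network and `bounds` one of the search regions over the five input parameters.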

Table 2
Columns: #, R, log 1/C, π2, π3, π4, MR2, MR3, MR4, Σσ3,4, I1, I2, I3, I4, I5, I6, Set
[Table body omitted: the legible rows cover compounds 172-256, with log 1/C from 7.85 to 8.76.]

Table 3
Results from training a 5:8:1 network with log 1/C scaled in the range 0.2-0.8 for the 132-compound sub-set of the Walker 256 leukemia strain

π3       π4       MR3     MR4     Σσ3,4    log 1/C   GA search bounds
-0.53    -1.38     5.5    75.5    -0.20    10.06     range of training set
-1.44    -0.27    25.2    89.0     0.34    10.09     training set + 10%
 0.95     5.47    13.5    96.3    -0.72    10.12     chemical bounds

We investigated constraining the search in three separate ways. Firstly, the parameter space was constrained to the parameter range of the training set. Secondly, the search region was limited to the parameter range of the training set plus 10%. Finally, the region was constrained to cover only chemically reasonable values (-1 < σ < 1.5, -2 < π < 6, 0 < MR < 100). It should be noted that the MR values quoted in Andrea and Kalayeh [1] are 0.1 times the true values, in order to balance the range of the input parameters to the neural network.

The combination of scaling log 1/C to the range 0.2-0.8 and expanding the search range to chemically accessible values was considered to be optimal for this problem. Narrowing the scaling more than this runs the risk of entering the linear region of the transfer function. Going too far outside the range of the training set runs the risk of extrapolation into oscillatory behavior of the overall net function as well as into chemically unrealistic regions.
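The three search regions can be expressed as box constraints, as in the sketch below. Reading "plus 10%" as widening each interval by 10% of its width on both sides is an assumption, as is the upper σ limit of 1.5, which is our reading of the printed range; the training rows are hypothetical, and the helper names are illustrative.

```python
def training_range(rows):
    """Per-parameter (min, max) over the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def expand(bounds, fraction=0.10):
    """Widen every interval by `fraction` of its width on each side (assumed)."""
    return [(lo - fraction * (hi - lo), hi + fraction * (hi - lo))
            for lo, hi in bounds]


def inside(point, bounds):
    """True if every coordinate lies within its interval."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, bounds))


# Chemically reasonable limits, ordered pi3, pi4, MR3, MR4, sum-sigma
# (true MR scale; the sigma limits are an assumed reading of the text).
CHEMICAL_BOUNDS = [(-2.0, 6.0), (-2.0, 6.0), (0.0, 100.0), (0.0, 100.0), (-1.0, 1.5)]

# Hypothetical training rows, for illustration only.
TRAIN = [(0.0, 1.0, 10.0, 20.0, -0.2), (2.0, 3.0, 30.0, 60.0, 0.4)]

narrow = training_range(TRAIN)
wider = expand(narrow)

# The chemical-bounds optimum reported for the Walker 256 set (Table 3).
walker_optimum = (0.95, 5.47, 13.5, 96.3, -0.72)
print(inside(walker_optimum, CHEMICAL_BOUNDS))  # True
```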

3. Results and discussion

3.1. Comparison with previous multiple linear regressions

Silipo and Hansch [6] carried out a multiple linear regression analysis of the DHFR data for both tumor lines and obtained the following QSAR:

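$$
\begin{aligned}
\log 1/C ={} & 0.680\,\pi_3 - 0.118\,\pi_3^2 + 0.230\,\mathrm{MR}_4 - 0.0243\,\mathrm{MR}_4^2 \\
& + 0.238\,I_1 - 2.530\,I_2 - 1.991\,I_3 + 0.877\,I_4 + 0.686\,I_5 + 0.704\,I_6 + 6.489
\end{aligned}
$$

$N = 244$, $s = 0.377$, $R = 0.923$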

where the π and MR values have their previous meanings and the six indicator variables I1-I6 represent the following biological or structural features: I1 = 1 (Walker 256 cell line), = 0 (L1210 cell line); I2 = 1 for a non-hydrogen R2 substituent; I3 = 1 for rigid groups attached to the N-phenyl ring; I4 = 1 for congeners containing the highly active leaving group C6H4SO2OC6H4X; I5 = 1 for conformationally flexible bridges between the N-phenyl ring and a second phenyl ring; I6 = 1 for bridges of the type CH2NHCONHC6H4X, CH2CH2CON(R)C6H4X, and CH2CH2CH2CON(R)C6H4X (R = H, Me) when these groups are attached at the 3 or 4 position of the N-phenyl ring. Note that Silipo and Hansch [6] excluded 12 compounds as outliers.
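Because the regression is quadratic in π3 and MR4, its maximum can be located analytically. The sketch below evaluates the equation at the stationary point, with I4 = 1 and I2 = I3 = I5 = I6 = 0; the function name is a hypothetical label, and the coefficients are those of the equation above.

```python
def silipo_hansch_log_inv_c(pi3, mr4, i1=0, i2=0, i3=0, i4=0, i5=0, i6=0):
    """Evaluate the multiple linear regression equation for log 1/C."""
    return (0.680 * pi3 - 0.118 * pi3 ** 2
            + 0.230 * mr4 - 0.0243 * mr4 ** 2
            + 0.238 * i1 - 2.530 * i2 - 1.991 * i3
            + 0.877 * i4 + 0.686 * i5 + 0.704 * i6
            + 6.489)


# A parabola a*x - b*x**2 peaks at x = a / (2*b).
PI3_OPT = 0.680 / (2 * 0.118)    # about 2.88
MR4_OPT = 0.230 / (2 * 0.0243)   # about 4.73

walker_max = silipo_hansch_log_inv_c(PI3_OPT, MR4_OPT, i1=1, i4=1)
l1210_max = silipo_hansch_log_inv_c(PI3_OPT, MR4_OPT, i1=0, i4=1)
print(round(walker_max, 2), round(l1210_max, 2))  # 9.13 8.89
```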

It is difficult to make direct comparisons of the maxima found by the MLRI method and the work presented here, since the use of the six indicator variables splits the data set into 2^6 = 64 subsets, of which the indicator set giving the largest log 1/C has been selected below. The work here makes use of only one indicator variable, I1, and the two resulting sub-sets, known as the Walker 256 cell line and the L1210 cell line, were trained and tested separately.

The regression analysis gives the optimum values of π3 and MR4 as 2.88 and 4.7 respectively. The maximum log 1/C is therefore 9.13 (Walker 256 cell line) and 8.89 (L1210 cell line), assuming I4 = 1, I5 = I6 = 0. While the correlation is statistically satisfactory, interpretation of the results is difficult, with a significant percentage of the variance being explained by the indicator variables rather than the more chemically relevant substituent constants.

Notes to Table 2:
a: I2 = 1; compounds with a non-hydrogen substituent at R2, not considered here.
b: I1 = 0; 57 compounds used in the training set, taken from the 113 compounds in the L1210 set.
c: I1 = 0; 21 compounds used in the validation set, taken from the 113 compounds in the L1210 set.
d: I1 = 0; 35 compounds used in the testing set, taken from the 113 compounds in the L1210 set.
e: I1 = 1; 100 compounds used in the training set, taken from the 132 compounds in the Walker set.
f: I1 = 1; 14 compounds used in the validation set, taken from the 132 compounds in the Walker set.
g: I1 = 1; 18 compounds used in the testing set, taken from the 132 compounds in the Walker set.

3.2. Consistency check

The results from the genetic algorithm searches are shown in Tables 3 and 4. An internal consistency check using the neural network trained with log 1/C scaled between 0.01 and 0.99 was run initially using the 113 values from the Walker 256 cell line. The neural net training resulted in a root-mean-square error (RMSE) in the test set of 0.06 and an RMSE for the validation set of 0.075. The latter corresponds to around 13.1% of the mean scaled value of 0.5727. The genetic algorithm search of this surface (which is essentially an interpolation) located maxima that occurred close to the highest of the input data, 8.76 (corresponding to 0.99 in the scaled data).

3.3. Extrapolation runs

In order to minimize any saturation effects that the sigmoidal transfer function might have on the ability of the network to extrapolate, the neural network was trained with log 1/C scaled to between 0.2 and 0.8. The resultant neural network had an RMSE of 0.05, which corresponds to 8.9% of the expected mean scaled value of 0.5727. The genetic algorithm searches were repeated and the results are summarized in Tables 3 and 4. The tables show that there is considerable variability in π3 and Σσ3,4, which is consistent with the flat response surfaces shown by Andrea and Kalayeh [1] in their analysis of the parametric sensitivities of the biological response to the individual parameters. Silipo and Hansch also found that these parameters were not statistically significant in their multiple linear regression analysis [6].

Table 4
Results from training a 5:8:1 network with log 1/C scaled in the range 0.2-0.8 for the 113-compound sub-set of the L1210 cell line

π3       π4       MR3     MR4     Σσ3,4    log 1/C   GA search bounds
4.32     3.51     7.46    10.8    -0.24    9.66      range of training set
4.03     4.51     5.70     4.2    -0.21    9.70      training set + 10%
5.34    -1.88    32.2     15.3    -0.91    9.77      chemical bounds

3.4. Interpretation of extrapolated runs

As noted by Andrea and Kalayeh [1], the slopes of the response surfaces are sensitive to the values of all five independent variables, suggesting substantial inter-variable couplings which neural networks are able to take into account. Clearly the curvature of the individual response surfaces depends on the values of the other variables, but near the response maximum π3 and Σσ3,4 are flat and are therefore uninformative.

Using the values of the parameters predicted for the search over the chemically accessible region, the data set of Andrea and Kalayeh [1] and the tables of Hansch and Leo [11] were searched for appropriate substituents. Given the flatness of the response surface to π3 and Σσ3,4, we have given most weight to the other parameters in the selection of substituents. Table 5 summarizes the results, suggesting substituents having parameters close to the required optimum. The predicted value of optimum log 1/C for the Walker 256 leukemia tumors is 10.1. This is considerably higher than the maximum of the training data set of 8.74.

Table 5
Compounds predicted to have high DHFR inhibitory properties

Cell line    R3             π3     MR3    R4                                  π4      MR4    Σσ3,4   Predicted log 1/C
Walker 256   GA values      0.95   13.5   GA values                            5.47   96.3   -0.72   10.12
Walker 256   -C2H5          1.02   10.3   -OCH2C6H4-4'-SO2C6H3-3'',4''-Cl2     4.63   77.9   -0.34   10.12
L1210        GA values      5.34   32.2   GA values                           -1.88   15.3   -0.91    9.79
L1210        CH2Si(C2H5)3   3.26   43.5   NHSO2CH3                            -1.18   18.2   -0.42    9.35
                                          -OCH2CON(CH2CH2)2O                  -1.39   34.9   -0.43    9.27
                                          CH2CH2COOH                          -0.29   16.5   -0.28    9.14
                                          cyclopropyl                          1.14   13.5   -0.42    9.10
                                          n-propyl                             1.55   15.0   -0.34    8.89

Similar calculations were carried out for the 113-compound sub-set of Andrea and Kalayeh [1] relating to the L1210 cell line. The results, shown in Table 5, indicate that a range of compounds having optimum substituents would exhibit log 1/C values from 8.89 to 9.35, which are also higher than the training set maximum of 8.37.

4. Conclusions

Our study has shown, as has previous work on QSAR using neural nets, that indicator variables may be dispensed with in neural net structure-activity studies. As the majority of the variance in Silipo and Hansch's [6] QSAR analysis is accounted for by the indicator variables, neural networks offer considerable advantages over MLRI in this respect.

The genetic algorithm was able to predict values of the biological activity that were higher than those in the training set for both leukemia sub-sets. It is difficult to assess whether the genetic algorithm has found the real maxima, since the neural network is the only satisfactory model of the data set. It is interesting to note that the regions in which the genetic algorithm found a maximum vary quite markedly according to the region searched. Unless sufficient scaling latitude is given, it is possible that the region containing the maximum will be compressed by the neural network to fit within the boundaries of the sigmoid function, thus obscuring the height and location of the maximum.

Genetic algorithms show some promise in solving the neural network inversion problem for QSAR and may indeed be useful in finding the general position of maxima of many QSAR surfaces, thereby helping to predict which substituents are likely to give rise to higher biological activity. It is important to limit the search to a chemically reasonable parameter space. Several questions remain for further study: how the shape of the activity surface affects the genetic algorithm's performance; how well the genetic algorithm is able to determine the global maximum on the activity surface; and what effect linear dependence has on the capacity of genetic algorithms to find maxima. There is some evidence to suggest that genetic algorithms may perform better on linearly independent data, even though linear dependence is of little consequence to a neural network. Work in this area is continuing, with a focus on inversion of the neural nets by backpropagation of errors to modify the molecular parameters of an existing active compound.

References

[1] T.A. Andrea, H.J. Kalayeh, J. Med. Chem. 34 (1991) 2824-2836.

[2] D.T. Manallack, D.J. Livingstone, Med. Chem. Res. 2 (1992) 181-190.

[3] S.-S. So, W.G. Richards, J. Med. Chem. 35 (1992) 3201-3207.

[4] D.W. Salt, N. Yildiz, D.J. Livingstone, C.J. Tinsley, Pestic. Sci. 36 (1992) 161-170.

[5] S. Forest, Science 261 (1993) 872-878.

[6] C. Silipo, C. Hansch, J. Am. Chem. Soc. 97 (1975) 6849.

[7] R.T. Kroemer, P. Hecht, J. Comput.-Aided Mol. Des. 9 (1995) 396-406.

[8] J.D. Hirst, R.D. King, M.J.E. Sternberg, J. Comput.-Aided Mol. Des. 8 (1994) 421-432.

[9] Propagator, ARD Corporation, 1993, Columbia, U.S.A.

[10] Evolver Ver. 2.1, Axcelis Inc., 1994, Seattle, U.S.A.

[11] C. Hansch, A. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biology, J. Wiley and Sons, Brisbane, 1979.
