


QSAR Study of HIV Protease Inhibitors Using Neural Network and Genetic Algorithm

Akmal Aulia,1 Sunil Kumar,2 Rajni Garg,*3 A. Srinivas Reddy4

1 Computational Science Research Center, San Diego State University, CA; 2 ECE Dept., San Diego State University, San Diego, CA; 3 Chem. Dept., California State University, San Marcos, CA; 4 Molecular Modeling Group, IICT, Hyderabad, India.

Introduction

Linear and non-linear regression techniques were employed to analyze a large dataset of 334 HIV protease inhibitors (Kempf et al.). The dataset was studied using MLR (Multiple Linear Regression) and ANN (Artificial Neural Network) techniques to develop QSAR (Quantitative Structure-Activity Relationship) models.

Each ligand (inhibitor or drug molecule) was described by physico-chemical and structural descriptors (features) that encode constitutional, electrostatic, geometrical, quantum and topological properties. The capability of the descriptors to address the variations among the ligands was linked to the predictive power of the QSAR models.

Combined information from these models helps in 'transforming data into information and information into knowledge' from a chem-informatics point of view.

Materials and Methods

Dataset: reported dataset (Kempf et al.) with experimental biological activities (EC50 and IC50).

A low-energy conformation was obtained for each compound by molecular mechanics minimization.

A total of 277 descriptors were calculated.

Objective descriptor thinning (MATLAB): IC50 dataset reduced from 277 to 148 descriptors; EC50 dataset reduced from 277 to 157 descriptors.

Subjective descriptor thinning (WEKA/GA): IC50 dataset reduced from 148 to 9 descriptors; EC50 dataset reduced from 157 to 7 descriptors.

Both MLR and FNN methods were implemented in WEKA.
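The poster does not say which objective criteria were applied in MATLAB. As a rough illustration only, the sketch below assumes one common choice: discard near-constant descriptors, then discard any descriptor that is highly correlated with one already kept. It is written in Python with NumPy rather than MATLAB, and the function name, the thresholds and the toy data are all hypothetical.

```python
import numpy as np

def thin_descriptors(X, var_tol=1e-8, corr_max=0.95):
    """Return the column indices kept after objective thinning.

    Assumed criteria (not taken from the poster): drop near-constant
    columns, then drop any column whose absolute Pearson correlation
    with an already-kept column exceeds corr_max.
    """
    X = np.asarray(X, dtype=float)
    # Step 1: discard descriptors that barely vary across compounds.
    candidates = [j for j in range(X.shape[1]) if X[:, j].var() > var_tol]
    # Step 2: greedily keep descriptors that are not redundant.
    kept = []
    for j in candidates:
        redundant = any(
            abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > corr_max for k in kept
        )
        if not redundant:
            kept.append(j)
    return kept

# Toy descriptor matrix: column 1 is a linear copy of column 0,
# column 2 is constant, column 3 is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X_demo = np.column_stack([a, 2.0 * a + 1.0, np.ones(50), b])
kept_demo = thin_descriptors(X_demo)
print("kept descriptor columns:", kept_demo)
```

A filter of this kind looks only at the descriptor matrix and never at the activity values, which is what distinguishes it from the GA-based step that follows.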

References

(1) Fernandez et al.; "Quantitative structure-activity relationship to predict differential inhibition of aldose reductase by flavonoid compounds", Bioorganic and Medicinal Chemistry, 2005, 13, 3269-3277.

(2) (a) CODESSA software, Semichem Inc., USA; (b) MATLAB, The MathWorks Inc.; (c) WEKA software, the University of Waikato, New Zealand.

(3) Fernandez, M. and Caballero, J.; "Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks", J. Mol. Model., 2006, 12, 168-181.

(4) Goldberg, D. E.; Genetic Algorithms in Search, Optimization, and Machine Learning; Addison-Wesley: Reading, MA, 2000.

(5) Witten, I. H. and Frank, E.; "Data Mining: Practical Machine Learning Tools and Techniques", 2nd Edition, Morgan Kaufmann, San Francisco, 2005.

Descriptor Thinning Results

IC50 dataset: Descriptors Content

Type            Name / Description
Constitutional  Relative number of C atoms
Constitutional  Relative number of N atoms
Constitutional  Relative number of rings
Electrostatic   Max partial charge for a H atom [Zefirov's PC]
Electrostatic   FNSA-3 Fractional PNSA (PNSA-3/TMSA) [Zefirov's PC]
Quantum         Max electroph. react. index for a C atom
Topological     Average information content (order 0)
Topological     Average structural information content (order 2)
Topological     Balaban index

EC50 dataset: Descriptors Content

Type            Name / Description
Constitutional  Relative number of aromatic bonds
Electrostatic   Min partial charge for a C atom [Zefirov's PC]
Electrostatic   Max partial charge for a H atom [Zefirov's PC]
Electrostatic   Max partial charge for a N atom [Zefirov's PC]
Quantum         Max net atomic charge for a N atom
Topological     Average complementary information content (order 2)
Topological     Balaban index
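The final descriptor sets listed above came out of the GA-based selection step run in WEKA. The poster gives no GA settings, so the following is only a minimal sketch of the general idea, not WEKA's implementation. Each individual is a bit mask over the descriptors, and its cost is the MLR training RMSE plus a small penalty per selected descriptor. The cost function, population size, mutation rate, penalty and toy data are all assumptions made for illustration.

```python
import numpy as np

def mlr_rmse(X, y):
    """Training RMSE of a least-squares linear fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def ga_select(X, y, n_pop=30, n_gen=40, p_mut=0.05, penalty=0.01, seed=0):
    """Toy genetic algorithm over descriptor subsets (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((n_pop, n_feat)) < 0.5      # random initial bit masks

    def cost(mask):
        if not mask.any():
            return np.inf                        # an empty subset is invalid
        return mlr_rmse(X[:, mask], y) + penalty * mask.sum()

    for _ in range(n_gen):
        costs = np.array([cost(m) for m in pop])
        # Truncation selection: the better half survives unchanged (elitism).
        parents = pop[np.argsort(costs)[: n_pop // 2]]
        children = []
        while len(children) < n_pop - len(parents):
            i, j = rng.integers(len(parents), size=2)
            cut = rng.integers(1, n_feat)        # one-point crossover
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            # Bit-flip mutation on the new child only.
            child = np.logical_xor(child, rng.random(n_feat) < p_mut)
            children.append(child)
        pop = np.vstack([parents, np.array(children)])

    costs = np.array([cost(m) for m in pop])
    return np.flatnonzero(pop[np.argmin(costs)])

# Toy data: only columns 1 and 4 actually drive the activity.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 8))
y_demo = 1.5 * X_demo[:, 1] - 2.0 * X_demo[:, 4] + rng.normal(scale=0.05, size=60)
selected = ga_select(X_demo, y_demo)
print("selected descriptor columns:", selected)
```

The per-descriptor penalty is what pushes the search toward small subsets such as the 9 and 7 descriptors reported here. Without it, training RMSE alone would never get worse when a descriptor is added.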

Varying no. of hidden nodes on FNN: EC50 dataset

             Training Set       10-fold C.V.       66% split          90% split
Hid. Nodes   R2       RMSE      R2       RMSE      R2       RMSE      R2       RMSE
5            0.6531   5.4255    0.2409   9.0114    0.5004   8.1552    0.9746   5.9407
6            0.7326   4.9155    0.1899   10.8496   0.4924   8.1916    0.9797   5.2638
7            0.7590   4.6711    0.1777   11.0438   0.5044   8.1313    0.9813   5.2615
8            0.6905   5.1708    0.1889   10.2258   0.4921   8.1996    0.9489   6.3812
9            0.6981   5.1194    0.2047   9.7865    0.4399   8.6151    0.9783   5.3847
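The four column groups in these tables correspond to four ways of scoring a model: on its own training set, by 10-fold cross-validation, and on held-out data after a 66% or 90% training split. The sketch below shows how such figures could be produced for a plain MLR model. It assumes that R2 is the squared Pearson correlation between observed and predicted activity, which the poster does not confirm. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients; coef[0] is the intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def r2_rmse(y_true, y_pred):
    """Squared Pearson correlation (assumed R2 definition) and RMSE."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(r ** 2), float(rmse)

def percentage_split(X, y, train_frac, seed=0):
    """Train on a random train_frac of the data, score on the rest."""
    idx = np.random.default_rng(seed).permutation(len(y))
    n_train = int(round(train_frac * len(y)))
    train, test = idx[:n_train], idx[n_train:]
    coef = fit_mlr(X[train], y[train])
    return r2_rmse(y[test], predict_mlr(coef, X[test]))

def k_fold_cv(X, y, k=10, seed=0):
    """Pool the out-of-fold predictions, then score them once."""
    idx = np.random.default_rng(seed).permutation(len(y))
    y_pred = np.empty(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = fit_mlr(X[train], y[train])
        y_pred[fold] = predict_mlr(coef, X[fold])
    return r2_rmse(y, y_pred)

# Synthetic data standing in for a descriptor matrix and activities.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

results = {
    "Training Set": r2_rmse(y_demo, predict_mlr(fit_mlr(X_demo, y_demo), X_demo)),
    "10-fold C.V.": k_fold_cv(X_demo, y_demo, k=10),
    "66% split": percentage_split(X_demo, y_demo, 0.66),
    "90% split": percentage_split(X_demo, y_demo, 0.90),
}
for name, (r2, rmse) in results.items():
    print(f"{name:13s} R2={r2:.4f} RMSE={rmse:.4f}")
```

A 90% split leaves only a small test set, so its R2 and RMSE depend strongly on which compounds happen to be held out. Cross-validation scores every compound exactly once as unseen data.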

Varying no. of hidden nodes on FNN: IC50 dataset

             Training Set       10-fold C.V.       66% split          90% split
Hid. Nodes   R2       RMSE      R2       RMSE      R2       RMSE      R2       RMSE
7            0.9922   6.2792    0.8668   26.1962   0.8789   22.4858   0.9158   5.5917
8            0.9923   6.3615    0.8617   26.0392   0.8937   20.8355   0.9022   5.9416
9            0.9947   5.2231    0.8827   23.9733   0.9108   19.8493   0.9205   5.0076
10           0.9973   3.6241    0.8810   24.1843   0.8917   20.5547   0.9097   5.3009
11           0.9953   5.1195    0.8622   25.8577   0.9130   20.6060   0.9106   6.2389
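The hidden-node sweep reported in these tables was run with WEKA's FNN. To illustrate what is being varied, here is a minimal single-hidden-layer network in NumPy, trained by full-batch gradient descent and refitted for several hidden-layer sizes. It does not reproduce WEKA's learner. The sigmoid activation, learning rate, epoch count and synthetic data are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_fnn(X, y, n_hidden, n_epochs=2000, lr=0.05, seed=0):
    """One sigmoid hidden layer, linear output, mean-squared-error loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.5, size=n_hidden)
    b2 = 0.0
    for _ in range(n_epochs):
        H = sigmoid(X @ W1 + b1)              # hidden activations, shape (n, h)
        err = H @ w2 + b2 - y                 # prediction error, shape (n,)
        # Backpropagated gradients of (1 / 2n) * sum(err ** 2).
        g_w2 = H.T @ err / n
        g_b2 = err.mean()
        dZ = np.outer(err, w2) * H * (1.0 - H)
        g_W1 = X.T @ dZ / n
        g_b1 = dZ.mean(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return W1, b1, w2, b2

def predict_fnn(params, X):
    W1, b1, w2, b2 = params
    return sigmoid(X @ W1 + b1) @ w2 + b2

# Synthetic non-linear target standing in for activity data.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(80, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]

rmse_by_nodes = {}
for n_hidden in (5, 7, 9):
    params = train_fnn(X_demo, y_demo, n_hidden)
    resid = predict_fnn(params, X_demo) - y_demo
    rmse_by_nodes[n_hidden] = float(np.sqrt(np.mean(resid ** 2)))
print(rmse_by_nodes)
```

The loop reports training-set RMSE only. Choosing the number of hidden nodes from training error alone favours larger networks, which is why the tables also report cross-validated and held-out figures.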

Summary & Future Work

For the IC50 dataset, the constitutional and topological properties make the largest contribution, while for the EC50 dataset, electrostatic and topological properties are significant.

Non-linear models have better predictive capability. However, the linear models can be interpreted better mechanistically. The presence of similar descriptors in both types of models validates our results.

Further studies using other statistical and ANN-based regression techniques are in progress, in order to find the best QSAR models and descriptors.

These models will serve as useful computational tools for predicting the biological activity of this class of HIV protease inhibitors.

Research Design