QSAR Study of HIV Protease Inhibitors Using Neural Network and Genetic Algorithm
Akmal Aulia,1 Sunil Kumar,2 Rajni Garg,*3 A. Srinivas Reddy4
1Computational Science Research Center, San Diego State University, CA; 2ECE Dept., San Diego State University, San Diego, CA; 3Chem. Dept., California State University, San Marcos, CA; 4Molecular Modeling Group, IICT, Hyderabad, India.
[Figure: Descriptor Thinning Results. Panels: Total Descriptors; IC50 set: Final Descriptors; EC50 set: Final Descriptors]
Introduction
Linear and non-linear regression techniques were employed to analyze a large dataset of 334 HIV protease inhibitors (Kempf et al.).
The dataset was studied using MLR (Multiple Linear Regression) and ANN (Artificial Neural Network) techniques to develop QSAR (Quantitative Structure-Activity Relationship) models.
Each ligand (inhibitor or drug molecule) was described by physico-chemical and structural descriptors (features) that encode constitutional, electrostatic, geometrical, quantum and topological properties.
The capability of the descriptors to capture the variation among ligands was linked to the predictive power of the QSAR models.
Combined information from these models helps in 'transforming data into information and information into knowledge' from a cheminformatics point of view.
Materials and Methods
Reported dataset (Kempf et al.) with experimental biological activity (EC50 and IC50).
A lower-energy conformation was obtained for each compound by molecular mechanics minimization.
A total of 277 descriptors were calculated.
Objective descriptors (MATLAB): IC50 dataset reduced from 277 to 148; EC50 dataset reduced from 277 to 157.
Subjective descriptors (WEKA/GA): IC50 dataset reduced from 148 to 9; EC50 dataset reduced from 157 to 7.
Both MLR and FNN methods were implemented in WEKA.
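The poster does not state which criteria were used for the objective descriptor thinning step. As an illustration only, the sketch below assumes two common filters: drop near-constant descriptors, then drop any descriptor that is highly correlated with one already kept. The function name `thin_descriptors`, both thresholds, and the toy data are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def thin_descriptors(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return the names of descriptors that survive two assumed filters.

    columns: dict mapping descriptor name -> list of values, one per compound.
    Both thresholds are illustrative assumptions, not values from the poster.
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # filter 1: descriptor is (nearly) constant across compounds
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # filter 2: redundant with a descriptor that was already kept
        kept.append(name)
    return kept


# Hypothetical toy descriptor table for five compounds.
toy = {
    "rel_C": [0.30, 0.32, 0.35, 0.31, 0.40],
    "rel_C_x2": [0.60, 0.64, 0.70, 0.62, 0.80],  # scaled copy of rel_C
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],          # carries no information
    "charge_H": [0.12, 0.05, 0.20, 0.07, 0.02],
}
kept = thin_descriptors(toy)
print(kept)
```

In this sketch the order of the columns decides which member of a correlated pair is kept; a real workflow might instead keep the member that correlates better with the measured activity.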
References
(1) Fernandez et al. "Quantitative structure-activity relationship to predict differential inhibition of aldose reductase by flavonoid compounds", Bioorganic and Medicinal Chemistry, 2005, 13, 3269-3277.
(2) (a) CODESSA software, Semichem Inc., USA; (b) MATLAB, The MathWorks Inc.; (c) WEKA software, the University of Waikato, New Zealand.
(3) Fernandez, M. and Caballero, J. "Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks", J. Mol. Model., 2006, 12, 168-181.
(4) Goldberg, D. E. Genetic Algorithms in Search, Optimization & Machine Learning; Addison-Wesley: Reading, MA, 2000.
(5) "Data Mining: Practical Machine Learning Tools and Techniques", 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
IC50 dataset: Descriptors Content

Type            Name / Description
Constitutional  Relative number of C atoms
Constitutional  Relative number of N atoms
Constitutional  Relative number of rings
Electrostatic   Max partial charge for a H atom [Zefirov's PC]
Electrostatic   FNSA-3 Fractional PNSA (PNSA-3/TMSA) [Zefirov's PC]
Quantum         Max electroph. react. index for a C atom
Topological     Average information content (order 0)
Topological     Average structural information content (order 2)
Topological     Balaban index

EC50 dataset: Descriptors Content

Type            Name / Description
Constitutional  Relative number of aromatic bonds
Electrostatic   Min partial charge for a C atom [Zefirov's PC]
Electrostatic   Max partial charge for a H atom [Zefirov's PC]
Electrostatic   Max partial charge for a N atom [Zefirov's PC]
Quantum         Max net atomic charge for a N atom
Topological     Average complementary information content (order 2)
Topological     Balaban index
Varying no. of hidden nodes on FNN: EC50 dataset

Hid.    Training Set        10 folds C.V.       66% split           90% split
Node    R2      RMSE        R2      RMSE        R2      RMSE        R2      RMSE
5       0.6531  5.4255      0.2409  9.0114      0.5004  8.1552      0.9746  5.9407
6       0.7326  4.9155      0.1899  10.8496     0.4924  8.1916      0.9797  5.2638
7       0.7590  4.6711      0.1777  11.0438     0.5044  8.1313      0.9813  5.2615
8       0.6905  5.1708      0.1889  10.2258     0.4921  8.1996      0.9489  6.3812
9       0.6981  5.1194      0.2047  9.7865      0.4399  8.6151      0.9783  5.3847

IC50 dataset
Varying no. of hidden nodes on FNN: IC50 dataset

Hid.    Training Set        10 folds C.V.       66% split           90% split
Node    R2      RMSE        R2      RMSE        R2      RMSE        R2      RMSE
7       0.9922  6.2792      0.8668  26.1962     0.8789  22.4858     0.9158  5.5917
8       0.9923  6.3615      0.8617  26.0392     0.8937  20.8355     0.9022  5.9416
9       0.9947  5.2231      0.8827  23.9733     0.9108  19.8493     0.9205  5.0076
10      0.9973  3.6241      0.8810  24.1843     0.8917  20.5547     0.9097  5.3009
11      0.9953  5.1195      0.8622  25.8577     0.9130  20.6060     0.9106  6.2389

EC50 dataset
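The hidden-node tables report R2 and RMSE on the training set, under 10-fold cross-validation, and on percentage splits. The sketch below shows one way such figures can be computed; it is not the WEKA procedure used in the study. It assumes R2 is the coefficient of determination (the poster does not say which definition was used), uses a one-descriptor least-squares line as a stand-in for the MLR and FNN models, and runs on hypothetical toy data. The function names are illustrative.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def r2_rmse(y_true, y_pred):
    """Coefficient of determination (1 - SS_res/SS_tot) and root-mean-square error."""
    n = len(y_true)
    my = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, sqrt(ss_res / n)


def k_fold_predictions(xs, ys, k):
    """Predict every sample with a model fitted without that sample's fold."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        test = [i for i in range(len(xs)) if i % k == fold]
        train = [i for i in range(len(xs)) if i % k != fold]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a + b * xs[i]
    return preds


# Hypothetical toy data: one descriptor value and one activity per compound.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

a, b = fit_line(x, y)
train_r2, train_rmse = r2_rmse(y, [a + b * xi for xi in x])
cv_r2, cv_rmse = r2_rmse(y, k_fold_predictions(x, y, k=5))
print(f"training: R2={train_r2:.4f} RMSE={train_rmse:.4f}")
print(f"5-fold CV: R2={cv_r2:.4f} RMSE={cv_rmse:.4f}")
```

Cross-validated figures are computed only from predictions on compounds the model did not see during fitting, which is why they are usually worse than the training-set figures, as in the tables above.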
Summary & Future Work
For the IC50 dataset, the constitutional and topological properties have the largest contribution, while for the EC50 dataset, electrostatic and topological properties are significant.
Non-linear models have better predictive capability. However, the linear models can be interpreted better mechanistically.
The presence of similar descriptors in both types of models validates our results.
Further studies using other statistical and ANN-based regression techniques are in progress to find the best QSAR models and descriptors.
These models will serve as useful computational tools for predicting the biological activity of this class of HIV protease inhibitors.
Research Design