1
QSAR/QSPR Model development and Validation
for successful prediction and interpretation
8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009
Mohsen Kompany-Zareh
In the name of GOD
Contents:
2
Introduction
Selwood data set (all descriptors)
Model development
Model validation
Statistical diagnostics (R2, q2, RMSEC, RMSEP, RMSECV)
Internal validation
QUIK
Selwood data (a number of descriptors)
Descriptor selection
LMO and Jackknife
Cross model validation
Bootstrapping
Training and test set selection
Leverage
3
QSPR/QSAR (quantitative structure-property / structure-activity relationship)
A mathematical relation between structural attribute(s) and a property (or an activity) of a set of chemicals.
Applications:
- Prediction of a property for a variety of chemicals, prior to expensive synthesis and experimental measurement.
- Determination of the environmental risk of thousands of untested industrial chemicals.
- Description of a mechanism of action for a variety of chemicals.
Introduction
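[Figure: a QSAR model relates the descriptor matrix X (descriptors: Lipoph., LUMO, MW, Surf. Area; rows: molec. 1 to molec. 6) to the activity vector y; "??" marks activities to be predicted.]
Descriptor rows as extracted (assignment of columns to descriptor names is not recoverable):
1.885   120.93   476.92   122.04
2.913   108.77   508.56   150.17
3.312   122.85   554.01   164.08
3.711   123.92   571.26   178.10
2.696   120.49   505.61   156.01
3.106   119.98   518099   247.93
Activities (y) as extracted: 2.924, 1.992, 1.987, 1.544, 2.079, 1.530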
Introduction
5
Data preparation:
1. Collection and cleaning of target property data; selection of accurate, precise and consistent experimental data.
2. Calculation of molecular descriptors for chemicals with acceptable target properties (after optimization of the conformation).
More than 3000 descriptors can be calculated.
Introduction
Software: DRAGON (Todeschini et al., 2001), ADAPT (Jurs, 2002; Stuper and Jurs, 1976), OASIS (Mekenyan and Bonchev, 1986), CODESSA (Katritzky et al., 1994), Gaussian, …
6
Unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes.
The starting point is a very large number of descriptors from different software packages.
As few explanatory descriptors as possible, for simple interpretation of the model (sometimes by variable selection).
[Diagram: Structure, Descriptors, Model, Activity]
Descriptors:
- Topological (edges and vertices)
- Geometric (surface, volume, …)
- Electronic (electron density, local charges)
- Constitutional (#C, #OH, …)
Introduction
7
Selwood data: D (31×53), y (31×1)
>> load selwood.txt;
>> D = selwood(:,1:end-1);
>> y = selwood(:,end);
31 molecules, 53 descriptors
31 antifilarial antimycin analogues characterized by 53 physicochemical descriptors.
Selwood, et al., J Med Chem (1990) 33, 136.
Data set
8
Model generation:
Independent variables: descriptors
Dependent variables: properties (activities)
Model development methods:
- Multiple linear regression (MLR)
- Partial least squares (PLS)
- Artificial neural networks (ANNs)
- k-nearest neighbor
Model development
Number of samples < number of descriptors !!
9
Multiple linear regression, the simplest model:
D b = y,   b = D⁺ y
>> b = D\y;
>> yEST = D*b;
[Figure: regression coefficients b versus descriptor index (left) and yEST versus y (right); R2 = 1.]
22 of 53 coefficients are zero!!
Model is developed.
Application of the model?
Validation?
Model development
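A minimal MATLAB sketch, not taken from the slides, of why R2 = 1 is to be expected whenever there are more descriptors than samples: even random data are fitted exactly. The names Drand and yrand are hypothetical, and pinv is used here to mirror b = D⁺ y (the slide itself uses the backslash operator).

```matlab
% Sketch under stated assumptions: random "descriptors" and random "activities"
% with the same shape as the Selwood matrix (31 samples, 53 descriptors).
n = 31;  p = 53;
Drand = randn(n, p);          % hypothetical descriptor matrix, no real structure
yrand = randn(n, 1);          % hypothetical activities, unrelated to Drand

b    = pinv(Drand) * yrand;   % minimum-norm least-squares solution, b = D+ y
yEST = Drand * b;             % fitted values

% Coefficient of determination on the training data
R2 = 1 - sum((yrand - yEST).^2) / sum((yrand - mean(yrand)).^2);
fprintf('R2 on random data = %.4f\n', R2);
```

Because the 31 equations have 53 unknowns, the fit is exact up to rounding, so a perfect R2 here says nothing about predictive ability.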
10
Other statistical diagnostics:Coefficient of determination, R2
Fraction of dependent variable variance explained by a model (e.g. MLR model).
Closer to unity is better.
It is a measure of the quality of fit between model-predicted and experimental values, and does not reflect the predictive power, at all.
$$R^2 = 1 - \frac{\sum_{i=1}^{train} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{train} (y_i - \bar{y}_{train})^2}$$
$y_i$: actual (experimental) y; $\hat{y}_i$: estimated y; $\bar{y}$: average y
Model development
11
Many QSPR/QSAR practitioners find the data preparation and model generation steps sufficient to arrive at an acceptable model !!
They do not include model validation in model development.
Ex: Schultz, et al., Toxicity of Tetrahymena pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.
log(1/IGC50) = 0.54 logKw - 8.90 LUMO - 0.99
n = 11, r2 = 0.82, s = 0.28, r2cv = 0.64
n/#descr = 11/2 > 5
r2cv < r2fit: unstable model
Model development
12
Model development
Ex: Akers, et al., Struc.-tox. Relat. for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7, 33-39.
Claim: The goodness of fit is satisfactory for predictive purposes.
Ex: Benigni, et al., QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100, 3697-3714.
“..use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics” !!
13
Problem:
Sometimes a model that fits the training set very accurately performs poorly on validation sets !!
... so the model is not reliable !!
14
Model validation
Real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals.
Model validation:
Quantitative assessment of model robustness and its predictive power.
Definition of the application domain of the model in the space of the applied chemical descriptors.
15
Division into calibration and test sets
calD = [D(1:3:end,:); D(2:3:end,:)];   % samples 1 4 7 10 13 … and 2 5 8 11 14 …
valD = D(3:3:end,:);                   % samples 3 6 9 12 15 …
caly = [y(1:3:end,:); y(2:3:end,:)];
valy = y(3:3:end,:);
b = calD\caly;   % model development
[Diagram: calD and caly are used for model development; valD and valy for validation.]
There are many different methods for selecting the members of the training and test sets.
External validation
Model validation
16
>> calyEST=calD*b;
>> valyEST=valD*b; % model validation
[Figure: calyEST versus caly (R2 = 1) and testyEST versus testy; residuals versus sample index for the calibration set (scale ×10^-14) and for the test set (about -4 to 4).]
Not good prediction
Model validation
17
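>> calyEST = calD*b;   % root mean square error of calibration
>> rmsec = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)
>> valyEST = valD*b;   % root mean square error of validation (prediction)
>> rmsep = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)
RMSEC = 2.9396e-014
RMSEP = 2.2940
Not good prediction
$$RMSEC = \sqrt{\frac{\sum_{i=1}^{r_c} (y_i - \hat{y}_i)^2}{r_c}}$$
$$RMSEP = \sqrt{\frac{\sum_{j=1}^{r_t} (y_j - \hat{y}_j)^2}{r_t}}$$
$r_c$, $r_t$: numbers of calibration and test samples (calDr and valDr in the code).
[Figure: residuals versus sample index for the calibration set (scale ×10^-14) and for the test set (about -4 to 4).]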
Model validation
18
A model with high R2 could be a poor predictor:
- Variable multicollinearity,
- Statistically insignificant model descriptors,
- High-leverage points in the training set.
Model validation
A regression model with k descriptors and n training set compounds may be acceptable for validation only if:
- n > 4k
- For each of the k descriptors: pair-wise correlation coefficient < 0.9 and tolerance > 0.1.
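The three acceptance checks can be sketched in MATLAB as follows. This is an illustration only: the matrix X is made-up data, and tolerance is taken in its usual sense of 1 - R_j^2, the reciprocal of the variance inflation factor, which is an assumption about what the slide intends.

```matlab
% Illustrative data only (assumption): 16 samples, 3 descriptors.
t = (1:16)';
X = [t, sin(t), cos(2*t)];
[n, k] = size(X);

R = corrcoef(X);                       % k-by-k pairwise correlation matrix
offdiag = abs(R - diag(diag(R)));      % absolute correlations, diagonal zeroed
maxPairCorr = max(offdiag(:));         % largest pairwise |correlation|

% Tolerance of descriptor j: 1 - R_j^2 from regressing descriptor j on the
% others; equals 1/VIF_j, and VIF_j is the j-th diagonal element of inv(R).
tol = 1 ./ diag(inv(R));

sizeOK = n > 4*k;                      % n > 4k
corrOK = maxPairCorr < 0.9;            % pair-wise correlation < 0.9
tolOK  = all(tol > 0.1);               % tolerance > 0.1

if sizeOK && corrOK && tolOK
    disp('acceptable for validation');
else
    disp('reject');
end
```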
19
Validation strategies:
- Randomization of the modeled property (Y-scrambling).
- Internal validation: only the training set is used.
- External validation: division into training and test sets.
Model validation
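Y-scrambling is not spelled out further on the slides, so the following MATLAB lines are only a sketch of the usual idea: refit the model many times on randomly permuted activities and compare the resulting R2 values with the R2 of the true model. The data X and y, the helper fitR2 and the number of permutations nPerm are all hypothetical.

```matlab
% Illustrative data (assumption): 21 samples, 3 descriptors, y depends on X.
t = (1:21)';
X = [t, mod(t, 5), mod(t, 7)];
y = X * [0.5; -1; 0.3] + 0.1 * sin(t);

% R2 of an MLR fit without intercept, as in the slides (b = X\y)
fitR2 = @(X, y) 1 - sum((y - X*(X\y)).^2) / sum((y - mean(y)).^2);

R2true = fitR2(X, y);                  % model on the real activities

nPerm = 100;                           % number of scrambled models (assumption)
R2scr = zeros(nPerm, 1);
for k = 1:nPerm
    yScr = y(randperm(numel(y)));      % randomly permute the activities
    R2scr(k) = fitR2(X, yScr);         % refit on the scrambled response
end

fprintf('R2 (true y) = %.3f, mean R2 (scrambled y) = %.3f\n', ...
        R2true, mean(R2scr));
```

If the scrambled models reach R2 values close to that of the true model, the original fit is likely a chance correlation.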
20
Predictive power of QSAR models:
It is assessed from a sufficiently large external test set of compounds that were not used in model development.
Golbraikh, et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.
Zefirov, et al., QSAR for boiling points of “small” sulfides. Are the “high-quality structure-property-activity regressions” the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41, 1022-1027.
$$q^2_{ext} = 1 - \frac{\sum_{i=1}^{test} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{test} (y_i - \bar{y}_{train})^2}$$
Model validation
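As a small worked example of the R2 and external q2 formulas, here is a MATLAB sketch with made-up numbers. The variable names follow the slides' caly/valy convention, but the values are hypothetical; the test-set denominator uses the training-set mean, as in the q2ext formula.

```matlab
% Hypothetical observed and estimated values (illustration only)
caly = [1; 2; 3; 4];      calyEST = [1.1; 1.9; 3.2; 3.8];   % training set
valy = [2; 5];            valyEST = [2.5; 4.0];             % external test set

ybarTrain = mean(caly);   % training-set mean, used in both denominators

% Fit quality on the training set
R2    = 1 - sum((caly - calyEST).^2) / sum((caly - ybarTrain).^2);

% External predictive ability on the test set
q2ext = 1 - sum((valy - valyEST).^2) / sum((valy - ybarTrain).^2);

fprintf('R2 = %.4f, q2ext = %.4f\n', R2, q2ext);
```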
21
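$$R^2 = 1 - \frac{\sum_{i=1}^{training} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{training} (y_i - \bar{y})^2}$$
$$q^2 = 1 - \frac{\sum_{j=1}^{test} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{test} (y_j - \bar{y})^2}$$
[Figure: y versus calibration sample number (Train) and y versus test sample number (Test), illustrating the residual SS, i.e. the numerators.]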
Model validation
22
[Figure: y versus calibration sample number (Train) and y versus test sample number (Test), illustrating the total variance SS, i.e. the sum of squares about the mean in the denominators of R2 (training set) and q2 (test set).]
Model validation
23
[Figure: y versus calibration sample number (Train) and y versus test sample number (Test).]
R2 = 1 - (1.8×10^-26)/14.9 = 1.0000
q2 = 1 - 52.6/5.5 = -8.5220
Model validation
24
Internal validation:
Internal validation
Cross validation (CV) (applied to the training set)
- Leave-one-out (LOO) (common)
- Leave-many-out (LMO) (sometimes)
CV correlation coefficient, similar to R2!
$$q^2_{LOO} = 1 - \frac{\sum_{i=1}^{train} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{train} (y_i - \bar{y})^2}$$
25
Training set only
Cross validation: leave-one-out
Useful when only a small number of molecules is available.
Internal validation
26
Subsamples (copies from the training set)
Number of subsamples = number of molecules
Internal validation
27
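[Diagram: each subsample i is split into SubTrain_i and SubValid_i (i = 1 … 5), giving one squared prediction error per subsample.]
$(y_1 - \hat{y}_1)^2,\ (y_2 - \hat{y}_2)^2,\ (y_3 - \hat{y}_3)^2,\ (y_4 - \hat{y}_4)^2,\ (y_5 - \hat{y}_5)^2$
$$cumPRESS = \sum_i (y_i - \hat{y}_i)^2$$
Number of subsamples = number of molecules in the training set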
Internal validation
28
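Dr = size(X,1);                          % number of training samples (rows of X)
valyEST = zeros(1,Dr); press = zeros(1,Dr);
for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)];    % training rows without sample i
    valX = X(i,:);                       % left-out sample
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly;                       % MLR model without sample i
    valyEST(i) = valX*b;                 % prediction for the left-out sample
    press(i) = (valyEST(i)-valy)^2;      % squared prediction error
end
cumpress = sum(press);
rmsecv = sqrt(cumpress/Dr);
q2LOO = 1-((y-valyEST')'*(y-valyEST'))/((y-mean(y))'*(y-mean(y)))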
LOO CV Internal validation
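The leave-many-out (LMO) variant mentioned earlier differs from LOO only in that a group of samples is left out at each step. A minimal MATLAB sketch under stated assumptions: the data X and y are made up, and the systematic assignment of samples to nGroups groups is one possible choice, not necessarily the one used in the workshop.

```matlab
% Illustrative data (assumption): 12 samples, 3 descriptors.
t = (1:12)';
X = [t, sin(t), cos(2*t)];
y = X * [1; 0.5; -0.5] + 0.05 * cos(5*t);

n = size(X, 1);
nGroups = 4;                            % number of left-out groups (assumption)
grp = mod((0:n-1)', nGroups) + 1;       % systematic group labels 1 2 3 4 1 2 ...

yCV = zeros(n, 1);                      % cross-validated predictions
for g = 1:nGroups
    out = (grp == g);                   % samples left out in this round
    b = X(~out, :) \ y(~out);           % MLR model on the remaining samples
    yCV(out) = X(out, :) * b;           % predict the left-out group
end

press  = sum((y - yCV).^2);             % cumulative PRESS
rmsecv = sqrt(press / n);
q2LMO  = 1 - press / sum((y - mean(y)).^2);
fprintf('q2LMO = %.4f, RMSECV = %.4f\n', q2LMO, rmsecv);
```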
29
[Figure: LOO CV, y versus training sample number.]
q2LOO = -4.8574
RMSECV = 2.0397
>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2
>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end
q2ASYMPTOT = 1.0000
REJECT
Internal validation
q2LOO and R2 should not be considerably different.
30
Many authors consider q2LOO > 0.5 as an indicator of the high predictive power of a model, and do not evaluate the model on an external test set or use a test set of only one or two compounds.
Ex: Cronin, et al., The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13, 167-176.
Ex: Moss, et al., Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16, 299-317.
Ex: Suzuki, et al., Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41, 718-726.
Internal validation
31
A small value of q2LOO or q2LMO indicates low prediction ability,
but the opposite is not necessarily true (a high q2LOO is necessary but not sufficient).
It indicates robustness, but not the prediction ability of the model.
Internal validation
32
It has been shown that there exists no correlation between the LOO cross-validation q2LOO and the correlation coefficient R2 between the predicted and observed activities for an external test set.
Kubinyi, et al., Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41, 2553-2564.
Golbraikh, et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.
A high q2LOO is a necessary condition for a model to have high predictive power, but not a sufficient condition.
Internal validation
33
QUIK
R. Todeschini, et al., Detecting bad regression models: Multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515, 199-208.
For illustration of correlation (collinearity) among independent variables.
Based on the multivariate correlation index K.
QUIK
34
4 correlated descriptors
M =
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
y =
10
20
30
40
50
>> corr(M)
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
>> p = size(M,2);
>> CorrEV = svds(corr(M),p);
[Figure: eigenvalue versus factor number (1 to 4).]
It seems possible to use svd(M)
QUIK
35
>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);
>> [KM] = QUIK(M)   % QUIK function
KM = 1.0000   (maximum correlation between descriptors)
>> [KMY] = QUIK([M Y])   % in the presence of the dependent variable
KMY = 1.0000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
REJECT
QUIK
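The body of the QUIK function is not shown on the slides. A minimal sketch of what it might compute, reusing the K formula and the 4-descriptor example above, could look as follows. The helper name quikK is hypothetical, and eig(corrcoef(M)) is used in place of svds(corr(M),p) so that only base MATLAB functions are needed; for a symmetric positive semidefinite correlation matrix the eigenvalues and singular values coincide.

```matlab
% K index from the eigenvalues ev of a p-by-p correlation matrix
quikK = @(ev, p) sum(abs(ev/sum(ev) - 1/p)) / (2*(p-1)/p);

M = [1 1 1 2; 2 2 2 4; 3 3 3 6; 4 4 4 8; 5 5 5 10];   % 4 correlated descriptors
y = [10; 20; 30; 40; 50];

KM  = quikK(eig(corrcoef(M)),     size(M, 2));        % descriptors only
KMY = quikK(eig(corrcoef([M y])), size(M, 2) + 1);    % descriptors plus response

% QUIK rule with the threshold used on the slide
if KMY - KM < 0.05
    disp('reject');
else
    disp('NOT reject');
end

% Uncorrelated descriptors give K = 0
Morth = [1 1; 1 -1; -1 1; -1 -1];
Korth = quikK(eig(corrcoef(Morth)), size(Morth, 2));
```

For fully collinear columns one eigenvalue carries all the variance and K reaches its maximum of 1; for uncorrelated columns all eigenvalues are equal and K is 0.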
36
>> M = rand(4,5)
M =
.79 .17 .87 .89 .28
.96 .98 .74 .20 .47
.52 .27 .14 .30 .06
.88 .25 .01 .66 .99
>> corr(M)
1       .5468   .3863   .1101   .6879
.5468   1       .3623   -.7227  .0419
.3863   .3623   1       .1784   -.3545
.1101   -.7227  .1784   1       .2450
.6879   .0419   -.3545  .2450   1
y =
1
2
3
4
QUIK
37
>> [KM] = QUIK(M)
KM = 0.5000
>> [KMY] = QUIK([M Y])
KMY = 0.6000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
NOT REJECTED
[Figure: eigenvalue versus factor number, two panels.]
QUIK
38
>> [KM] = QUIK(calD)   % Selwood data, all descriptors
KM = 0.7919
>> [KMY] = QUIK([calD Y])
KMY = 0.7923
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
REJECTED
[Figure: eigenvalue versus factor number (0 to 50), two panels.]
QUIK
39
Development of an MLR model using all descriptors is not acceptable.
The model can be improved by using a factor-based method,
… and by descriptor selection.
40
Development of an MLR model using a number of descriptors.
>> D = Dini(:,[51 37 35 38 39 36 15]);
RMSEC = 0.4989
RMSEP = 0.4993
Comparable (RMSEC and RMSEP); improved
[Figure: calyEST versus caly and testyEST versus testy; residuals versus sample index for the calibration and test sets.]
A number of descriptors
41
R2 = 0.6495
q2 = 0.5490
Comparable (R2 and q2); improved
q2LOO = 0.2816
NOT REJECTED
[Figure: y versus calibration sample number; y versus test sample number; LOO CV, y versus training sample number.]
A number of descriptors
D=Dini(:,[51 37 35 38 39 36 15]);
42
QUIK
KX = 0.6384
KXY = 0.5996
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
REJECTED
[Figure: eigenvalue versus factor number, for 7 factors and for 8 factors.]
A number of descriptors
D=Dini(:,[51 1 38]);
43
QUIK
KX = 0.3159
KXY = 0.3953
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
NOT REJECTED
A number of descriptors
[Figure: eigenvalue versus factor number, for 3 factors and for 4 factors.]
44
Using a proper set of descriptors, improved results can be obtained from MLR.
But how can the proper set of descriptors be selected?
45
Descriptor selection:
- Forward selection
- Backward elimination
- Genetic algorithm
- Kohonen map
- SPA
- CWSPA
Descriptor Selection
Kohonen map
Selwood data matrix, 53 × 31
Rows (descriptors) as input for the Kohonen map:
1. Sampling from all regions of the descriptor space
2. Sampling from regions in which the descriptors have high correlation with Y (activity)
By: Mehdi Vasighi
47
Descriptor Selection
Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12.
Successive projections algorithm (SPA)
SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached. To minimize the collinearity between the selected descriptors, the criterion for the stepwise selection of a variable is its orthogonality to the previously selected variables.
Araujo, et al., The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65-73.
Important parameters:
1- Starting vector
2- N, maximum number of descriptors
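The selection loop described above can be sketched in MATLAB as follows. This is a simplified reading of SPA, not the authors' implementation: the small matrix X is made up, k0 (starting vector) and N (number of descriptors) are the two parameters listed above, and at each step the column with the largest norm after projection is taken as the most orthogonal candidate.

```matlab
% Illustrative descriptor matrix (assumption): column 2 is nearly collinear
% with column 1, columns 3 and 4 are orthogonal to it.
X = [1 1   0 0;
     0 0.1 1 0;
     0 0   0 2;
     0 0   0 0];
k0 = 1;                 % starting vector (column index)
N  = 3;                 % number of descriptors to select

Xp  = X;                % working copy; its columns are projected step by step
sel = zeros(1, N);
sel(1) = k0;
for it = 2:N
    v  = Xp(:, sel(it-1));                   % last selected (projected) column
    Xp = Xp - v * (v' * Xp) / (v' * v);      % remove the component along v
    nrm = sqrt(sum(Xp.^2, 1));               % norms of the projected columns
    nrm(sel(1:it-1)) = -Inf;                 % never reselect a column
    [~, sel(it)] = max(nrm);                 % most orthogonal remaining column
end
disp(sel)
```

In this toy example the nearly collinear column 2 is passed over and the selection is columns 1, 4 and 3.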
Descriptor Selection
Correlation weighted SPA
A limitation of SPA is that the only criterion for the stepwise selection of variables is their orthogonality to the previously selected variables; the relation of the entered vector, as an independent variable, to the response is not considered.
Incorporating into the SPA procedure a form of correlation ranking, by which the variables are weighted by their correlation coefficient with the dependent variable, overcomes this limitation.
Descriptor Selection
M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J. Chemom. (2007) 21, 239-250.
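The slides do not give the exact weighting used in the correlation weighted variant, so the MATLAB sketch below shows only one possible reading: the norm of each projected column is multiplied by the absolute correlation of the original descriptor with y before the maximum is taken. The data, the weight vector w and this particular scoring rule are assumptions for illustration.

```matlab
% Illustrative data (assumption): 8 samples, 4 descriptors, response y.
t = (1:8)';
X = [t, t + 0.1*sin(t), cos(t), sin(3*t)];
y = 2*cos(t) + 0.1*t;
k0 = 1;  N = 3;                         % starting vector and number to select

% Weight of each descriptor: |correlation coefficient with y|
p = size(X, 2);
w = zeros(1, p);
for j = 1:p
    c = corrcoef(X(:, j), y);           % 2-by-2 correlation matrix
    w(j) = abs(c(1, 2));
end

Xp  = X;
sel = zeros(1, N);
sel(1) = k0;
for it = 2:N
    v  = Xp(:, sel(it-1));                    % last selected (projected) column
    Xp = Xp - v * (v' * Xp) / (v' * v);       % project out its direction
    score = sqrt(sum(Xp.^2, 1)) .* w;         % orthogonality weighted by |r with y|
    score(sel(1:it-1)) = -Inf;                % never reselect a column
    [~, sel(it)] = max(score);
end
disp(sel)
```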