
1

QSAR/QSPR Model Development and Validation for Successful Prediction and Interpretation

8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009

Mohsen Kompany-Zareh

In the name of GOD

Contents:

2

- Introduction
- Selwood data set (all descriptors)
- Model development
- Model validation
- Statistical diagnostics (R2, q2, RMSEC, RMSEP, RMSECV)
- Internal validation
- QUIK
- Selwood data (a number of descriptors)
- Descriptor selection
- LMO and Jackknife
- Cross model validation
- Bootstrapping
- Training and test set selection
- Leverage

3

QSPR/QSAR (quantitative structure-property/activity relationship)

A mathematical relation between structural attribute(s) and a property (or an activity) of a set of chemicals.

Applications:
- Prediction of a property for a variety of chemicals, prior to expensive synthesis and experimental measurement.
- Determination of the environmental risk of thousands of untested industrial chemicals.
- Description of a mechanism of action for a variety of chemicals.

Introduction

[Figure: a QSAR model relates the descriptor matrix X (rows: molec. 1 to molec. 6; descriptors: Lipoph., LUMO, MW, Surf. Area) to the activity vector y; "??" marks activities that are to be predicted.]

X:
1.885  120.93  476.92  122.04
2.913  108.77  508.56  150.17
3.312  122.85  554.01  164.08
3.711  123.92  571.26  178.10
2.696  120.49  505.61  156.01
3.106  119.98  518099  247.93

y (activities):
2.924  1.992  1.987  1.544  2.079  1.530

Introduction

5

Data preparation:

1. Collection and cleaning of target property data; selection of accurate, precise and consistent experimental data.

2. Calculation of molecular descriptors for chemicals with acceptable target properties (after optimization of the conformations); more than 3000 descriptors.

Descriptor calculation software:
- DRAGON (Todeschini et al., 2001)
- ADAPT (Jurs 2002; Stuper and Jurs 1976)
- OASIS (Mekenyan and Bonchev 1986)
- CODESSA (Katritzky et al., 1994)
- Gaussian
- …

6

A unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes.

Starting from a very large number of descriptors from different software packages, as few explanatory descriptors as possible are kept for simple interpretation of the model (sometimes by variable selection).

[Diagram: Structure, Descriptors, Model, Activity]

Descriptors:
- Topologic (edges and vertices)
- Geometric (surface, volume, …)
- Electronic (electron density, local charges)
- Constitutional (#C, #OH, …)

Introduction

7

Selwood data: D (31x53), y (31x1)

>> load selwood.txt;
>> D = selwood(:,1:end-1);
>> y = selwood(:,end);

31 molecules, 53 descriptors

31 antifilarial antimycin analogues characterized by 53 physicochemical descriptors.

Selwood, et al., J Med Chem (1990) 33, 136.

Data set

8

Model generation:
- Independent variables: descriptors
- Dependent variables: properties (activities)

Model development methods:
- Multiple linear regression (MLR)
- Partial least squares (PLS)
- Artificial neural networks (ANNs)
- k-nearest neighbor

Model development

#samples<#descr.s !!

9

Multiple linear regression, the simplest model:

$$D\,b = y, \qquad b = D^{+} y$$

>> b = D\y;
>> yEST = D*b;

[Figure: regression coefficients b versus descriptor index; 22 of the 53 coefficients are zero!!]

[Figure: yEST versus y.]

Model is developed.

Application of the model?

Validation?

Model development

R2=1
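A minimal sketch of why R2 = 1 is obtained here, assuming random synthetic data of the same size as the Selwood matrix (selwood.txt itself is not reproduced, so `D` and `y` below are stand-ins). With more descriptors than molecules, `D\y` fits even pure noise exactly, and the backslash operator returns a basic solution with at most rank(D) nonzero coefficients, which is consistent with 22 of the 53 coefficients being zero.

```matlab
% Synthetic stand-in for the Selwood data (assumption): 31 molecules, 53 descriptors.
rng(1);                          % fixed seed for reproducibility
D = rand(31,53);                 % random "descriptors"
y = rand(31,1);                  % random "activities", unrelated to D

b    = D\y;                      % basic least-squares solution of the underdetermined system
yEST = D*b;                      % fitted activities

R2       = 1 - sum((y-yEST).^2)/sum((y-mean(y)).^2);
nNonzero = nnz(b);               % at most rank(D) = 31 coefficients are nonzero

fprintf('R2 = %.4f, nonzero coefficients = %d of %d\n', R2, nNonzero, numel(b));
```

A perfect fit to random activities shows that R2 alone says nothing about the predictive power of such a model.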

10

Other statistical diagnostics:

Coefficient of determination, R2

Fraction of the dependent variable variance explained by a model (e.g. an MLR model).

Closer to unity is better.

It is a measure of the quality of fit between model-predicted and experimental values, and does not reflect the predictive power at all.

$$R^2 = 1 - \frac{\sum_{i=1}^{train} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{train} (y_i - \bar{y}_{train})^2}$$

$y_i$: actual (experimental) y; $\hat{y}_i$: estimated y; $\bar{y}$: average y.

Model development

11

Many QSPR/QSAR practitioners find the data preparation and model generation steps sufficient to arrive at an acceptable model!!

They do not include model validation in model development.

Ex: Schultz, et al., Toxicity of Tetrahymena pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.

log(1/IGC50) = 0.54 logKw – 8.90 LUMO – 0.99
n = 11, r2 = 0.82, s = 0.28, r2cv = 0.64

n/#descr = 11/2 > 5; r2cv < r2fit: unstable model

Model development

12

Model development

Ex: Akers, et al., Structure-toxicity relationships for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7, 33-39.

Claim: The goodness of fit is satisfactory for predictive purposes.

Ex: Benigni, et al., QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100, 3697-3714.

“..use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics” !!

13

Problem:

Sometimes a model that fits the training set closely and accurately does not predict the validation set well!!

...so the model is not reliable!!

14

Model validation

The real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals.

Model validation:

Quantitative assessment of model robustness and its predictive power.

Definition of the application domain of the model in the space of the applied chemical descriptors.

15

External validation

Division into calibration and test sets:

calD = [D(1:3:end,:); D(2:3:end,:)];
valD = D(3:3:end,:);
caly = [y(1:3:end,:); y(2:3:end,:)];
valy = y(3:3:end,:);

b = calD\caly; % model development

Calibration set: molecules 1, 4, 7, 10, 13, … and 2, 5, 8, 11, 14, …
Validation set: molecules 3, 6, 9, 12, 15, …

[Diagram: calD and caly are used for model development; valD and valy are used for validation.]

There are many different methods for selecting the members of the training and test sets.

16

>> calyEST = calD*b;
>> valyEST = valD*b; % model validation

[Figure: calyEST versus caly, and testyEST versus testy: not good prediction.]

[Figure: residuals versus sample number; calibration residuals (calDr samples) are of the order of 10^-14, test residuals (testDr samples) range from about -4 to 4.]

Model validation

R2=1

17

>> calyEST = calD*b;
>> rmsec = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr) % root mean square error of calibration

>> valyEST = valD*b;
>> rmsep = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr) % root mean square error of prediction (validation)

RMSEC = 2.9396e-014

RMSEP = 2.2940

Not good prediction.

$$RMSEC = \sqrt{\frac{\sum_{i=1}^{r_c} (y_i - \hat{y}_i)^2}{r_c}}$$

$$RMSEP = \sqrt{\frac{\sum_{j=1}^{r_t} (y_j - \hat{y}_j)^2}{r_t}}$$

$r_c$ (calDr) and $r_t$ (valDr): number of samples in the calibration and test sets.

Model validation

18

A model with high R2 could be a poor predictor because of:

- Variable multicollinearity,
- Statistically insignificant model descriptors,
- High leverage points in the training set.

Model validation

A regression model with k descriptors and n training set compounds may be acceptable for validation only if:

- n > 4k
- For any of the k descriptors: pairwise correlation coefficient < 0.9 and tolerance > 0.1.
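The criteria above can be checked numerically. The following is a sketch under stated assumptions: the descriptor matrix `X` is synthetic, the pairwise criterion is applied to the absolute correlation coefficient, and the tolerance of descriptor j is taken as 1 − Rj², where Rj² is obtained by regressing descriptor j on the remaining descriptors (equivalently, the reciprocal of the j-th diagonal element of the inverse correlation matrix).

```matlab
% Hypothetical training descriptor matrix (assumption): n = 20 compounds, k = 3 descriptors.
rng(2);
X = rand(20,3);
X(:,3) = 0.5*X(:,1) + 0.5*rand(20,1);    % descriptor 3 is partly correlated with descriptor 1

[n,k] = size(X);
C = corrcoef(X);                          % k-by-k matrix of pairwise correlation coefficients

sizeOK  = n > 4*k;                        % criterion n > 4k
offDiag = abs(C - eye(k));                % absolute pairwise correlations, diagonal set to zero
pairOK  = max(offDiag(:)) < 0.9;          % every pairwise |r| below 0.9

tolerance = 1./diag(inv(C));              % tolerance = 1 - Rj^2 = 1/VIF for each descriptor
tolOK     = all(tolerance > 0.1);

acceptable = sizeOK && pairOK && tolOK;   % all three criteria satisfied
fprintf('n > 4k: %d, pairwise: %d, tolerance: %d\n', sizeOK, pairOK, tolOK);
```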

19

Validation strategies:

- Randomization of the modeled property (Y-scrambling).
- Internal validation (training set only).
- External validation (division into training and test sets).
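Y-scrambling can be sketched as follows, assuming a small synthetic training set (`X`, `y`, the number of permutations `nPerm` and the helper `r2` are illustrative names, not from the slides). The responses are randomly permuted while the descriptors are kept unchanged; a meaningful model should lose almost all of its R2 on the scrambled responses.

```matlab
% Synthetic training set (assumption): 30 compounds, 3 descriptors, linear response plus noise.
rng(3);
X = [ones(30,1) rand(30,3)];              % column of ones for the intercept
y = X*[1; 2; -1; 0.5] + 0.05*randn(30,1);

% R2 of an MLR model fitted to (X, y)
r2 = @(X,y) 1 - sum((y - X*(X\y)).^2)/sum((y - mean(y)).^2);

R2true = r2(X,y);                         % fit with the real responses

nPerm = 100;
R2scr = zeros(nPerm,1);
for p = 1:nPerm
    yScr = y(randperm(numel(y)));         % scramble the responses, keep X unchanged
    R2scr(p) = r2(X,yScr);
end

fprintf('R2 (true y) = %.3f, mean R2 (scrambled y) = %.3f\n', R2true, mean(R2scr));
```

If the scrambled models reach R2 values close to that of the real model, the original fit is likely due to chance correlation.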

Model validation

20

Predictive power of QSAR models:

The predictive power should be estimated from a sufficiently large external test set of compounds that were not used in the model development.

Golbraikh, et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.

Zefirov, et al., QSAR for boiling points of “small” sulfides. Are the “high-quality structure-property-activity regressions” the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41, 1022-1027.

$$q^2_{ext} = 1 - \frac{\sum_{i=1}^{test} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{test} (y_i - \bar{y}_{train})^2}$$

Model validation

21

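[Figure: y versus calibration sample number (Train) and versus test sample number (Test), highlighting the residual sum of squares (residual SS), the numerators of the expressions below.]

Train:

$$R^2 = 1 - \frac{\sum_{i=1}^{training} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{training} (y_i - \bar{y})^2}$$

Test:

$$q^2 = 1 - \frac{\sum_{j=1}^{test} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{test} (y_j - \bar{y})^2}$$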

Model validation

22

[Figure: the same Train and Test plots of y versus sample number, highlighting the total variance sum of squares (Tot variance SS), the denominators of the R2 and q2 expressions.]

Model validation

23

[Figure: y versus calibration sample number (Train) and versus test sample number (Test).]

R2 = 1.0000

q2 = -8.5220

$$R^2 = 1 - \frac{1.8 \times 10^{-26}}{14.9}$$

$$q^2 = 1 - \frac{52.6}{5.5}$$
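The contrast between a perfect R2 and a poor external q2 can be reproduced with a short sketch. It assumes random synthetic data in place of selwood.txt, uses the same every-third-molecule split as the slides, and places the training mean in the denominator of the external q2; all numerical results therefore differ from the values reported above.

```matlab
% Synthetic stand-in for the Selwood data (assumption): 31 molecules, 53 random descriptors.
rng(4);
D = rand(31,53);
y = rand(31,1);

% Every third molecule goes to the test set, the rest to the calibration set.
calD = [D(1:3:end,:); D(2:3:end,:)];  caly = [y(1:3:end,:); y(2:3:end,:)];
valD = D(3:3:end,:);                  valy = y(3:3:end,:);

b       = calD\caly;                  % model development on the calibration set
calyEST = calD*b;                     % fitted calibration responses
valyEST = valD*b;                     % predictions for the external test set

rmsec = sqrt(mean((caly-calyEST).^2));
rmsep = sqrt(mean((valy-valyEST).^2));
R2    = 1 - sum((caly-calyEST).^2)/sum((caly-mean(caly)).^2);
q2ext = 1 - sum((valy-valyEST).^2)/sum((valy-mean(caly)).^2);   % training mean in the denominator

fprintf('RMSEC = %.3g, RMSEP = %.3g, R2 = %.4f, q2ext = %.4f\n', rmsec, rmsep, R2, q2ext);
```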

Model validation

24

Internal validation:

Cross validation (CV), applied to the training set:
- Leave-one-out (LOO) (common)
- Leave-many-out (LMO) (sometimes)

CV correlation coefficient (similar to R2!):

$$q^2_{LOO} = 1 - \frac{\sum_{i=1}^{train} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{train} (y_i - \bar{y})^2}$$

25

Cross validation, leave-one-out

Uses the training set only.

Useful when only a small number of molecules is available.

26

Subsamples (copies from the training set)

# subsamples = # molecules

Internal validation

27

[Diagram: for each subsample i (i = 1, …, 5 in the illustration), a model is built on SubTrain i and used to predict SubValid i, giving the squared prediction error $(y_i - \hat{y}_i)^2$.]

$$cumPRESS = \sum_{i} (y_i - \hat{y}_i)^2$$

# subsamples = # molecules in the training set

Internal validation

28

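Dr = size(X,1);                          % number of molecules in the training set
valyEST = zeros(Dr,1);
press = zeros(Dr,1);
for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)];    % leave molecule i out
    valX = X(i,:);
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly;                       % model from the remaining molecules
    valyEST(i) = valX*b;                 % prediction for the left-out molecule
    press(i) = (valyEST(i)-valy)^2;
end
cumpress = sum(press);
rmsecv = sqrt(cumpress/Dr);
q2LOO = 1-((y-valyEST)'*(y-valyEST))/((y-mean(y))'*(y-mean(y)))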

LOO CV Internal validation
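The leave-one-out loop generalizes to leave-many-out (LMO) by removing a group of molecules in each round. The sketch below assumes a synthetic training set and a group size of four molecules (`nFolds`, `fold` and `keep` are illustrative names); it is one possible way of forming the groups, not necessarily the scheme used in the workshop.

```matlab
% Synthetic training set (assumption): 24 molecules, 3 descriptors plus intercept.
rng(5);
X = [ones(24,1) rand(24,3)];
y = X*[0.5; 1; -2; 1.5] + 0.1*randn(24,1);

Dr     = size(X,1);
nFolds = 6;                               % 6 groups of 4 molecules (assumed group size)
idx    = randperm(Dr);                    % random assignment of molecules to groups
fold   = reshape(idx, nFolds, []);        % each row holds the molecules of one group

valyEST = zeros(Dr,1);
for f = 1:nFolds
    out  = fold(f,:);                     % molecules left out in this round
    keep = setdiff(1:Dr, out);            % molecules used to build the sub-model
    b    = X(keep,:)\y(keep);
    valyEST(out) = X(out,:)*b;            % predict the left-out molecules
end

q2LMO  = 1 - sum((y-valyEST).^2)/sum((y-mean(y)).^2);
rmsecv = sqrt(mean((y-valyEST).^2));
fprintf('q2LMO = %.3f, RMSECV = %.3f\n', q2LMO, rmsecv);
```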

29

[Figure: LOO CV, y versus training sample number.]

q2LOO = -4.8574

RMSECV = 2.0397

>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2

q2ASYMPTOT = 1.0000

>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end

REJECT

Internal validation

q2LOO and R2 should not be considerably different .

30

Many authors consider q2LOO > 0.5 as an indicator of the high predictive power of a model and do not evaluate the model on an external test set, or use a test set of only one or two compounds.

Ex: Cronin, et al., The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13, 167-176.

Ex: Moss, et al., Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16, 299-317.

Ex: Suzuki, et al., Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41, 718-726.

Internal validation

31

A small value of q2LOO or q2LMO indicates low prediction ability,

but the opposite is not necessarily true (a high q2LOO is necessary but not sufficient).

A high q2LOO indicates robustness, but not the prediction ability of the model.

Internal validation

32

It has been shown that there exists no correlation between the LOO cross-validated q2LOO and the correlation coefficient R2 between the predicted and observed activities for an external test set.

Kubinyi, et al., Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41, 2553-2564.

Golbraikh, et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.

A high q2LOO is a necessary condition for a model to have high predictive power, but not a sufficient condition.

Internal validation

33

QUIK

R. Todeschini, et al., Detecting bad regression models: Multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515, 199-208.

For illustration of correlation (collinearity) among independent variables.

Based on the multivariate correlation index K.

QUIK

34

4 correlated descriptors:

M =
1  1  1  2
2  2  2  4
3  3  3  6
4  4  4  8
5  5  5  10

y =
10
20
30
40
50

>> corr(M)
1  1  1  1
1  1  1  1
1  1  1  1
1  1  1  1

>> p = size(M,2);
>> CorrEV = svds(corr(M),p);

[Figure: eigenvalue versus factor number for corr(M).]

It seems possible to use svd(M).

QUIK

35

>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);

>> [KM] = QUIK(M) % QUIK function

KM = 1.0000 (maximum correlation between descriptors)

>> [KMY] = QUIK([M Y]) % in the presence of the dependent variable

KMY = 1.0000

>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end

REJECT
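The body of the QUIK function is not shown in the slides; the following is a minimal sketch of the K index assembled from the two commands given above. It assumes base MATLAB `corrcoef` and `svd` in place of `corr` and `svds`, and the names `normEV`, `quikK` and the test matrix `O` are hypothetical.

```matlab
% K multivariate correlation index from the eigenvalues of the correlation matrix.
% For a symmetric positive semidefinite matrix the singular values equal the eigenvalues.
normEV = @(M) svd(corrcoef(M))/sum(svd(corrcoef(M)));          % eigenvalues scaled to sum to 1
quikK  = @(M) sum(abs(normEV(M) - 1/size(M,2))) / (2*(size(M,2)-1)/size(M,2));

M = [1 1 1 2; 2 2 2 4; 3 3 3 6; 4 4 4 8; 5 5 5 10];            % four perfectly correlated columns
O = [1 1; 1 -1; -1 1; -1 -1];                                  % two uncorrelated columns

KM = quikK(M);     % maximum correlation between the columns
KO = quikK(O);     % no correlation between the columns
fprintf('K(M) = %.4f, K(O) = %.4f\n', KM, KO);
```

K runs from 0 (all eigenvalues equal, no correlation) to 1 (a single nonzero eigenvalue, complete correlation), which is why the difference between K with and without the dependent variable can be used as a rejection criterion.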

QUIK

36

>> M = rand(4,5)
M =
.79  .17  .87  .89  .28
.96  .98  .74  .20  .47
.52  .27  .14  .30  .06
.88  .25  .01  .66  .99

y =
1
2
3
4

>> corr(M)
1       .5468   .3863   .1101   .6879
.5468   1       .3623   -.7227  .0419
.3863   .3623   1       .1784   -.3545
.1101   -.7227  .1784   1       .2450
.6879   .0419   -.3545  .2450   1

QUIK

37

>> [KM] = QUIK(M)

KM = 0.5000

>> [KMY] = QUIK([M Y])

KMY = 0.6000

>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end

NOT REJECTED

[Figure: two plots of eigenvalue versus factor number.]

QUIK

38

>> [KM] = QUIK(calD) % Selwood data, all descriptors

KM = 0.7919

>> [KMY] = QUIK([calD Y])

KMY = 0.7923

>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end

REJECTED

[Figure: two plots of eigenvalue versus factor number.]

QUIK

39

Development of MLR model using all descriptors is not acceptable.

Model can be improved, using a factor based method,

…and by descriptor selection.

40

>> D=Dini(:,[51 37 35 38 39 36 15]);

Development of MLR model using a number of descriptors.

RMSEC = 0.4989

RMSEP = 0.4993

RMSEC and RMSEP are comparable; prediction is improved.

[Figure: calyEST versus caly and testyEST versus testy, with the calibration and test residuals versus sample number.]

A number of descriptors

41

[Figure: y versus calibration sample number and versus test sample number.]

R2 = 0.6495

q2 = 0.5490

R2 and q2 are comparable; the model is improved.

[Figure: LOO CV, y versus training sample number.]

q2LOO = 0.2816

NOT REJECTED


D=Dini(:,[51 37 35 38 39 36 15]);

42

QUIK

[Figure: eigenvalue versus factor number, one plot with 7 factors and one with 8 factors.]

KX = 0.6384

KXY = 0.5996

>> if KXY-KX<0.03, disp('reject'), else, disp('NOT reject'), end

REJECTED


D=Dini(:,[51 1 38]);

43

QUIK

KX = 0.3159

KXY = 0.3953

>> if KXY-KX<0.03, disp('reject'), else, disp('NOT reject'), end

NOT REJECTED

[Figure: eigenvalue versus factor number, one plot with 3 factors and one with 4 factors.]

44

Using proper set of descriptors, improved results from MLR can be obtained.

But how can the proper set of descriptors be selected?

45

Descriptor selection:

- Forward selection
- Backward elimination
- Genetic algorithm
- Kohonen map
- SPA
- CWSPA

Descriptor Selection


Kohonen map

Selwood data matrix, 53 × 31: rows (descriptors) as input for the Kohonen map:

1. Sampling from all regions of the descriptor space
2. Sampling from regions in which the descriptors have high correlation with Y (activity)

By: Mehdi Vasighi

47


Successive projections algorithm (SPA)

SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached. To minimize the collinearity between the selected descriptors, the criterion for the stepwise selection of variables in SPA is their orthogonality to the previously selected variable.

Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12.

Araujo, et al., The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65–73.

Important parameters:

1- Starting vector

2- N, maximum number of descriptors
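The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the small matrix `X` is made up, and at each step all columns are projected onto the subspace orthogonal to the last selected column, after which the column with the largest remaining norm is taken.

```matlab
% Small illustrative matrix (assumption): columns 1 and 2 are identical,
% columns 3 and 4 are orthogonal to them and to each other.
X  = [1 1 0 0;
      0 0 2 0;
      0 0 0 3];
k0 = 1;                                   % starting vector (column index)
N  = 3;                                   % number of descriptors to select

P   = X;                                  % working copy that is projected at every step
sel = zeros(1,N);
sel(1) = k0;
for m = 2:N
    v = P(:,sel(m-1));                    % last selected column (already projected)
    P = P - v*(v'*P)/(v'*v);              % remove the component along v from every column
    normsq = sum(P.^2,1);                 % squared norm of each projected column
    normsq(sel(1:m-1)) = -Inf;            % never reselect a column
    [~,sel(m)] = max(normsq);             % largest projection = least collinear candidate
end
disp(sel)
```

In this example the duplicate column 2 is never chosen, because its projection onto the subspace orthogonal to column 1 is zero; the result depends on both the starting vector and N, the two parameters listed above.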


Correlation weighted SPA (CWSPA)

A limitation of SPA is that the only criterion for the stepwise selection of variables is their orthogonality to the previously selected variable; the relation of the entered vector, as an independent variable, to the response is not considered.

Incorporating into the SPA procedure a form of correlation ranking, in which the variables are weighted by their correlation coefficient with the dependent variable, overcomes this limitation.

M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J. Chemom. (2007) 21, 239-250.
