Chapter 4: Partial Least Squares (PLS) and Cross Validation
Pavithra K.B


Notes for 3D QSAR


Page 1: PLS and Cross Validation

Chapter 4: Partial Least Squares (PLS) and Cross Validation

Pavithra K.B

Page 2: PLS and Cross Validation

Introduction

• Partial Least Squares (PLS) was developed by Herman and Svante Wold.

• PLS predicts differences in the values of dependent variables (target properties) from the explanatory variables (descriptors).

• With multiple dependent variables, a QSAR equation is made for each target property, but the coefficients are interrelated.

• PLS is an extension of the more familiar technique known as multiple regression (MR).

Page 3: PLS and Cross Validation

• The overall goal is to use the predictors to predict the responses.

• This is achieved by extracting latent variables T and U from the sampled factors and responses, respectively.

• The extracted factors T (the X scores) are used to predict the Y scores U, and the predicted Y scores are then used to construct predictions for the responses.

Page 4: PLS and Cross Validation

• The outcome: knowledge about the explanatory properties is used to reduce uncertainty in the target properties.

Computationally, two algorithms are used:
• NIPALS algorithm
• SIMPLS algorithm
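As an illustrative sketch only (the function name, matrix shapes, and tolerances below are my assumptions, not taken from these notes), the extraction of one pair of latent variables T and U by the NIPALS algorithm can be written in Python with NumPy:

```python
import numpy as np

def nipals_component(X, Y, tol=1e-10, max_iter=500):
    """Extract one pair of PLS latent variables (t, u) by NIPALS.
    X (n x p descriptors) and Y (n x m targets) are assumed mean-centered."""
    u = Y[:, [0]]                       # initial guess for the Y scores
    for _ in range(max_iter):
        w = X.T @ u / (u.T @ u)         # X weights
        w /= np.linalg.norm(w)
        t = X @ w                       # X scores (a column of T)
        q = Y.T @ t / (t.T @ t)         # Y loadings
        u_new = Y @ q / (q.T @ q)       # Y scores (a column of U)
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    p = X.T @ t / (t.T @ t)             # X loadings, used to deflate X
    return t, u, w, p, q
```

Further components are obtained by deflating X (and optionally Y) with the extracted scores and loadings and repeating this step; SIMPLS differs mainly in how the deflation is done.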

Page 5: PLS and Cross Validation

Terminology of Multiple Regression and PLS

• s is the root mean square (RMS) or standard error.
• r2 is the proportion of the original variance explained.
• F-ratio is the ratio of explained to unexplained variance.
• residual is the difference between the actual and calculated target property values.
• equation is the QSAR itself: a set of coefficients and an intercept or offset, used for prediction.
• prediction = intercept + (explanatory1 * coeff1) + (explanatory2 * coeff2) + (explanatory3 * coeff3) + …
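The prediction equation above can be evaluated directly. The intercept, coefficient, and descriptor values below are made-up illustrative numbers, not taken from any fitted model:

```python
# Hypothetical QSAR equation: illustrative intercept and coefficients only.
intercept = 0.5
coeffs = [1.2, -0.8, 0.3]          # coeff1..coeff3 from the fitted equation
descriptors = [2.0, 1.0, 4.0]      # explanatory1..explanatory3 for one compound

# prediction = intercept + (explanatory1 * coeff1) + (explanatory2 * coeff2) + ...
prediction = intercept + sum(d * c for d, c in zip(descriptors, coeffs))
print(prediction)                  # approximately 3.3 (0.5 + 2.4 - 0.8 + 1.2)
```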

Page 6: PLS and Cross Validation

• In cross-validation with PLS, some indices change and some are omitted as meaningless.

• The key difference is in the definition of the s value.

• In analyses not involving cross-validation, s is the uncertainty remaining after the least-squares fit has been performed.

• In cross-validation, s becomes the expected uncertainty in prediction for an individual compound, based on the data available from the other compounds in the set; in this context, s is the root mean PRedictive Error Sum of Squares (PRESS).

Page 7: PLS and Cross Validation

• The “cross-validated r2” (called q2) is often much lower than the conventional r2 for the same data.

• However, PRESS and q2 are generally proving to be much better indicators than s and r2 of how reliable predictions are.

• The formula for q2 is:

q2 = 1 − Σ(Ypred − Yactual)² / Σ(Yactual − Ymean)²

• Ypred is a predicted value, Yactual is an actual (experimental) value, and Ymean is the best estimate of the mean of all values that might be predicted; the summations run over the same set of Y values.
• The numerator of the fraction is PRESS.
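The q2 definition translates directly into code when the predicted and actual values are paired lists; the helper name q_squared below is my own:

```python
def q_squared(y_actual, y_pred):
    """q2 = 1 - PRESS / sum((Yactual - Ymean)^2), summed over the same Y set."""
    y_mean = sum(y_actual) / len(y_actual)
    press = sum((a - p) ** 2 for a, p in zip(y_actual, y_pred))   # numerator: PRESS
    ss_about_mean = sum((a - y_mean) ** 2 for a in y_actual)
    return 1.0 - press / ss_about_mean
```

A q2 of 1 means perfect prediction; the value can go negative when the predictions are worse than simply guessing the mean.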

Page 8: PLS and Cross Validation

COMPARISON OF PLS WITH MULTIPLE REGRESSION

Major advantages PLS offers over MR:
• the ability to produce robust equations even when the number of 'independent variables' vastly exceeds the number of experimental observations
• predictions that are more accurate than those of MR
• models that are much more stable
• the ability to simultaneously derive models for more than one dependent variable
• much more rapid computation with large data matrices

Page 9: PLS and Cross Validation

Cross validation

• Cross-validation (as used in PLS):
– Remove one or more pieces of input data
– Rederive the QSAR equation
– Calculate the omitted data
– Compute the root-mean-square error to evaluate the efficacy of the model

• Typically 20% of the data is removed for each iteration.
• The model with the lowest RMS error has the optimal number of components/descriptors.
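The steps above can be sketched generically. Here fit and predict stand in for whatever routine rederives and applies the QSAR equation (they, the function name, and the fixed seed are placeholders of mine); k=5 omits roughly 20% of the data per iteration:

```python
import random

def cross_validate_rms(points, fit, predict, k=5):
    """Split the data into k groups; for each group, rederive the model
    without it, predict the omitted points, and accumulate squared error."""
    idx = list(range(len(points)))
    random.Random(0).shuffle(idx)              # fixed seed for repeatability
    folds = [idx[i::k] for i in range(k)]      # ~20% of the data per fold when k=5
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [points[i] for i in idx if i not in held_out]
        model = fit(train)                     # rederive the QSAR equation
        for i in fold:
            x, y = points[i]
            sq_err += (predict(model, x) - y) ** 2
    return (sq_err / len(points)) ** 0.5       # root-mean-square prediction error
```

Repeating this for models with different numbers of components and keeping the one with the lowest returned RMS error implements the selection rule stated above.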

Page 10: PLS and Cross Validation

Leave-one-out cross validation

• Leave-one-out cross validation (LOOCV) is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set.

• That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point.

• As before, the average error is computed and used to evaluate the model.
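LOOCV as described (K = N) can be sketched as follows; fit and predict are placeholder hooks of mine for the model being evaluated, and the return value is the root mean predictive error:

```python
def loocv_rms(points, fit, predict):
    """Leave-one-out: train on all points but one, predict the omitted
    point, repeat N times, and return the RMS of the N prediction errors."""
    press = 0.0
    for i, (x, y) in enumerate(points):
        train = points[:i] + points[i + 1:]    # all data except point i
        model = fit(train)
        press += (predict(model, x) - y) ** 2  # predictive error for point i
    return (press / len(points)) ** 0.5
```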

Page 17: PLS and Cross Validation

• In cross-validation, one value is left out and a model is derived using the remaining data.

• The model is then used to predict the value originally left out. This procedure is repeated for all values, yielding q2.

• q2 is normally (much) lower than r2; values greater than 0.5 already indicate significant predictive power.

Page 18: PLS and Cross Validation

Thank you