TRANSCRIPT
Predicting Potent Compounds via Model-Based Global Optimization
Master's thesis by Mohsen Ahmadi
Supervisors
• Prof. Dr. Holger Fröhlich
• Prof. Dr. Stefan Wrobel
Research Abstract
• QSAR (Quantitative Structure-Activity Relationship)
• It correlates the structure of a chemical compound with its biological effect
• The ultimate goal of QSAR techniques is to identify highly potent compounds
• Interpreting the problem as a global optimization scenario
• Devising an expected improvement (EI) criterion
Agenda / Topics
• Introduction and Motivation
• Methods
• Key Findings/Results
• Summary and Conclusion
Introduction and Motivation
Introduction
• The domain of chemical compounds is very large and grows every day
• Because of their effects on the environment and on humans, their activity should be characterized
• Activity can be described as having or producing an effect on a living organism, tissue or cell
• Three different approaches to evaluate the biological activity of a compound:
1) in vivo
2) in vitro
3) in silico
Continue …
• The principal hope behind “in silico” approaches is to predict the properties of compounds based on specific models and strategies.
• QSAR modeling is an important example of “in silico” methods
QSAR Modeling
Motivation
• The objective of QSAR techniques is, ultimately, the identification or design of highly potent compounds
• Interpreted as a global optimization problem
• Optimization function of interest:
1) maximizes the compound activity
2) minimizes the number of steps needed for evaluation
Continue …
• Build a QSAR model with a machine learning algorithm: Gaussian process regression (GPR)
• Rank compounds w.r.t. their likelihood of being more potent, using expected improvement (EI)
• Compare the performance of this technique against:
1) GP
2) NN
3) RA
Methods
Gaussian Process Regression
• A powerful, non-parametric method for supervised learning
• Problem definition:
Given:
• Data set D consisting of n input vectors x1, x2, …, xn of dimension d
• Corresponding continuous outputs y1, y2, …, yn blurred by normally distributed noise
Wanted:
• Derive a function f: R^d -> R from the given data D
Continue …
• Bayesian inference can be applied to infer f from training data
• By definition, a Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution
• Univariate Gaussian distribution: p(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
Continue …
• A Gaussian process can be considered a generalization of the Gaussian probability distribution to distributions over functions; it is fully determined by a mean function and a covariance function
Continue …
• The mean function m(x) is assumed to be zero in most applications
• A widely used covariance function is the squared exponential: k(x, x') = σ_f² exp(−‖x − x'‖² / (2ℓ²))
Continue …
• The covariance function should be computed for all possible pairs of points
Continue …
• A Gaussian process is a collection of random variables holding a consistent joint Gaussian distribution, here with mean zero
• Since y and y* are jointly Gaussian random vectors, the conditional distribution of y* given y is given by:
Continue …
• Finally, the prediction can be computed via the conditional mean and variance:
  mean(y*) = K(X*, X) [K(X, X) + σ_n² I]⁻¹ y
  var(y*) = K(X*, X*) − K(X*, X) [K(X, X) + σ_n² I]⁻¹ K(X, X*)
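As an illustration, the predictive equations above can be coded directly. This is a minimal sketch with assumed kernel hyper-parameters and a toy 1-d data set, not the thesis implementation:

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance k(x, x') = s_f^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, X_star, noise_var=1e-2, **kern):
    """Posterior mean and variance of a zero-mean GP at the test points X_star."""
    K = sq_exp_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    K_s = sq_exp_kernel(X, X_star, **kern)
    K_ss = sq_exp_kernel(X_star, X_star, **kern)
    L = np.linalg.cholesky(K)                 # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                      # K(X*,X) [K + s_n^2 I]^-1 y
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - (v ** 2).sum(0)     # K(X*,X*) - K(X*,X)[...]^-1 K(X,X*)
    return mean, var

# Toy 1-d example: noisy samples of sin(x)
X = np.linspace(0.0, 5.0, 10)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.default_rng(0).normal(size=10)
mu, var = gp_predict(X, y, X, noise_var=0.05 ** 2)
```

At the training inputs the posterior mean tracks the observations up to the noise level, and the predictive variance shrinks toward the noise variance.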
Continue …
• The hyper-parameters of the covariance function play an important role in the quality of the regression predictions
Comparison of Gaussian Processes with different values of length scale
Continue …
• Following the maximum likelihood principle, the goal is to maximize the log marginal likelihood:
  log p(y|X) = −(1/2) yᵀ K_y⁻¹ y − (1/2) log|K_y| − (n/2) log 2π, with K_y = K + σ_n² I
• From which we can obtain the partial derivative with respect to each hyper-parameter θ_j:
  ∂/∂θ_j log p(y|X) = (1/2) tr((α αᵀ − K_y⁻¹) ∂K_y/∂θ_j), where α = K_y⁻¹ y
Model-Based Global Optimization
• Dealing with global optimization problem is, in general, difficult and expensive
• Often we have no knowledge of the objective function itself, only some available samples of function values (black-box functions)
• One common paradigm for coping with black-box functions is the use of response surfaces
• A response surface is simply a surface fitted to the observed input-output pairs, together with an estimate of the associated uncertainty
EGO Algorithm
• The EGO algorithm, introduced by Jones, Schonlau and Welch in 1998, is a well-known method for solving global optimization problems:
1) Fit response surface (Kriging/GP) to the data
2) Extract the highest EI
3) Evaluate the black box function at the point with highest EI
4) Update the model with the new information
5) Iterate until stopping criterion is reached
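The five steps can be sketched as a short loop. This is a toy illustration, maximizing an assumed stand-in black-box function over a fixed candidate grid with a bare-bones GP surrogate, not a full Kriging implementation:

```python
import numpy as np
from scipy.stats import norm

def black_box(x):                  # expensive function to maximize (toy stand-in)
    return -(x - 2.0) ** 2

def gp_fit_predict(X, y, X_star, ls=1.0, noise=1e-6):
    """Step 1: fit a zero-mean GP response surface and predict at X_star."""
    k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, X_star)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, y_best):
    """Step 2: closed-form EI for a Gaussian prediction."""
    s = np.sqrt(var)
    z = (mean - y_best) / s
    return (mean - y_best) * norm.cdf(z) + s * norm.pdf(z)

X = np.array([0.0, 1.0, 4.0])                 # initial design
y = black_box(X)
cand = np.linspace(0.0, 5.0, 101)             # candidate test points
for _ in range(10):                           # steps 1-5, iterated
    mean, var = gp_fit_predict(X, y, cand)
    x_next = cand[np.argmax(expected_improvement(mean, var, y.max()))]
    X = np.append(X, x_next)                  # steps 3-4: evaluate and update
    y = np.append(y, black_box(x_next))
```

Within a few iterations the sampled points concentrate around the true maximizer x = 2, because EI trades off the predicted mean against the remaining uncertainty.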
The key problem to be addressed
Given
• A training set D = {(x1, y1), (x2, y2), …, (xn, yn)} corresponding to n chemical compounds together with their potency values
Goal:
• Explore the response surface for points that are likely to correspond to the most potent compound
Expected Potency Improvement
• Gaussian process regression copes well with high-dimensional data, and each prediction comes with an uncertainty estimate
• Indeed, we can use it to ask what improvement over the current best sample we expect to obtain by sampling at any test point
Continue …
• Formally, the improvement I(x*) for a compound is defined as I(x*) = max(0, y(x*) − y_max), where y_max is the best potency observed so far
• The expectation value of I(x*), which we name the expected potency improvement (EI), is then
  EI(x*) = (ŷ(x*) − y_max) Φ(z) + σ̂(x*) φ(z), with z = (ŷ(x*) − y_max) / σ̂(x*)
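Under a Gaussian prediction N(μ, σ²), this expectation has a closed form; a minimal sketch (not the thesis code):

```python
import numpy as np
from scipy.stats import norm

def expected_potency_improvement(mu, sigma, y_best):
    """Closed-form E[max(0, y* - y_best)] for a Gaussian prediction N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)     # guard against a zero predicted spread
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```

When the predicted mean is far above the current best, EI approaches μ − y_best; when it is far below, EI approaches zero but never becomes negative.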
ECSGO Algorithm
Modified Variance Estimation
• The prediction variance of GPR model was often underestimated
• Leading to an expected potency improvement close to zero in many cases
• y_NN is the potency of the closest (nearest-neighbor) training compound to x*, based on the Tanimoto coefficient (range [-1/3, 1])
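For real-valued descriptor vectors, the continuous Tanimoto coefficient can be sketched as follows; the [-1/3, 1] range quoted above falls out of the formula:

```python
import numpy as np

def tanimoto(a, b):
    """Continuous Tanimoto coefficient <a,b> / (<a,a> + <b,b> - <a,b>)."""
    num = np.dot(a, b)
    return num / (np.dot(a, a) + np.dot(b, b) - num)
```

Identical vectors give 1, orthogonal vectors give 0, and the minimum of -1/3 is attained for a = -b.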
Different Comparison Methods
• GP (Gaussian Process): The compound with the best predicted potency in test data is chosen.
• NN (Nearest Neighbor): The compound that is nearest to the most potent compound of the training data is chosen. The distance measure is based on the Tanimoto coefficient.
• RA (Random): As its name suggests, this strategy is based on random selection.
Data Normalization
• Data normalization or feature scaling is a common pre-processing task in machine learning algorithms
• It prepares the data before they are fed into the algorithm
• Preparing the data means preventing features with numerically larger values from dominating
• Through normalization, the data are shifted and scaled so that all features hold roughly the same mean (e.g. zero) and standard deviation (e.g. one)
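A minimal sketch of such scaling; note that the test data are transformed with the training statistics to avoid information leakage (the variable names are illustrative):

```python
import numpy as np

def zscore(train, test):
    """Scale features to zero mean / unit std, using training statistics only."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)     # leave constant features unchanged
    return (train - mu) / sd, (test - mu) / sd
```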
Feature Selection
• Feature selection is the task of selecting the most relevant subset of features to be used for the learning algorithm and ignoring the rest
• It is effective for reducing dimensionality, discarding irrelevant and redundant data, and increasing learning accuracy
• There exist mainly three classes of feature selection methods:
• Wrapper methods (candidate feature subsets are scored by the learning algorithm itself)
• Embedded methods (variable selection as part of the learning procedure)
• Filter methods (features are scored independently of the learning algorithm)
Spearman's Rank Correlation
• A non-parametric measure of relatedness between two random variables
• It describes the relationship between two variables using a monotonic function
• Spearman's rank correlation is a statistical measure of the strength of a monotonic correlation between paired ranked data (range [-1, 1])
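Spearman's rho is simply the Pearson correlation of the ranks, e.g.:

```python
import numpy as np
from scipy.stats import rankdata

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks; lies in [-1, 1]."""
    rx, ry = rankdata(x), rankdata(y)
    return np.corrcoef(rx, ry)[0, 1]
```

Any strictly monotonic (even nonlinear) relationship yields |rho| = 1, which is exactly what makes it suitable for potency rankings.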
Multiple Testing Correction
• In a single hypothesis testing, we reject the null hypothesis if computed p-value is less than a significance level
• However, if we test, say, m hypotheses simultaneously, some null hypotheses are likely to be rejected falsely purely by chance
• To address this problem, we used an algorithm proposed by Benjamini and Hochberg (1995) which controls the so-called False Discovery Rate (FDR)
Benjamini-Hochberg Algorithm
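The Benjamini-Hochberg step-up procedure can be sketched as follows (a minimal illustration, not the thesis code):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling the FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m      # alpha * i / m for i = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest i with p_(i) <= alpha*i/m
        reject[order[:k + 1]] = True                  # reject all hypotheses up to rank k
    return reject
```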
SAR Index (SARI)
• SARI is used to investigate the association between compound similarities and the corresponding potencies
• It quantifies the degree of smoothness of the SAR landscape (range [0, 1])
• The continuity score defines the potency-weighted structural diversity of compounds (range [0, 1])
• The discontinuity score reflects the average potency difference among similar compound pairs (range [0, 1])
Wilcoxon Signed-rank Test
• It is designed to evaluate the difference between two distributions in a non-parametric manner
• Hypotheses are stated for this test as follows:
• H0: There is no difference between the two mean ranks (null hypothesis)
• H1: There is a difference between the two mean ranks (alternative hypothesis)
Wilcoxon Signed-rank Test Algorithm
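A quick illustration using SciPy's implementation on synthetic paired samples (the data here are assumed, not from the thesis):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)        # e.g. evaluation-step scores of method A
b = a + rng.normal(0.5, 0.3, 50)    # method B: the same pairs, systematically shifted
stat, p = wilcoxon(a, b)            # tests H0: no shift between the paired samples
```

With a clear systematic shift between the pairs, the test rejects H0 at any common significance level.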
Key Findings/Results
Datasets and Descriptors
• Compound data sets covering 12 human targets
• Collected from the ChEMBL database, version 13
• Each data set was composed of 186 numerical 2-D descriptors computed by the Molecular Operating Environment (MOE), together with potency values
Simulation Set-up
Repeat 25 times:
• Split the data into training and test sets of roughly equal size
• Apply normalization
• Apply feature selection
• Repeat until no test data is left:
  • Apply GPR (for EI & GP)
  • Compute EI values (for EI)
  • Find the compound with maximum EI (for EI)
  • Find the compound with maximum predicted mean (for GP)
  • Find the test compound most similar to the most potent training compound (for NN)
  • Pick a compound at random (for RA)
  • Select a compound according to the compound selection strategy
  • Record the real potency value of the selected compound
  • Add this compound to the training data
  • Remove it from the test data
A Compound with the Best Potency
Average number of evaluation steps to find a compound with the best potency, for each of the 12 data sets and for the different comparison methods
Three Most Potent Compounds
Median number of evaluation steps to find a compound with the best (left), second best (middle) and third best (right) potency over all data sets, for the different comparison methods
Overall Behavior
Median number of evaluation steps to find any of the three most potent compounds over all data sets, for the different comparison methods
Wilcoxon Signed-rank Test
              Significant reduction        Most potent over    Any of three most potent
              (most potent compound)       all data sets       compounds over all data sets
EI vs. NN     6 out of 12 data sets        P = 3.3e-8          P = 9.8e-7
EI vs. GP     4 out of 12 data sets        P = 4.69e-7         P = 0.0042
EI vs. RA     10 out of 12 data sets       P = 2.392e-32       P = 2.591e-68
Frequency of Selected Descriptors
• To check whether the algorithm favors specific descriptors
• We found no dependency of the EI method's performance on the selection of specific descriptors
Influence of Training Set Size
Smaller training sets led to smaller differences in the number of search steps needed to reach the most potent compound
Summary and Conclusion
Summary and Conclusion
• We have introduced a model-based global optimization strategy to identify maximally bioactive compounds
• With the help of the expected potency improvement criterion, the strategy efficiently identified the most potent compounds
• It showed overall better search performance (i.e. fewer evaluation steps) than the nearest-neighbor approach that is often applied in virtual screening
Appendix 1
Emilio Benfenati, Claire Mays, and Simon Pardoe. Theory, Guidance and Applications on QSAR and REACH. ORCHESTRA, 2012.
Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR, abs/1012.2599, 2010.
M. Ebden. Gaussian Processes for Regression: A Quick Introduction. Technical report, Department of Engineering Science, University of Oxford, Aug 2008.
Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13:455-492, 1998.
Thomas M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, USA, 1997.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Yvan Saeys, Inaki Inza, and Pedro Larranaga. A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics, 23(19):2507-2517, September 2007.
Appendix 2
Lisa Peltason and Juergen Bajorath. SAR Index: Quantifying the Nature of Structure-Activity Relationships. Journal of Medicinal Chemistry, 50(23):5571-5578, 2007.
H. Froehlich and A. Zell. Efficient Parameter Selection for Support Vector Machines in Classification and Regression via Model-Based Global Optimization. Proc. Int. Joint Conf. on Neural Networks (IJCNN), 3:1431-1438, 2005.
Rob Womersley. Local and Global Optimization: Formulation, Methods and Applications. Technical report, School of Mathematics and Statistics, University of New South Wales, 2008.
Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. Roy. Statist. Soc. Ser. B (Methodological), 57:289-300, 1995.
Iain Weir. Spearman's Rank Correlation: Introduction. http://www.statstutor.ac.uk/, University of the West of England.
ChEMBL. European Bioinformatics Institute (EBI). http://www.ebi.ac.uk/. Accessed September 18, 2012.
Molecular Operating Environment (MOE). Chemical Computing Group, 1010 Sherbrooke St. W, Suite 910, Montreal, Quebec, Canada H3A 2R7. http://www.chemcomp.com/. Accessed September 18, 2012.