

Page 1: Predicting Potent Compounds via Model-Based Global

Predicting Potent Compounds via Model-Based Global Optimization

Master's thesis by Mohsen Ahmadi

Supervisors

• Prof. Dr. Holger Fröhlich

• Prof. Dr. Stefan Wrobel


Page 2: Predicting Potent Compounds via Model-Based Global

Research Abstract

• QSAR (Quantitative Structure-Activity Relationship)

• QSAR correlates the structure of a chemical compound with its biological effect

• The ultimate goal of QSAR techniques is to identify highly potent compounds

• We interpret this problem as a global optimization scenario

• We devise an expected improvement (EI) criterion

Page 3: Predicting Potent Compounds via Model-Based Global

Agenda / Topics

• Introduction and Motivation

• Methods

• Key Findings/Results

• Summary and Conclusion

Page 4: Predicting Potent Compounds via Model-Based Global

Introduction and Motivation

Page 5: Predicting Potent Compounds via Model-Based Global

Introduction

• The domain of chemical compounds is very large and grows every day.

• Because of their effects on the environment and on humans, their activity needs to be examined

• Activity can be described as having or producing an effect on a living organism, tissue or cell

• Three different approaches to evaluate the biological activity of a compound:

1) in vivo

2) in vitro

3) in silico

Page 6: Predicting Potent Compounds via Model-Based Global

Continue …

• The principal hope behind “in silico” approaches is to predict the properties of compounds from specific models and strategies.

• QSAR modeling is an important example of an “in silico” method

Figure: QSAR Modeling

Page 7: Predicting Potent Compounds via Model-Based Global

Motivation

• In the end, the objective of QSAR techniques is to recognize or design a highly potent compound

• This can be interpreted as a global optimization problem

• Optimization function of interest:

1) maximizes the compound activity

2) minimizes the number of steps needed for evaluation

Page 8: Predicting Potent Compounds via Model-Based Global

Continue …

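• Build the QSAR model with a machine learning algorithm, Gaussian process regression (GPR)

• Rank compounds w.r.t. their likelihood of being more potent, using expected improvement (EI)

• Compare the performance of this technique against:

1) GP (Gaussian process prediction)

2) NN (nearest neighbor)

3) RA (random selection)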

Page 9: Predicting Potent Compounds via Model-Based Global

Methods

Page 10: Predicting Potent Compounds via Model-Based Global

Gaussian Process Regression

• A powerful, non-parametric algorithm for supervised learning

• Problem definition:

Given:

• Data set D consisting of n input vectors x1, x2, …, xn of dimension d

• Corresponding continuous outputs y1, y2, …, yn perturbed by normally distributed noise

Wanted:

• A function $f: \mathbb{R}^d \to \mathbb{R}$ derived from the given data D

Page 11: Predicting Potent Compounds via Model-Based Global

Continue …

• Bayesian inference can be applied to infer f from training data

• By definition, a Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution

• Univariate Gaussian distribution:
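For reference, the standard density of a univariate Gaussian with mean $\mu$ and variance $\sigma^2$ (symbol names assumed here) can be written as:

```latex
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```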


Page 12: Predicting Potent Compounds via Model-Based Global

Continue …

• A Gaussian Process can be considered a generalization of the Gaussian probability distribution to functions, and it is fully determined by a mean function and a covariance function
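In the common notation of Rasmussen and Williams (listed in the appendix), a Gaussian process over functions and its mean and covariance functions are usually written as follows; the symbols $m$ and $k$ are the conventional choices and are assumed here:

```latex
f(x) \sim \mathcal{GP}\bigl(m(x),\, k(x, x')\bigr), \qquad
m(x) = \mathbb{E}[f(x)], \qquad
k(x, x') = \mathbb{E}\bigl[(f(x) - m(x))\,(f(x') - m(x'))\bigr]
```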

Page 13: Predicting Potent Compounds via Model-Based Global

Continue …

• The mean function m(x) is assumed to be zero in most applications

• A widely used covariance function is the squared exponential:
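One common parameterization of the squared exponential covariance, following Ebden's tutorial listed in the appendix, is shown below. The exact form used in the thesis is an assumption here; $\sigma_f$ is the signal standard deviation, $\ell$ the length scale, $\sigma_n$ the noise standard deviation, and $\delta(x, x')$ equals one only when $x = x'$:

```latex
k(x, x') = \sigma_f^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right)
  + \sigma_n^2\, \delta(x, x')
```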

Page 14: Predicting Potent Compounds via Model-Based Global

Continue …

• The covariance function should be computed for all possible pairs of points


Where:

Page 15: Predicting Potent Compounds via Model-Based Global

Continue …

• A Gaussian process is a set of random variables that follow a joint Gaussian distribution, here with mean zero

• Since y and y* are jointly Gaussian random vectors, the conditional distribution of y* given y is:
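In the notation of Ebden's tutorial (an assumed notation), with $K$ the covariance matrix of the training inputs, $K_*$ the covariances between the test point and the training inputs, and $K_{**}$ the covariance of the test point with itself, the joint and conditional distributions take the standard form:

```latex
\begin{bmatrix} y \\ y_* \end{bmatrix}
  \sim \mathcal{N}\!\left( 0,\;
  \begin{bmatrix} K & K_*^{\top} \\ K_* & K_{**} \end{bmatrix} \right),
\qquad
y_* \mid y \sim \mathcal{N}\!\left( K_* K^{-1} y,\;
  K_{**} - K_* K^{-1} K_*^{\top} \right)
```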

Page 16: Predicting Potent Compounds via Model-Based Global

Continue …

• Finally, the prediction can be computed via:
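The prediction step can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: it assumes a zero-mean prior, a squared exponential covariance with noise on the diagonal, and hypothetical names (`sq_exp_kernel`, `gp_predict`, `sigma_f`, `length`, `sigma_n`). The predictive mean is $K_* K^{-1} y$ and the predictive variance is the diagonal of $K_{**} - K_* K^{-1} K_*^{\top}$.

```python
import numpy as np


def sq_exp_kernel(a, b, sigma_f=1.0, length=1.0):
    """Noise-free squared exponential covariance between rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / length ** 2)


def gp_predict(x_train, y_train, x_test, sigma_f=1.0, length=1.0, sigma_n=0.1):
    """Predictive mean and variance of a zero-mean GP at x_test."""
    # K: training covariance with the noise variance on the diagonal.
    k = sq_exp_kernel(x_train, x_train, sigma_f, length)
    k += sigma_n ** 2 * np.eye(len(x_train))
    # K*: covariance between test and training inputs.
    k_star = sq_exp_kernel(x_test, x_train, sigma_f, length)

    chol = np.linalg.cholesky(k)  # K = L L^T, avoids an explicit inverse
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_star @ alpha  # K* K^-1 y

    v = np.linalg.solve(chol, k_star.T)
    # diag(K** - K* K^-1 K*^T); K** includes the noise variance.
    var = sigma_f ** 2 + sigma_n ** 2 - (v ** 2).sum(axis=0)
    return mean, var


# Toy data (illustrative only): three 1-D training points.
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 0.8, 0.9])
x_test = np.array([[1.0], [5.0]])
mean, var = gp_predict(x_train, y_train, x_test)
```

Near the training data the predictive variance is small; far away from it (here at x = 5) the mean falls back to the prior mean of zero and the variance approaches the prior variance.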


Page 17: Predicting Potent Compounds via Model-Based Global

Continue …

• Hyper-parameters of the covariance function

• They play an important role in the performance of regression predictions

Figure: Comparison of Gaussian processes with different values of the length scale

Page 18: Predicting Potent Compounds via Model-Based Global

Continue …

• Following the maximum likelihood principle, the goal is to maximize the log marginal likelihood:

• From this we obtain the partial derivatives with respect to the hyper-parameters:
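In the standard form given by Rasmussen and Williams (listed in the appendix), with $K$ the training covariance matrix including noise, $n$ the number of training points and $\theta_j$ a single hyper-parameter (symbol names assumed), the log marginal likelihood and its partial derivative are:

```latex
\begin{align}
\log p(y \mid X, \theta) &= -\tfrac{1}{2}\, y^{\top} K^{-1} y
  - \tfrac{1}{2} \log \lvert K \rvert - \tfrac{n}{2} \log 2\pi \\
\frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta) &=
  \tfrac{1}{2}\, y^{\top} K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y
  - \tfrac{1}{2} \operatorname{tr}\!\left( K^{-1} \frac{\partial K}{\partial \theta_j} \right)
\end{align}
```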

Page 19: Predicting Potent Compounds via Model-Based Global

Model-Based Global Optimization

• Global optimization is, in general, difficult and expensive

• Sometimes we have no knowledge of the objective function itself, only some samples of its values (black-box functions)

• One common paradigm for dealing with black-box functions is the use of response surfaces

• A response surface is simply a surface fitted to the observed input-output pairs, together with an estimate of its uncertainty

Page 20: Predicting Potent Compounds via Model-Based Global

EGO Algorithm

• EGO is a well-known algorithm for solving global optimization problems, introduced by Jones, Schonlau and Welch in 1998:

1) Fit a response surface (Kriging/GP) to the data

2) Find the point with the highest EI

3) Evaluate the black-box function at that point

4) Update the model with the new information

5) Iterate until a stopping criterion is reached
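The five steps can be sketched as a short loop. This is an illustration under stated assumptions, not the original EGO code: the search runs over a finite grid of 1-D candidate points, the response surface is a zero-mean GP with a squared exponential kernel of fixed length scale, the stopping criterion is a fixed number of iterations, and all names (`ego`, `gp_posterior`, `toy_objective`) are hypothetical.

```python
import math

import numpy as np


def kernel(a, b, length=0.5):
    """Squared exponential covariance for 1-D inputs (assumed kernel)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-3):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    k = kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_star = kernel(x_new, x_obs)
    mean = k_star @ np.linalg.solve(k, y_obs)
    var = 1.0 - np.einsum("ij,ji->i", k_star, np.linalg.solve(k, k_star.T))
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Closed-form EI for maximization under Gaussian predictions."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def ego(black_box, candidates, x_init, n_iter=10):
    """Run the five EGO steps on a finite pool of candidate points."""
    x_obs = np.array(x_init, dtype=float)
    y_obs = np.array([black_box(x) for x in x_obs])
    for _ in range(n_iter):  # 5) iterate, here for a fixed budget
        mean, std = gp_posterior(x_obs, y_obs, candidates)  # 1) fit surface
        ei = expected_improvement(mean, std, y_obs.max())
        x_next = candidates[np.argmax(ei)]  # 2) point with highest EI
        y_next = black_box(x_next)  # 3) evaluate the black box there
        x_obs = np.append(x_obs, x_next)  # 4) update with the new sample
        y_obs = np.append(y_obs, y_next)
    return x_obs[np.argmax(y_obs)], y_obs.max()


def toy_objective(x):
    """Stand-in for an expensive black-box function; maximum at x = 2."""
    return -(x - 2.0) ** 2


grid = np.linspace(0.0, 4.0, 81)
x_best, y_best = ego(toy_objective, grid, x_init=[0.0, 4.0], n_iter=15)
```

Because EI is large both where the predicted mean is high and where the model is uncertain, the loop balances exploiting known good regions against exploring unsampled ones.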

Page 21: Predicting Potent Compounds via Model-Based Global

The key problem to be addressed

Given:

• A training set D = {(x1, y1), (x2, y2), …, (xn, yn)} of n chemical compounds together with their potency values

Goal:

• Explore the response surface for points that are likely to correspond to the most potent compound

Page 22: Predicting Potent Compounds via Model-Based Global

Expected Potency Improvement

• Gaussian process regression copes well with high-dimensional data, and each prediction comes with an uncertainty estimate

• We can therefore ask what improvement over the current best sample we expect to get by sampling at any test point

Page 23: Predicting Potent Compounds via Model-Based Global

Continue …

Page 24: Predicting Potent Compounds via Model-Based Global

Continue …

• Formally, the improvement I(x*) for a compound x* is defined as:

• The expectation of I(x*), which we call the expected potency improvement (EI), is defined as follows:
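A minimal sketch of these two quantities, assuming the usual formulation for maximization (as in Jones et al. and Brochu et al., listed in the appendix): the improvement is $I(x^*) = \max(0,\, y^* - y_{\max})$, where $y_{\max}$ is the best potency observed so far, and for a Gaussian prediction with mean $\mu$ and standard deviation $\sigma$ its expectation is $(\mu - y_{\max})\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - y_{\max})/\sigma$. The function and argument names are illustrative and not taken from the thesis.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Closed-form expected improvement for maximization.

    mean, std: GP predictive mean and standard deviation at a test compound.
    best_so_far: highest potency among the compounds evaluated so far.
    """
    if std <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(0.0, mean - best_so_far)
    z = (mean - best_so_far) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mean - best_so_far) * cdf + std * pdf


# Same predicted mean, different uncertainty: the more uncertain
# prediction receives the larger expected improvement.
ei_certain = expected_improvement(mean=6.0, std=0.1, best_so_far=7.0)
ei_uncertain = expected_improvement(mean=6.0, std=1.5, best_so_far=7.0)
```

The first term rewards compounds whose predicted potency already exceeds the current best; the second rewards compounds whose prediction is uncertain enough that they might exceed it.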

Page 25: Predicting Potent Compounds via Model-Based Global

ECSGO Algorithm

Page 26: Predicting Potent Compounds via Model-Based Global

Modified Variance Estimation

• The prediction variance of the GPR model was often underestimated

• This led to an expected potency improvement close to zero in many cases

• yNN is the potency of the training compound closest to x* (its nearest neighbor) according to the Tanimoto coefficient, which takes values in [-0.333, 1]
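The stated range of [-0.333, 1] matches the continuous form of the Tanimoto coefficient for real-valued vectors, $T(a, b) = a \cdot b \,/\, (\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b)$. The sketch below assumes that form and shows how the nearest-neighbor potency could be looked up; the helper names are hypothetical.

```python
def tanimoto(a, b):
    """Continuous Tanimoto coefficient between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denom = norm_a + norm_b - dot
    if denom == 0.0:
        return 1.0  # both vectors are all zeros; treated as identical here
    return dot / denom


def nearest_neighbor_potency(x_star, train_x, train_y):
    """Potency of the training compound most similar to x_star."""
    best = max(range(len(train_x)), key=lambda i: tanimoto(x_star, train_x[i]))
    return train_y[best]


# Toy descriptors and potencies (illustrative values only).
train_x = [[1.0, 2.1], [-1.0, 0.0], [0.2, 0.1]]
train_y = [7.5, 3.0, 5.2]
y_nn = nearest_neighbor_potency([1.0, 2.0], train_x, train_y)
```

Identical vectors give a coefficient of 1, and exactly opposite vectors give the lower bound of -1/3.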

Page 27: Predicting Potent Compounds via Model-Based Global

Different Comparison Methods

• GP (Gaussian Process): The compound with the best predicted potency in the test data is chosen.

• NN (Nearest Neighbor): The compound nearest to the most potent compound of the training data is chosen. The distance measure is based on the Tanimoto coefficient.

• RA (Random): As its name suggests, the strategy is based on random selection.

Page 28: Predicting Potent Compounds via Model-Based Global

Data Normalization

• Data normalization, or feature scaling, is a common pre-processing task in machine learning

• It prepares the data before they are passed to the algorithm

• Preparation here means preventing features with larger numeric values from dominating

• After normalization, all features have roughly the same mean (e.g. zero) and standard deviation (e.g. one).
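A minimal z-score sketch of this step is shown below. It assumes that the mean and standard deviation are estimated on the training data only and then reused for the test data, which is common practice but not stated on the slide; the function names are hypothetical.

```python
import numpy as np


def fit_scaler(x_train):
    """Per-feature mean and standard deviation of the training data."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0.0] = 1.0  # constant features: avoid division by zero
    return mean, std


def apply_scaler(x, mean, std):
    """Shift and scale each feature with the given statistics."""
    return (x - mean) / std


# Two descriptors on very different numeric scales (toy values).
x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_test = np.array([[2.5, 150.0]])

mean, std = fit_scaler(x_train)
x_train_scaled = apply_scaler(x_train, mean, std)
x_test_scaled = apply_scaler(x_test, mean, std)
```

After scaling, both descriptors contribute on a comparable numeric scale to distance-based quantities such as the squared exponential covariance.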

Page 29: Predicting Potent Compounds via Model-Based Global

Feature Selection

• Feature selection is the task of selecting the most relevant subset of features for the learning algorithm and ignoring the rest

• It is effective for reducing dimensionality, discarding irrelevant and redundant data, and increasing learning accuracy

• There are three main classes of feature selection methods:

• Wrapper methods

• Embedded methods (variable selection as part of the learning procedure)

• Filter methods

Page 30: Predicting Potent Compounds via Model-Based Global

Spearman's Rank Correlation

• A non-parametric measure of association between two random variables

• It assesses how well the relationship between two variables can be described by a monotonic function

• It measures the strength of a monotonic relationship between paired ranked data, with values in [-1, 1]
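As a sketch, Spearman's coefficient can be computed as the Pearson correlation of the rank-transformed samples, with tied values sharing the mean of their ranks. The helper names below are illustrative, and the function is undefined (division by zero) when one of the inputs is constant.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A monotonic but non-linear relationship still has perfect rank correlation.
rho_monotonic = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
rho_reversed = spearman([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```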

Page 31: Predicting Potent Compounds via Model-Based Global

Multiple Testing Correction

• In a single hypothesis test, we reject the null hypothesis if the computed p-value is below the significance level

• However, if we test, for example, m hypotheses simultaneously, some null hypotheses are likely to be rejected falsely by chance.

• To address this problem, we used the algorithm proposed by Benjamini and Hochberg (1995), which controls the so-called False Discovery Rate (FDR)

Page 32: Predicting Potent Compounds via Model-Based Global

Benjamini-Hochberg Algorithm
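A minimal sketch of the standard Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k whose p-value satisfies $p_{(k)} \le \frac{k}{m}\alpha$, and reject the k hypotheses with the smallest p-values. The function name and the default `alpha` are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= k / m * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis up to that rank, even if its own
    # p-value missed its individual threshold (step-up property).
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


decisions = benjamini_hochberg([0.02, 0.9, 0.035, 0.03])
```

In the usage line, only the third-smallest p-value (0.035) meets its threshold of 0.0375, yet the two smaller p-values are rejected along with it.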

Page 33: Predicting Potent Compounds via Model-Based Global

SAR Index (SARI)

• Used to investigate the association between compound similarities and the corresponding potencies

• Quantifies the degree of smoothness of the SAR landscape, with values in [0, 1]

• The continuity score describes the potency-weighted structural diversity of compounds, with values in [0, 1]

• The discontinuity score reflects the average potency difference among similar compound pairs, with values in [0, 1]

Page 34: Predicting Potent Compounds via Model-Based Global

Wilcoxon Signed-rank Test

• Designed to evaluate the difference between two paired samples in a non-parametric manner

• The hypotheses for this test are stated as follows:

• H0: There is no difference between the two mean ranks (null hypothesis)

• H1: There is a difference between the two mean ranks (alternative hypothesis)

Page 35: Predicting Potent Compounds via Model-Based Global

Wilcoxon Signed-rank Test Algorithm
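The usual textbook procedure can be sketched as follows: take the paired differences, drop zeros, rank the absolute differences (ties share their mean rank), sum the ranks of positive and negative differences, and compare the smaller sum with its distribution under H0. The sketch assumes a two-sided test with the normal approximation and no tie or continuity correction; it needs at least one non-zero difference, and the function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples; p uses the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:  # assign average ranks to tied absolute differences
        end = start
        while (end + 1 < n
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2.0 + 1.0
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)  # test statistic
    mean = n * (n + 1) / 4.0  # expected value of W under H0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / std  # z <= 0 because W is the smaller rank sum
    p = 1.0 + math.erf(z / math.sqrt(2.0))  # two-sided p-value, 2 * Phi(z)
    return w, min(1.0, p)


# Toy paired step counts for two methods on eight runs (illustrative values).
steps_a = [10, 12, 14, 16, 18, 20, 22, 24]
steps_b = [9, 10, 11, 12, 13, 14, 15, 16]
w_stat, p_value = wilcoxon_signed_rank(steps_a, steps_b)
```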

Page 36: Predicting Potent Compounds via Model-Based Global

Key Findings/Results

Page 37: Predicting Potent Compounds via Model-Based Global

Datasets and Descriptors

• Compound data sets for 12 human targets

• Collected from the ChEMBL database, version 13

• Each data set comprised 186 numerical 2D descriptors computed with the Molecular Operating Environment (MOE), together with potency values

Page 38: Predicting Potent Compounds via Model-Based Global

Simulation Set-up

Repeat 25 times:

1) Split the data into training and test sets of roughly equal size

2) Apply normalization

3) Apply feature selection

4) Repeat until no test data is left:

• Apply GPR (for EI & GP)

• Compute EI values (for EI)

• Find the compound with maximum EI (for EI)

• Find the compound with maximum predicted mean (for GP)

• Find the test compound most similar to the most potent training compound (for NN)

• Pick a compound at random (for RA)

• Select a compound according to the compound selection strategy

• Record the real potency value of the selected compound

• Add this compound to the training data

• Remove it from the test data
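The inner loop of this protocol can be sketched with a pluggable selection strategy. The sketch makes several simplifying assumptions: compounds are (descriptor, potency) pairs with a single 1-D descriptor, the loop stops as soon as the most potent test compound is found instead of exhausting the test data, only the RA and NN strategies are shown, and NN uses absolute distance in place of the Tanimoto coefficient. All names are hypothetical.

```python
import random


def steps_to_best(train, test, select):
    """Count evaluation steps until the most potent test compound is picked.

    train, test: lists of (descriptor, potency) pairs.
    select: function (train, test) -> index into test.
    """
    train, test = list(train), list(test)  # work on copies
    target = max(potency for _, potency in test)
    steps = 0
    while test:
        chosen = test.pop(select(train, test))  # remove it from the test data
        train.append(chosen)  # its real potency is now known
        steps += 1
        if chosen[1] == target:
            break
    return steps


def select_random(train, test):
    """RA strategy: pick any remaining test compound."""
    return random.randrange(len(test))


def select_nearest(train, test):
    """NN strategy: test compound closest to the most potent training compound."""
    best_x = max(train, key=lambda c: c[1])[0]
    return min(range(len(test)), key=lambda i: abs(test[i][0] - best_x))


# Toy data: potency peaks at descriptor value 3.0 (illustrative only).
random.seed(0)
compounds = [(x / 10.0, -(x / 10.0 - 3.0) ** 2) for x in range(60)]
random.shuffle(compounds)
half = len(compounds) // 2
train, test = compounds[:half], compounds[half:]

ra_steps = steps_to_best(train, test, select_random)
nn_steps = steps_to_best(train, test, select_nearest)
```

An EI or GP strategy would plug into the same `select` slot by fitting a GPR model on `train` and scoring every compound in `test`.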

Page 39: Predicting Potent Compounds via Model-Based Global

A Compound with the Best Potency

Figure: Average number of evaluation steps to find a compound with the best potency for each of the 12 data sets, for the different comparison methods

Page 40: Predicting Potent Compounds via Model-Based Global

Three Most Potent Compounds

Figure: Median number of evaluation steps to find a compound with the best (left), second best (middle) and third best (right) potency over all data sets, for the different comparison methods

Page 41: Predicting Potent Compounds via Model-Based Global

Overall Behavior

Figure: Median number of evaluation steps to find any of the three most potent compounds over all data sets, for the different comparison methods

Page 42: Predicting Potent Compounds via Model-Based Global

Wilcoxon Signed-rank Test

| Comparison | Significant reduction to find most potent compound | Most potent over all data sets | Any of three most potent compounds over all data sets |
|---|---|---|---|
| EI vs. NN | 6 out of 12 data sets | P=3.3e-8 | P=9.8e-7 |
| EI vs. GP | 4 out of 12 data sets | P=4.69e-7 | P=0.0042 |
| EI vs. RA | 10 out of 12 data sets | P=2.392e-32 | P=2.591e-68 |

Page 43: Predicting Potent Compounds via Model-Based Global

Frequency of Selected Descriptors

• Goal: check whether the algorithm favors specific descriptors

• We found neither a relation between the performance of EI and the selected descriptors nor a dependency of the EI method's performance on the selection of specific descriptors.

Page 44: Predicting Potent Compounds via Model-Based Global

Influence of Training Set Size

Smaller training sets led to smaller differences in the number of search steps needed to reach the most potent compound

Page 45: Predicting Potent Compounds via Model-Based Global

Summary and Conclusion

Page 46: Predicting Potent Compounds via Model-Based Global

Summary and Conclusion

• We have introduced a computational, model-based global optimization strategy for finding maximally bioactive compounds

• With the help of the expected potency improvement criterion, the strategy identified the most potent compounds efficiently

• Overall better search performance (i.e. fewer evaluation steps) compared with the nearest neighbor approach that is often applied in virtual screening

Page 47: Predicting Potent Compounds via Model-Based Global

Appendix 1

Emilio Benfenati, Claire Mays, and Simon Pardoe. Theory, guidance and applications on QSAR and REACH. ORCHESTRA, 2012.

Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR, abs/1012.2599, 2010.

M. Ebden. Gaussian Processes for Regression: A Quick Introduction. Technical report, Department of Engineering Science, University of Oxford, Aug 2008.

Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13:455-492, 1998.

Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1997.

C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Yvan Saeys, Inaki Inza, and Pedro Larranaga. A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics, 23(19):2507-2517, September 2007.

Page 48: Predicting Potent Compounds via Model-Based Global

Appendix 2

Lisa Peltason and Juergen Bajorath. SAR Index: Quantifying the Nature of Structure Activity Relationships. Journal of Medicinal Chemistry, 50(23):5571-5578, 2007.

H. Froehlich and A. Zell. Efficient Parameter Selection for Support Vector Machines in Classification and Regression via Model-Based Global Optimization. Proc. Int. Joint Conf. on Neural Networks (IJCNN), 3:1431-1438, 2005.

Rob Womersley. Local and Global Optimization, Formulation, Methods and Applications. Technical report, School of Mathematics and Statistics, University of New South Wales, 2008.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B (Methodological), 57:289-300, 1995.

Iain Weir. Spearman's Rank Correlation - Introduction. http://www.statstutor.ac.uk/, University of the West of England.

ChEMBL. European Bioinformatics Institute (EBI). http://www.ebi.ac.uk/, Accessed September 18, 2012.

Molecular Operating Environment (MOE). Chemical Computing Group: 1010 Sherbrooke St. W, Suite 910, Montreal, Quebec, Canada H3A 2R7. http://www.chemcomp.com/, Accessed September 18, 2012.