TRANSCRIPT
Predicting Potent Compounds via Model-Based Global Optimization
Master's thesis by Mohsen Ahmadi
Supervisors
• Prof. Dr. Holger Fröhlich
• Prof. Dr. Stefan Wrobel
Research Abstract
• QSAR (Quantitative Structure-Activity Relationship)
• It correlates the structure of a chemical compound with its biological effect
• The ultimate goal of QSAR techniques is to identify highly potent compounds
• Interpreting the problem as a global optimization scenario
• Devising an expected improvement (EI) criterion
Agenda / Topics
• Introduction and Motivation
• Methods
• Key Findings/Results
• Summary and Conclusion
Introduction and Motivation
Introduction
• The domain of chemical compounds is very large and grows every day
• Because of their effects on the environment and on humans, their activity should be characterized
• Activity can be described as having or producing an effect on a living organism, tissue or cell
• Three different approaches to evaluate the biological activity of a compound:
1) in vivo
2) in vitro
3) in silico
Continue …
• The principal hope behind “in silico” approaches is to predict the properties of compounds based on specific models and strategies.
• QSAR modeling is an important example of “in silico” methods
QSAR Modeling
Motivation
• The objective of QSAR techniques is, ultimately, the identification or design of highly potent compounds
• Interpreted as a global optimization problem
• Optimization function of interest:
1) maximizes the compound activity
2) minimizes the number of steps needed for evaluation
Continue …
• Build a QSAR model with a machine learning algorithm: Gaussian process regression (GPR)
• Rank compounds w.r.t. their likelihood of being more potent, using expected improvement (EI)
• Compare the performance of this technique against:
1) GP
2) NN
3) RA
Methods
Gaussian Process Regression
• A powerful, non-parametric method for supervised learning
• Problem definition:
Given:
• Data set D consisting of n input vectors x1, x2, …, xn of dimension d
• Corresponding continuous outputs y1, y2, …, yn blurred by normally distributed noise
Wanted:
• Derive a function f: R^d -> R from the given data D
Continue …
• Bayesian inference can be applied to infer f from training data
• By definition, a Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution
• Univariate Gaussian distribution: p(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
Continue …
• A Gaussian process can be considered a generalization of the Gaussian probability distribution to distributions over functions; it is fully determined by a mean function and a covariance function
Continue …
• The mean function m(x) is assumed to be zero in most applications
• A widely used covariance function is the squared exponential: k(x, x') = σ_f² exp(−‖x − x'‖² / (2ℓ²))
Continue …
• The covariance function should be computed for all possible pairs of points
Continue …
• A Gaussian process is a collection of random variables holding a consistent joint Gaussian distribution, here with mean zero
• Since y and y* are jointly Gaussian random vectors, the conditional distribution of y* given y is given by:
Continue …
• Finally, the prediction can be computed via the conditional mean and variance:
  mean(y*) = K(X*, X) [K(X, X) + σ_n² I]⁻¹ y
  var(y*) = K(X*, X*) − K(X*, X) [K(X, X) + σ_n² I]⁻¹ K(X, X*)
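As an illustration, the predictive equations above can be coded directly. This is a minimal sketch with assumed kernel hyper-parameters and a toy 1-d data set, not the thesis implementation:

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance k(x, x') = s_f^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, X_star, noise_var=1e-2, **kern):
    """Posterior mean and variance of a zero-mean GP at the test points X_star."""
    K = sq_exp_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    K_s = sq_exp_kernel(X, X_star, **kern)
    K_ss = sq_exp_kernel(X_star, X_star, **kern)
    L = np.linalg.cholesky(K)                 # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                      # K(X*,X) [K + s_n^2 I]^-1 y
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - (v ** 2).sum(0)     # K(X*,X*) - K(X*,X)[...]^-1 K(X,X*)
    return mean, var

# Toy 1-d example: noisy samples of sin(x)
X = np.linspace(0.0, 5.0, 10)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.default_rng(0).normal(size=10)
mu, var = gp_predict(X, y, X, noise_var=0.05 ** 2)
```

At the training inputs the posterior mean tracks the observations up to the noise level, and the predictive variance shrinks toward the noise variance.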
Continue …
• The hyper-parameters of the covariance function play an important role in the quality of the regression predictions
Comparison of Gaussian Processes with different values of length scale
Continue …
• Following the maximum likelihood principle, the goal is to maximize the log marginal likelihood:
  log p(y|X) = −(1/2) yᵀ K_y⁻¹ y − (1/2) log|K_y| − (n/2) log 2π, with K_y = K + σ_n² I
• From which we can obtain the partial derivative with respect to each hyper-parameter θ_j:
  ∂/∂θ_j log p(y|X) = (1/2) tr((α αᵀ − K_y⁻¹) ∂K_y/∂θ_j), where α = K_y⁻¹ y
Model-Based Global Optimization
• Dealing with global optimization problem is, in general, difficult and expensive
• Often we have no knowledge of the objective function itself, only some available samples of function values (black-box functions)
• One common paradigm for coping with black-box functions is the use of response surfaces
• A response surface is simply a surface fitted to the observed input-output pairs, together with an estimate of the associated uncertainty
EGO Algorithm
• The EGO algorithm, introduced by Jones, Schonlau and Welch in 1998, is a well-known method for solving global optimization problems:
1) Fit response surface (Kriging/GP) to the data
2) Extract the highest EI
3) Evaluate the black box function at the point with highest EI
4) Update the model with the new information
5) Iterate until stopping criterion is reached
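The five steps can be sketched as a short loop. This is a toy illustration, maximizing an assumed stand-in black-box function over a fixed candidate grid with a bare-bones GP surrogate, not a full Kriging implementation:

```python
import numpy as np
from scipy.stats import norm

def black_box(x):                  # expensive function to maximize (toy stand-in)
    return -(x - 2.0) ** 2

def gp_fit_predict(X, y, X_star, ls=1.0, noise=1e-6):
    """Step 1: fit a zero-mean GP response surface and predict at X_star."""
    k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, X_star)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, y_best):
    """Step 2: closed-form EI for a Gaussian prediction."""
    s = np.sqrt(var)
    z = (mean - y_best) / s
    return (mean - y_best) * norm.cdf(z) + s * norm.pdf(z)

X = np.array([0.0, 1.0, 4.0])                 # initial design
y = black_box(X)
cand = np.linspace(0.0, 5.0, 101)             # candidate test points
for _ in range(10):                           # steps 1-5, iterated
    mean, var = gp_fit_predict(X, y, cand)
    x_next = cand[np.argmax(expected_improvement(mean, var, y.max()))]
    X = np.append(X, x_next)                  # steps 3-4: evaluate and update
    y = np.append(y, black_box(x_next))
```

Within a few iterations the sampled points concentrate around the true maximizer x = 2, because EI trades off the predicted mean against the remaining uncertainty.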
The key problem to be addressed
Given
• A training set D = {(x1, y1), (x2, y2), …, (xn, yn)} corresponding to n chemical compounds together with their potency values
Goal:
• Explore the response surface for points that are likely to correspond to the most potent compound
Expected Potency Improvement
• Gaussian process regression copes well with high-dimensional data, and each prediction comes with an uncertainty estimate
• Indeed, we can use it to ask what improvement over the current best sample we expect to obtain by sampling at any test point
Continue …
• Formally, the improvement I(x*) for a compound is defined as I(x*) = max(0, y(x*) − y_max), where y_max is the best potency observed so far
• The expectation value of I(x*), which we name the expected potency improvement (EI), is then
  EI(x*) = (ŷ(x*) − y_max) Φ(z) + σ̂(x*) φ(z), with z = (ŷ(x*) − y_max) / σ̂(x*)
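Under a Gaussian prediction N(μ, σ²), this expectation has a closed form; a minimal sketch (not the thesis code):

```python
import numpy as np
from scipy.stats import norm

def expected_potency_improvement(mu, sigma, y_best):
    """Closed-form E[max(0, y* - y_best)] for a Gaussian prediction N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)     # guard against a zero predicted spread
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```

When the predicted mean is far above the current best, EI approaches μ − y_best; when it is far below, EI approaches zero but never becomes negative.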
ECSGO Algorithm
Modified Variance Estimation
• The prediction variance of GPR model was often underestimated
• Leading to an expected potency improvement close to zero in many cases
• y_NN is the potency of the closest (nearest-neighbor) training compound to x*, based on the Tanimoto coefficient (range [-1/3, 1])
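For real-valued descriptor vectors, the continuous Tanimoto coefficient can be sketched as follows; the [-1/3, 1] range quoted above falls out of the formula:

```python
import numpy as np

def tanimoto(a, b):
    """Continuous Tanimoto coefficient <a,b> / (<a,a> + <b,b> - <a,b>)."""
    num = np.dot(a, b)
    return num / (np.dot(a, a) + np.dot(b, b) - num)
```

Identical vectors give 1, orthogonal vectors give 0, and the minimum of -1/3 is attained for a = -b.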
Different Comparison Methods
• GP (Gaussian Process): The compound with the best predicted potency in test data is chosen.
• NN (Nearest Neighbor): The compound that is nearest to the most potent compound of the training data is chosen. The distance measure is based on the Tanimoto coefficient.
• RA (Random): As its name suggests, this strategy is based on random selection.
Data Normalization
• Data normalization or feature scaling is a common pre-processing task in machine learning algorithms
• It prepares the data before they are fed into the algorithm
• Preparing the data means preventing features with numerically larger values from dominating
• Through normalization, the data are shifted and scaled so that all features hold roughly the same mean (e.g. zero) and standard deviation (e.g. one)
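A minimal sketch of such scaling; note that the test data are transformed with the training statistics to avoid information leakage (the variable names are illustrative):

```python
import numpy as np

def zscore(train, test):
    """Scale features to zero mean / unit std, using training statistics only."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)     # leave constant features unchanged
    return (train - mu) / sd, (test - mu) / sd
```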
Feature Selection
• Feature selection is the task of selecting the most relevant subset of features to be used for the learning algorithm and ignoring the rest
• It is effective for reducing dimensionality, discarding irrelevant and redundant data, and increasing learning accuracy
• There exist mainly three classes of feature selection methods:
• Wrapper methods (candidate feature subsets are scored by the learning algorithm itself)
• Embedded methods (variable selection as part of the learning procedure)
• Filter methods (features are scored independently of the learning algorithm)
Spearman's Rank Correlation
• A non-parametric measure of relatedness between two random variables
• It describes the relationship between two variables using a monotonic function
• Spearman's rank correlation is a statistical measure of the strength of a monotonic correlation between paired ranked data (range [-1, 1])
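Spearman's rho is simply the Pearson correlation of the ranks, e.g.:

```python
import numpy as np
from scipy.stats import rankdata

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks; lies in [-1, 1]."""
    rx, ry = rankdata(x), rankdata(y)
    return np.corrcoef(rx, ry)[0, 1]
```

Any strictly monotonic (even nonlinear) relationship yields |rho| = 1, which is exactly what makes it suitable for potency rankings.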
Multiple Testing Correction
• In a single hypothesis testing, we reject the null hypothesis if computed p-value is less than a significance level
• However, if we test, say, m hypotheses simultaneously, some null hypotheses are likely to be rejected falsely purely by chance
• To address this problem, we used an algorithm proposed by Benjamini and Hochberg (1995) which controls the so-called False Discovery Rate (FDR)
Benjamini-Hochberg Algorithm
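The Benjamini-Hochberg step-up procedure can be sketched as follows (a minimal illustration, not the thesis code):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling the FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m      # alpha * i / m for i = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest i with p_(i) <= alpha*i/m
        reject[order[:k + 1]] = True                  # reject all hypotheses up to rank k
    return reject
```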
SAR Index (SARI)
• SARI is used to investigate the association between compound similarities and the corresponding potencies
• It quantifies the degree of smoothness of the SAR landscape (range [0, 1])
• The continuity score defines the potency-weighted structural diversity of compounds (range [0, 1])
• The discontinuity score reflects the average potency difference among similar compound pairs (range [0, 1])
Wilcoxon Signed-rank Test
• It is designed to evaluate the difference between two distributions in a non-parametric manner
• Hypotheses are stated for this test as follows:
• H0: There is no difference between the two mean ranks (null hypothesis)
• H1: There is a difference between the two mean ranks (alternative hypothesis)
Wilcoxon Signed-rank Test Algorithm
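A quick illustration using SciPy's implementation on synthetic paired samples (the data here are assumed, not from the thesis):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)        # e.g. evaluation-step scores of method A
b = a + rng.normal(0.5, 0.3, 50)    # method B: the same pairs, systematically shifted
stat, p = wilcoxon(a, b)            # tests H0: no shift between the paired samples
```

With a clear systematic shift between the pairs, the test rejects H0 at any common significance level.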
Key Findings/Results
Datasets and Descriptors
• Compound data sets covering 12 human targets
• Collected from the ChEMBL database, version 13
• Each data set was composed of 186 numerical 2-D descriptors computed by the Molecular Operating Environment (MOE), together with potency values
Simulation Set-up
Repeat 25 times:
• Split the data into training and test sets of roughly equal size
• Apply normalization
• Apply feature selection
• Repeat until no test data is left:
  • Apply GPR (for EI & GP)
  • Compute EI values (for EI)
  • Find the compound with maximum EI (for EI)
  • Find the compound with maximum predicted mean (for GP)
  • Find the test compound most similar to the most potent training compound (for NN)
  • Pick a compound at random (for RA)
  • Select a compound according to the compound selection strategy
  • Record the real potency value of the selected compound
  • Add this compound to the training data
  • Remove it from the test data
A Compound with the Best Potency
Average number of evaluation steps to find a compound with the best potency, for each of the 12 data sets and for the different comparison methods
Three Most Potent Compounds
Median number of evaluation steps to find a compound with the best (left), second best (middle) and third best (right) potency over all data sets, for the different comparison methods
Overall Behavior
Median number of evaluation steps to find any of the three most potent compounds over all data sets, for the different comparison methods
Wilcoxon Signed-rank Test
              Significant reduction        Most potent over    Any of three most potent
              (most potent compound)       all data sets       compounds over all data sets
EI vs. NN     6 out of 12 data sets        P = 3.3e-8          P = 9.8e-7
EI vs. GP     4 out of 12 data sets        P = 4.69e-7         P = 0.0042
EI vs. RA     10 out of 12 data sets       P = 2.392e-32       P = 2.591e-68
Frequency of Selected Descriptors
• To check whether the algorithm favors specific descriptors
• We found no dependency of the EI method's performance on the selection of specific descriptors
Influence of Training Set Size
Smaller training sets led to smaller differences in the number of search steps needed to reach the most potent compound
Summary and Conclusion
Summary and Conclusion
• We have introduced a model-based global optimization strategy to identify maximally bioactive compounds
• With the help of the expected potency improvement criterion, the strategy efficiently identified the most potent compounds
• It showed overall better search performance (i.e. fewer evaluation steps) than the nearest-neighbor approach that is often applied in virtual screening
Appendix 1
Emilio Benfenati, Claire Mays, and Simon Pardoe. Theory, Guidance and Applications on QSAR and REACH. ORCHESTRA, 2012.
Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR, abs/1012.2599, 2010.
M. Ebden. Gaussian Processes for Regression: A Quick Introduction. Technical report, Department of Engineering Science, University of Oxford, Aug 2008.
Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13:455-492, 1998.
Thomas M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, USA, 1997.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Yvan Saeys, Inaki Inza, and Pedro Larranaga. A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics, 23(19):2507-2517, September 2007.
Appendix 2
Lisa Peltason and Juergen Bajorath. SAR Index: Quantifying the Nature of Structure-Activity Relationships. Journal of Medicinal Chemistry, 50(23):5571-5578, 2007.
H. Froehlich and A. Zell. Efficient Parameter Selection for Support Vector Machines in Classification and Regression via Model-Based Global Optimization. Proc. Int. Joint Conf. on Neural Networks (IJCNN), 3:1431-1438, 2005.
Rob Womersley. Local and Global Optimization: Formulation, Methods and Applications. Technical report, School of Mathematics and Statistics, University of New South Wales, 2008.
Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. Roy. Statist. Soc. Ser. B (Methodological), 57:289-300, 1995.
Iain Weir. Spearman's Rank Correlation: Introduction. http://www.statstutor.ac.uk/, University of the West of England.
ChEMBL. European Bioinformatics Institute (EBI). http://www.ebi.ac.uk/. Accessed September 18, 2012.
Molecular Operating Environment (MOE). Chemical Computing Group, 1010 Sherbrooke St. W, Suite 910, Montreal, Quebec, Canada H3A 2R7. http://www.chemcomp.com/. Accessed September 18, 2012.