hit rate 30%
DESCRIPTION
The Silicon Chemist: Using logic-based machine learning and virtual chemistry to design new drugs automatically. Department of Computing, Imperial College, and Department of Life Sciences, Imperial College. Christopher Reynolds and Mike Sternberg. Results. Introduction. - PowerPoint PPT PresentationTRANSCRIPT
Hit rate30%
Fragmentation of molecules
SVILP generates
QSAR rules
Molecular database
Screen
Novel verified hits
Figure 1. Graphic showing the current method of finding matches for ILP derived rules, as used by the INDDEx software, and this project’s new method.
Screen
Modified molecules
Split molecule with reverse
reaction
Modify using all viable reactions
Novel verified hits on synthesisable
molecules
Hit rate?%
Database of virtual reactions
Pre
dict
ed p
Ka
Logic-based drug discovery• Inductive Logic Programming (ILP) is a machine learning technique, which learns human interpretable qualitative rules from chemical knowledge of active drugs (see Figure 2), which relate structure to activity, and can guide the next steps drug design chemistry.
• Using Partial Least-Squares or Support Vector Machines, the rules can then be weighted
• Weighted rules used as a quantitative model of Quantitative Structure-Activity Relationship (QSAR) to predict drug activity.
• This approach combines the human comprehensible rules of ILP with the predictive accuracy of advanced regression techniques.
• This Support Vector Inductive Logic Programming (SVILP) method is being patented.
• INDDEx (Investigational Novel Drug Discovery by Example) is a proprietary virtual screening program, incorporating SVILP, and developed by Equinox Pharma Ltd., an Imperial College spin-off company.
• The QSAR model is then used to screen a database of all purchasable molecules to identify drug leads.
• In a blind test, INDDEx had a hit rate of 30%, predicting around 30 active molecules, each capable of being the start of a new drug series, and each sufficiently novel that it could be patented.
Christopher Reynolds and Mike Sternberg
The Silicon Chemist: Using logic-based machine learning and virtual chemistry to design new drugs automatically.
Introduction
New drug leads are always needed, and virtual screening is now often used to simulate the bioactivity of compounds, in order to search as much of chemical space as possible in a reasonable amount of time. This project takes an existing logic based machine-learning approach to identifying bioactive compounds, and incorporates chemical synthesis rules, to design novel, easily synthesisable, and effective pharmaceutical drugs.
Figure 5. From left to right: wireframe visualisations of the two reactant molecules entered in SMILES format are processed by the SMIRKS esterification reaction, and the resultant product molecule is returned.
Virtual chemistryTo extend the machine learning method currently used in
INDDEx, and move the process from hit to lead discovery, the logic-based rules produced by SVILP will be used to modify promising hits. This new process is shown in Figure 1. This project will concentrate on scanning through the dataset of purchasable molecules that partially fulfil the rules, and then altering the molecules to try and fit the remaining rules using a
active(A):- positive(A, B), Nsp2(A, C), distance(A, B, C, 2.49, 0.5).
Molecule is active if there is positive charge centre and an sp2 nitrogen atom 2.49±0.5Å. apart
active(A):- phenyl(A, B), phenyl(A, C), distance(A, B, C, 4.1, 0.5).
Molecule is active if there are two phenyl rings 4.1±0.5Å apart.
Figure 2. Example ILP QSAR rules.
ResultsFigure 3. A scatter plot showing observed vs. predicted pKa in a cross validation of a PubChem bioassay. This shows that there are no false positives when using an activity cut-off of pKa 7.
database of virtual chemical synthesis reactions to combine these molecules with a database of fragment-like molecules. Through this method, the program will generate focussed libraries of synthetic derivatives around the promising hit molecules, and so explore a far greater section of easily synthetically accessible chemical space.
to hold true up to the top 5% most active compounds provided to INDDEx as examples.
True negatives False negatives
False positives True positives
Department of Computing, Imperial College, and Department of Life Sciences, Imperial College
Figure 4. Receiver Operating Characteristic (ROC) curve, showing fraction of true positives against true negatives retrieved as the discrimination threshold is varied. ROC values for 1%, 5%, 10% and all actives are 0.912, 0.830, 0.814, and 0.892 respectively. This illustrates the sensitivity of active compound detection
Top 1% (6 actives)
Top 5% (25 actives)
Top 10% (47 actives)
Active/Inactive (232 actives)
False positive rate
True
pos
itive
rate
Observed pKa
Observed activity