http://www.simbiosys.ca
eHiTS Score
Darryl Reid, Zsolt Zsoldos, Bashir S. Sadjad, Aniko Simon,
The next stage in scoring function evolution: a new statistically derived empirical scoring function.
Overview
● eHiTS_Score: new scoring function that takes advantage of the
temperature factors in PDB files to better capture the interaction
geometries between ligands and receptors.
● An "empirical" function is fitted to represent the statistical
interaction data and trained using experimentally derived binding
affinities
● This novel scoring function has the additional benefit of family
training based on automatic clustering of input receptor structures.
● Very good correlation with known binding affinities (R = 0.75) on a
large and diverse test set of 884 PDB structures
eHiTS Algorithm
● Ligands are divided into rigid fragments and flexible connecting chains
● Rigid Dock: each fragment is docked INDEPENDENTLY everywhere in the receptor
● Pose Match: a fast graph matching algorithm finds all matching solutions to reconstruct the original molecule
● Local Energy Optimization: the structure is optimized within the receptor
● Ranking: structures are ranked based on the scoring function
[Figure: the ligand divided into fragments, and the reconnected ligand pose]
Novel Approach to Scoring
● In PDB files, the given coordinates are derived from space and time averages of the observed positions
● A temperature factor describes the three-dimensional probability density of the displacement of each atom from its specified coordinates (the resonance)
● Therefore, rather than using the PDB coordinates directly, we use these probability functions to create a continuous function for interactions
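A minimal sketch of how a temperature factor can be turned into a positional probability density. It assumes an isotropic B-factor and the standard crystallographic relation B = 8π²⟨u²⟩, with a Gaussian displacement along each axis; the function names are hypothetical and this is not the eHiTS implementation.

```python
import math

def bfactor_to_sigma(b_factor: float) -> float:
    """Per-axis standard deviation (in Å) for an isotropic B-factor (in Å^2).

    Uses the standard relation B = 8 * pi^2 * <u^2>.
    """
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))

def position_density(offset, b_factor: float) -> float:
    """Isotropic 3D Gaussian density at displacement (dx, dy, dz) from the PDB coordinate."""
    sigma = bfactor_to_sigma(b_factor)
    r2 = sum(c * c for c in offset)
    norm = (2.0 * math.pi * sigma ** 2) ** -1.5
    return norm * math.exp(-r2 / (2.0 * sigma ** 2))
```

For example, a B-factor of 20 Å² corresponds to a per-axis standard deviation of roughly 0.5 Å, so an atom with that B-factor is better described as a blurred cloud than as a point.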
Interaction Surface Point (ISP) Types
● Interactions cannot be described by distance (d) alone: the angles (α, β) to the surface points, shown as LP and H, and the torsion (δ) between them must also be considered
[Figure: surface points H and LP with distance d, angles α and β, and torsion δ]
23 surface point types:
● METAL ● CHARGED_HPLUS ● PRIMARY_AMINE_HLP ● HDONOR ● WEAK_HDONOR ● CHARGED_LONEPAIR ● ACID_LONEPAIR ● LONEPAIR
● HYDROPHOB ● H_AROM_EDGE ● WS_LIPO ● NEUTRAL ● PI_AROMATIC ● PI_RESON_POLAR ● PI_RESON_CARBON
● AMBIVALENT_HLP ● ROTATABLE_H ● ROTATABLE_LP ● WEAK_LONEPAIR ● PI_SP2_POLAR ● PI_SP2_CARBON ● HALOGEN ● SULFUR
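The four geometric descriptors (d, α, β, δ) can be computed from 3D coordinates as sketched below. The exact definitions are assumptions made for illustration: d is taken between the two atom centres, α and β are the angles between each atom's surface point direction and the line joining the atoms, and δ is the torsion surface point - atom - atom - surface point. All function names are hypothetical.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def _norm(p):
    return math.sqrt(_dot(p, p))

def angle(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cos_v = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_v))))

def torsion(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for the chain p0-p1-p2-p3."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n2 = _cross(b2, b3)
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))

def descriptors(atom_a, isp_a, atom_b, isp_b):
    """Return (d, alpha, beta, delta) under the assumed definitions above."""
    d = _norm(_sub(atom_b, atom_a))
    alpha = angle(isp_a, atom_a, atom_b)
    beta = angle(isp_b, atom_b, atom_a)
    delta = torsion(isp_a, atom_a, atom_b, isp_b)
    return d, alpha, beta, delta
```

Two atom pairs at the same distance can differ in α, β, or δ, which is why distance alone cannot separate a well-aligned hydrogen bond from a poorly aligned contact.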
Statistically derived empirical scoring function
● Interaction statistics were gathered from 2500 PDB structures (Gold-Astex/PDBbind, high resolution <2.5Å)
● The probability of the geometric descriptors (d,α,β,δ) falling into specific ranges is computed from the temperature factors using volumetric integrals
● The integral values for all observed interactions in the complexes are summed and deposited into a 4D data array
● Four-variable analytic functions are fitted to the 4D data array
● These functions form the terms of the new scoring function
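The deposit step can be illustrated with a simplified sketch. Instead of true volumetric integrals, it assumes each of the four descriptors is an independent 1D Gaussian (a simplification made only for this example) and spreads one observed interaction over the bins of a sparse 4D array. Names and bin edges are hypothetical.

```python
import math
from itertools import product

def bin_probabilities(mean, sigma, edges):
    """Probability that a Gaussian(mean, sigma) value falls in each [edges[i], edges[i+1])."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))
    return [cdf(hi) - cdf(lo) for lo, hi in zip(edges, edges[1:])]

def deposit(grid, means, sigmas, edges_per_axis):
    """Add one observed interaction to a sparse 4D array (dict keyed by bin indices).

    The interaction contributes fractional counts to every bin it may occupy,
    rather than a single count at its nominal position.
    """
    per_axis = [bin_probabilities(m, s, e)
                for m, s, e in zip(means, sigmas, edges_per_axis)]
    for idx in product(*(range(len(p)) for p in per_axis)):
        weight = 1.0
        for axis, i in enumerate(idx):
            weight *= per_axis[axis][i]
        if weight > 0.0:
            grid[idx] = grid.get(idx, 0.0) + weight
```

Calling `deposit` once per observed interaction accumulates a smooth 4D histogram, to which analytic functions can then be fitted.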
Family-based Training
[Diagram: 1420 PDB Complexes → eHiTS Training → eHiTS Scoring Functions]
1. 2500 PDB complexes chosen to represent a wide range of protein families
2. Complexes are clustered automatically into 97 protein families, plus one default, global set
3. eHiTS training utility optimizes scoring functions (weights) for each family
4. Scoring functions for each family are output and used as the default scoring functions of eHiTS
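The per-family workflow in these steps can be sketched as follows. The `fit` callable stands in for the eHiTS training utility and is hypothetical, as are the function names; the minimum family size of 5 is an assumption for illustration.

```python
from collections import defaultdict

def train_per_family(complexes, fit, min_size=5):
    """Fit one weight set per family, plus a global default.

    complexes: list of (family_label, complex_data) pairs.
    fit: hypothetical callable mapping a list of complex_data to a weight set.
    """
    by_family = defaultdict(list)
    for family, data in complexes:
        by_family[family].append(data)
    # The default set is trained on everything and covers unclustered receptors.
    weights = {"default": fit([data for _, data in complexes])}
    for family, members in by_family.items():
        if len(members) >= min_size:
            weights[family] = fit(members)
    return weights

def weights_for(weights, family):
    """Return the family-specific weights, falling back to the global default."""
    return weights.get(family, weights["default"])
```

A receptor that falls outside every trained family is still scored, using the default set.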
Additional scoring terms
The 276 interaction functions are mapped to 6 weighting factors, which are varied during the family-based training. In addition, the weights of the following terms are also optimized on a per-family basis:
● steric clash (quadratic penalty function)
● depth value within binding
● solvation
● family-coverage
● conformational strain energy of the ligand
● intra-molecular interactions within the ligand
● entropy loss due to frozen rotatable bonds
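The figure of 276 interaction functions matches the number of unordered pairs (self-pairs included) that can be formed from the 23 surface point types: 23 × 24 / 2 = 276. Assuming that is where the number comes from, the pairs can be enumerated as below. The ordering of the list is arbitrary.

```python
from itertools import combinations_with_replacement

SURFACE_POINT_TYPES = [
    "METAL", "CHARGED_HPLUS", "PRIMARY_AMINE_HLP", "HDONOR", "WEAK_HDONOR",
    "CHARGED_LONEPAIR", "ACID_LONEPAIR", "LONEPAIR",
    "HYDROPHOB", "H_AROM_EDGE", "WS_LIPO", "NEUTRAL", "PI_AROMATIC",
    "PI_RESON_POLAR", "PI_RESON_CARBON",
    "AMBIVALENT_HLP", "ROTATABLE_H", "ROTATABLE_LP", "WEAK_LONEPAIR",
    "PI_SP2_POLAR", "PI_SP2_CARBON", "HALOGEN", "SULFUR",
]

# One entry per unordered pair of types, including a type paired with itself.
TYPE_PAIRS = list(combinations_with_replacement(SURFACE_POINT_TYPES, 2))
```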
Tuning the component weights
● Goal function combines 4 terms:
– Convergence of local minimisation (funnel shape)
– Solution pose ranking (identify low RMSD as best)
– Correlation to experimental binding energy
– Separation of actives from decoys (enrichment)
● Stochastic (simulated annealing) + Powell engine
● Overfitting test: tune on half, test on the other half
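A toy sketch of the stochastic stage and the half/half overfitting test. The goal is modelled here as a weighted sum of term functions, which is an assumption; the Powell refinement stage is not shown, and all names and default values are hypothetical.

```python
import math
import random

def combined_goal(terms, mix):
    """Build a goal function as a weighted sum of term functions (lower is better)."""
    def goal(weights):
        return sum(m * term(weights) for m, term in zip(mix, terms))
    return goal

def anneal(goal, start, steps=2000, step_size=0.1, t0=1.0, seed=0):
    """Minimise goal(weights) by simulated annealing; returns (best_weights, best_value)."""
    rng = random.Random(seed)
    current, current_val = list(start), goal(start)
    best, best_val = list(current), current_val
    for k in range(steps):
        temperature = t0 * (1.0 - k / steps) + 1e-9  # linear cooling schedule
        candidate = [w + rng.gauss(0.0, step_size) for w in current]
        cand_val = goal(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_val <= current_val or rng.random() < math.exp((current_val - cand_val) / temperature):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = list(current), current_val
    return best, best_val

def split_half(items, seed=0):
    """Randomly split items into a tuning half and a held-out test half."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

Weights tuned on the first half are then evaluated on the second half; a large drop in the goal value on the held-out half would indicate overfitting.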
Results: Docking 1568 complexes
[Figure: percent of complexes (y-axis) versus RMSD from X-ray in Angstroms (x-axis, 0 to 2.5) for the stages RigiDock, Cluster, PoseMatch, DockOptim and TopRank]
- Resolution <= 2.5Å
- 97 protein families (5+)
- 349 singletons
- PDB-bind 2004
- Astex-GOLD validation
Closest average: 0.73Å
TopRank ave.: 1.10Å
Docking accuracy comparison
● eHiTS (far right) docked 59 of the 69 complexes within 1.5Å of the X-ray pose and 67 of 69 within 3.5Å, outperforming the published [1] results of the other 5 docking tools on this set of proteins
[1] Maria Kontoyianni, Laura M. McClellan, and Glenn S. Sokol, Evaluation of Docking Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2004, 47, 558-565.
Correlation to binding affinity
884 PDB complexes: R = 0.75, q = 1.61
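Assuming R denotes the Pearson correlation coefficient between predicted scores and experimental binding affinities (the slide does not define it), it can be computed as in this short sketch. The function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```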
VHTS Filter: eHiTS Filter
The eHiTS Filter is based on ligand surface points. All chemically interesting points on the surface of the ligand are assigned surface point types (SPT), indicated by triangles on the histidine ring shown. Each SPT has associated chemical properties (indicated by its color), such as H-bond donor, H-bond acceptor, hydrophobic, π-stacking, etc. The counts of each of the 23 surface point types form the feature vector for that ligand. The Filter is based on the assumption that ligands with similar feature vectors have similar activity.
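A minimal sketch of building such a feature vector from a list of surface point type labels. The order of the 23 components is an assumption, and the function name is hypothetical.

```python
from collections import Counter

SURFACE_POINT_TYPES = (
    "METAL", "CHARGED_HPLUS", "PRIMARY_AMINE_HLP", "HDONOR", "WEAK_HDONOR",
    "CHARGED_LONEPAIR", "ACID_LONEPAIR", "LONEPAIR",
    "HYDROPHOB", "H_AROM_EDGE", "WS_LIPO", "NEUTRAL", "PI_AROMATIC",
    "PI_RESON_POLAR", "PI_RESON_CARBON",
    "AMBIVALENT_HLP", "ROTATABLE_H", "ROTATABLE_LP", "WEAK_LONEPAIR",
    "PI_SP2_POLAR", "PI_SP2_CARBON", "HALOGEN", "SULFUR",
)

def feature_vector(surface_points):
    """Count how often each of the 23 surface point types occurs on a ligand."""
    counts = Counter(surface_points)
    unknown = set(counts) - set(SURFACE_POINT_TYPES)
    if unknown:
        raise ValueError(f"unknown surface point types: {sorted(unknown)}")
    return [counts[t] for t in SURFACE_POINT_TYPES]
```

For instance, `feature_vector(["HDONOR", "HDONOR", "LONEPAIR"])` has a 2 in the HDONOR slot, a 1 in the LONEPAIR slot, and zeros elsewhere.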
[Diagram: three-stage workflow]
Training eHiTS Filter: Ligand DB → Feature Vectors (active / inactive) → Neural Network → Trained Network file
Screening with eHiTS Filter: Ligands → Feature Vectors + Trained Network file → eHiTS Filter → scores between 0.0000 and 0.9999 → Ranked List
Docking: eHiTS Docking → Score + pose → Re-ranked docked poses
Diversity of Actives and decoys
● For each set of actives, the average feature vector was calculated (represented by the blue star)
● The RMSD from this average feature vector was calculated for each active and decoy. The plot below shows the average RMSD for the actives and the decoys, as well as the maximum RMSD for the actives
● For 15 of the 18 codes, even the maximum RMSD of the actives is less than the average RMSD of the decoys
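The comparison described in these bullets can be sketched as follows, assuming the RMSD is the root of the mean squared difference over the vector components. Function names are hypothetical and the vectors in the usage note are made-up toy values.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rmsd(vector, reference):
    """Root-mean-square deviation between a vector and a reference vector."""
    return math.sqrt(sum((v - r) ** 2 for v, r in zip(vector, reference)) / len(reference))

def diversity_summary(actives, decoys):
    """Return (max RMSD of actives, average RMSD of actives, average RMSD of decoys)."""
    centre = mean_vector(actives)
    active_rmsd = [rmsd(v, centre) for v in actives]
    decoy_rmsd = [rmsd(v, centre) for v in decoys]
    return (max(active_rmsd),
            sum(active_rmsd) / len(active_rmsd),
            sum(decoy_rmsd) / len(decoy_rmsd))
```

With toy actives `[[0, 0], [2, 2]]` and a single decoy `[4, 5]`, the actives lie within RMSD 1.0 of their average vector while the decoy sits at about 3.54, the kind of separation the filter relies on.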
[Figure: RMS deviations from the average feature vector of actives. x-axis: Family Label (18 families); y-axis: RMSD (0 to 4); series: Max RMSD Active, Ave RMSD Active, Ave RMSD decoy]
Enrichment results of eHiTS_Filter
eHiTS_Filter was used to screen datasets of 869 decoys plus a set of actives (5 to 20 per receptor). The results show enrichment across a wide range of receptor families: on average, ~80% of the actives are recovered in the top 10% of the ranked database.
Pham, T.A. and Jain, A.N. Parameter Estimation for Scoring Protein-Ligand Interactions Using Negative Training Data. J. Med. Chem., 2005, 10.1021
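The enrichment measure used here, the fraction of actives recovered in the top x% of the ranked database, can be computed as in this sketch. It assumes a higher filter score means "more likely active"; the function name is hypothetical.

```python
import math

def recovered_fraction(scores, is_active, top_fraction):
    """Fraction of all actives found in the top `top_fraction` of the ranked list.

    scores: one score per ligand (higher is assumed to mean more likely active).
    is_active: one boolean per ligand.
    """
    total_actives = sum(1 for a in is_active if a)
    if total_actives == 0:
        raise ValueError("no actives in the dataset")
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, math.ceil(top_fraction * len(ranked)))
    hits = sum(1 for _, active in ranked[:n_top] if active)
    return hits / total_actives
```

A value of 0.8 at `top_fraction=0.1` corresponds to the "~80% of the actives in the top 10%" figure quoted above.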
[Figure: 20 Codes out of the 29 Surflex set - screened with eHiTS_Filter. Series: top 10%, top 5%, top 2%. Codes: 1ajq, 1bzh, 1c4v, 1e66, 1f4g, 1fjs, 1fmo, 1gj7, 1pro, 1qhc, 1rnt, 2qwg, 2xis, 3pcj, 3std, 4tmn, 7cpa, 7tim, er, tk]
Scoring Function
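Let's define some helper functions:
$$a(x) := P_0 x + P_1 x^2 + P_2 \sqrt{x} + P_3$$
$$b(x) := P_4 x + P_5 x^2 + P_6 \sqrt{x} + P_7$$
$$g(x) := P_8 (x - P_9)$$
$$c(x) := \begin{cases} \cos(g(x)) & \text{if } -\pi < g(x) < \pi \\ -1 & \text{otherwise} \end{cases}$$
$$d(x) := P_{10} x + P_{11} x^2 + P_{12} x^3 + P_{13}\, g(x)\, g(x) + P_{14}\, c(x) + P_{15}$$
$$t(x) := P_{16} x + P_{17} x^2 + P_{18} \sqrt{x} + P_{19}$$
Then the scoring function is:
$$f(\alpha, \beta, \mathrm{dist}, \delta) = a(\alpha) \cdot b(\beta) \cdot d(\mathrm{dist}) \cdot t(\delta)$$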
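The helper functions translate directly into code. The sketch below assumes the 20 parameters P0..P19 are supplied as a list, and that a takes the angle α, b the angle β, d the distance, and t the torsion δ; that assignment of arguments is an assumption. The square-root terms require non-negative inputs. The name `make_pair_score` is hypothetical.

```python
import math

def make_pair_score(P):
    """Build f(alpha, beta, dist, delta) from the 20 fitted parameters P[0..19]."""
    if len(P) != 20:
        raise ValueError("expected 20 parameters")

    def a(x):
        return P[0] * x + P[1] * x ** 2 + P[2] * math.sqrt(x) + P[3]

    def b(x):
        return P[4] * x + P[5] * x ** 2 + P[6] * math.sqrt(x) + P[7]

    def g(x):
        return P[8] * (x - P[9])

    def c(x):
        gx = g(x)
        # Cosine well inside (-pi, pi), clamped to -1 outside it.
        return math.cos(gx) if -math.pi < gx < math.pi else -1.0

    def d(x):
        gx = g(x)
        return (P[10] * x + P[11] * x ** 2 + P[12] * x ** 3
                + P[13] * gx * gx + P[14] * c(x) + P[15])

    def t(x):
        return P[16] * x + P[17] * x ** 2 + P[18] * math.sqrt(x) + P[19]

    def f(alpha, beta, dist, delta):
        return a(alpha) * b(beta) * d(dist) * t(delta)

    return f
```

Under this reading, one such parameter set would be fitted per interaction function, and the product form lets each descriptor modulate the others: a poor angle scales down the contribution of an otherwise ideal distance.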