ann thesis
TRANSCRIPT
UNIVERSITY OF CALIFORNIA,IRVINE
Protein side-chain prediction algorithm using three layer perceptrons
DISSERTATION
submitted in partial satisfaction of the requirementsfor the degree of
MASTER
in Computer Science
by
Ken Nagata
Dissertation Committee:Professor Pierre Baldi, Chair
Professor Eric MjolsnessProfessor Xiaohui Xie
2010
UMI Number: 1476683
All rights reserved
INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.
UMI 1476683
Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against
unauthorized copying under Title 17, United States Code.
ProQuest LLC 789 East Eisenhower Parkway
P.O. Box 1346 Ann Arbor, MI 48106-1346
c⃝ 2010 Ken Nagata
TABLE OF CONTENTS
Page
LIST OF FIGURES iv
LIST OF TABLES v
ACKNOWLEDGMENTS vi
CURRICULUM VITAE vii
ABSTRACT OF THE DISSERTATION viii
1 Introduction 1
2 Side Chain Prediction 32.1 Rotamer Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Energy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 Search methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Dead-end elimination . . . . . . . . . . . . . . . . . . . . . . . 62.3.2 Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3.3 Cyclic Searching . . . . . . . . . . . . . . . . . . . . . . . . . 82.3.4 Integer Linear Programming . . . . . . . . . . . . . . . . . . . 82.3.5 Tree decomposition . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 SCWRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Methods 103.1 Training Energy Function . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Setting Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3 Rotamer Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.4 Removing clashes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.5 Handling fixed atoms . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Results 214.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.3 CPU Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.4 The number of clashes . . . . . . . . . . . . . . . . . . . . . . . . . . 27
ii
5 Conclusion 28
Bibliography 30
iii
LIST OF FIGURES
Page
3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 The relationship of a length of protein and a prediction time . . . . . 26
iv
LIST OF TABLES
Page
4.1 Accuracies: χ1 and χ2 . . . . . . . . . . . . . . . . . . . . . . . . . . 234.2 Accuracies: RMSD and Correct Rotamer . . . . . . . . . . . . . . . . 244.3 Time comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.4 Accuracy and time for a large protein (1RYP) . . . . . . . . . . . . . 254.5 Accuracy and time for a protein with RNA (1FJG) . . . . . . . . . . 274.6 The number of clashes . . . . . . . . . . . . . . . . . . . . . . . . . . 27
v
ACKNOWLEDGMENTS
I would like to thank very sincerely Dr. Pierre Baldi for his guidance and assistancefor my research. I could learn a lot of things by working with him. I also wouldlike to thank Dr. Arlo Randall for his guidance and assistance. I gained much fromdiscussions with him.
I would like to thank my parents for taking care of me, their financial support andgiving me a chance to pursue my master degree at University of California Irvine.
vi
CURRICULUM VITAE
Ken Nagata
EDUCATION
Master of Science in Computer Science 2010University of California, Irvine Irvine, California
Bachelor of Science in Electrical Engineering 2008Waseda University Tokyo, Japan
Exchange student 2007Arizona State University Tempe, Arizona
PROFESSIONAL MEMBERSHIPS
Institute of Electrical and Electronics Engineers (IEEE)
vii
ABSTRACT OF THE DISSERTATION
Protein side-chain prediction algorithm using three layer perceptrons
By
Ken Nagata
Master in Computer Science
University of California, Irvine, 2010
Professor Pierre Baldi, Chair
Accurate protein side-chain conformation prediction is crucial for protein structure
prediction and protein design methods and a variety of algorithms have been devel-
oped for the task; however, faster and more accurate methods are still required. We
have developed a new prediction method, SIDEpro, which begins by setting the ro-
tamer probabilities for each residue from a backbone dependent rotamer library, then
the probabilities are iteratively updated using a three layer perceptron that takes the
distances to the surrounding atoms and their types as input. When the updated ro-
tamer probabilities all converge the method returns the most likely rotamer for each
residue as the prediction. The resulting models frequently contain steric clashes, thus
we developed a procedure to resolve clashes with minimal change to the prediciton.
SCWRL is widely recognized as the best method for rapid side-chain prediction, and
SCWRL4 is the most recent release. SIDEpro was evaluated using the 379 protein
test set used to evaluate SCWRL4. A training set of 252 proteins with less than 25%
sequence identity to any of the test set proteins was used for training. Using the
SCWRL4 test set, SIDEpro accuracy (χ1 85.92%, χ1+2 74.53%) was slightly better
than SCWRL4-FRM (χ1 85.54%, χ1+2 74.09%) and approximately 5 times faster.
SIDEpro was clearly more accurate than SCWRL4-RRM (χ1 84.35%, χ1+2 71.78%)
and approximately 2 times faster. In addition, on a test set of several very large
viii
protein complexes, SIDEpro is approximately 10 times faster than SCWRL4-FRM
and 7 times faster than SCWRL-RRM.
ix
Chapter 1
Introduction
Protein structure prediction is a fundamental problem in computational molecular
biology and predicting the side-chain conformations for a given fixed backbone is an
important subproblem. Side-chain prediction methods are also critical for a protein
engineering and protein design applications, and as a result many algorithms have
been presented for this problem. SCWRL is one of the best methods and is widely
used right now because it is fast, accurate, and convenient to download and run locally
[2][11].
SCWRL finds a combination of rotamers which minimizes an energy function based
on rotamer probabilities and physical/chemical energy terms [2][11]. There are two
types rotamer libraries, backbone-independent and backbone-dependent. Backbone-
independent rotamer libraries contain information on side-chain dihedral angles and
rotamer populations. Backbone-dependent libraries contain the same information as
a function of the backbone dihedral angles ϕ and ψ [3][4][6][5]. SCWRL, and other
most successful prediction methods, utilize the probabilities from backbone-dependent
rotamer libraries [2][11] combined with van der Waals or simple steric energy, elec-
1
trotatics, and hydrogen bond potentials. In addition, many search methods have
been applied to the side-chain prediction problem, such as dead-end elimination[7],
Monte Carlo[12][16], cyclical search [19] or self-consistent mean field optimization [10],
integer programming [9], and graph decomposition [11][20].
This thesis presents the details of SIDEpro, a new method for side-chain prediction.
SIDEpro uses the output of an artificial neural network (ANN), with a three layer-
perceptron architecture, as the energy function to optimize. For prediction, it initially
sets the probability of rotamers for each residue using a backbone dependent rotamer
library [3][4][6][5], Then, it iteratively updates the rotamer probabilities using the
ANN, and sets the most probable rotamers as the predictions when the rotamer
probabilities all converge. SIDEpro also utilizes a post-processing procedure to reduce
steric clashes because the energy function allows severe clashes in some cases.
The benchmark results show that SIDEpro is more accurate than the latest version of
SCWRL, SCWRL4 [11] and the resulting models have fewer steric clashes than those
produced by SCWRL. In addition, SIDEpro is about five times as fast as SCWRL.
Since the speed of SIDEpro only increases linearly with the number of residues, it
can be applied to very large proteins and protein complexes.
In this thesis, Chapter 2 introduces some side chain prediction algorithms presented
before. Chapter 3 explains the details of training and prediction methods by SIDEpro.
Chapter 4 discusses the accuracies and speed of SIDEpro comparing to SCWRL. The
thesis is concluded in Chapter 5.
2
Chapter 2
Side Chain Prediction
Side chain prediction problem is a subproblem of protein structure prediction and
predicts a structure of side chain conformation when backbone structure is given.
The side chain conformation is defined by the one to four dihedral angle(s) which is
called χ angles while the backbone is defined by the two dihedral angles which are
called ψ and ϕ angles. The number of χ angles depended on an amino acid’s type.
Thus, the side chain conformation problems can be seen as a prediction of χ angles
for each residue given ψ and ϕ angles. In this chapter, the previous methods of the
side chain prediction will be discussed.
2.1 Rotamer Library
Since the Protein Data Bank (PDB) has grown, a lot of analyzed protein structures
have been available. [1] From this large amount of data, analyzing side chain confor-
mation has been done to get the distribution of side chain conformation. These side
chain conformations called ”rotamer” and rotamer libraries have been established
3
that show their distribution.
Since those side chain conformations are defined by the dihedral angles, or χ angles,
the rotamer library describes distributions of χ angles. There are two types rotamer li-
braries: backbone-independent and backbone-dependent. The backbone-independent
rotamer libraries contain information on side-chain dihedral angles and rotamer pop-
ulations. That is, this type of library does not contain information about backbone.
The backbone-dependent libraries contain the same information as a function of the
backbone dihedral angles, ϕ and ψ. In other words, the backbone-dependent libraries
show the conditional distribution of χ when ϕ and ψ are given (p(χ|ϕ, ψ)).
For the side chain conformation problem, since the backbone conformation is given,
backbone-dependent rotamer library is often used. To simplify the problem, most
of side chain conformation prediction algorithm assigns rotamer from the library for
each residues such that total energy becomes low. They utilize the probability of
rotamer as a self energy, or energy inside the residue.
2.2 Energy function
The energy calculation is important for the side chain conformation since side chains
are placed such that energy becomes small. Hence, the side chain prediction tries to
find a combination of rotamers which produce a small or minimum energy.
Various energy functions have been applied to improve side chain prediction. The
hydrophobic, electro-statistic or hydrogen bond’s energies are commonly used. Those
energy functions are applied not only for side chain prediction but for other appli-
cations such as molecular dynamics [15][18], docking [8] or other protein structure
prediction problems [17]. The parameters for their energies are sometimes optimized
4
within their training data, or are set to be force fields parameters such as CHARMM
[18] or Amber [15].
To represent a hydrophobic energy or van der Waals energy, the Lennard-Johns po-
tential is sometimes applied[9][10][11]. In general, the Lennard-Johns potential is
given by
ELJ(r) = 4ϵ[(σr
)m−(σr
)n](2.1)
where ϵ is the depth of potential well, σ is the distance at which the potential is zero,
r is a distance between the atoms and m and n are parameters which are often set to
be 12 and 6 respectively.
The function proposed by [2] is sometimes used for the side chain prediction instead
of the Lennard-Johns potential. This function is given by
Esteric(r) =
0, r > Rij
10, r < 0.825Rij
57.273
(1− r
Rij
), 0.825Rij ≤ r ≤ Rij
(2.2)
where r is the distance between the atoms, and Rij is the sum of the hard-sphere
radii for atoms i and j. This function represents repulsive part of the van der Waals
energy in a linear form.
For electro static energy, Coulomb’s low has been applied[12]. From the Coulomb’s
low, potential between two charges q1 and q2 is given by,
Eelec(r) =1
4piϵ0
q1q2ϵr
, (2.3)
5
where ϵ0 is electric constant, ϵ is dielectric constant, and r is a distance between the
charges. The ϵ sometimes set as a function of the distance r.
Other potentials has been applied for the side chain prediction such as hydrogen bond
[11] or disulfide bond [2].
2.3 Search methods
Most of the side chain prediction algorithms try to solve the following problem.
r = argminEtotal(r) (2.4)
where
Etotal(r) =N∑i=1
Eself (ri) +N−1∑i=1
N∑j>i
Epair(ri, rj), (2.5)
r = {r1, r2, . . . , rN}, r is optimal rotamers which produce lowest energy, ri is a
rotamer at residue i, Eself (r) is a self energy of the rotamer r, and Epair(ri, rj) is a
energy between rotamer ri and rj. To solve this, a lot of search methods has been
applied for the side chain prediction. In this section, dead-end elimination[7][9], Monte
Carlo[12][16], cyclical search [19], integer programming [9], and graph decomposition
[11][20] will be briefly introduced.
2.3.1 Dead-end elimination
A dead-end elimination (DEE) is a method to find a global minimum energy by
eliminating rotamers which always cause higher energy than one of the other rotamers.
6
That is, the rotamer si is eliminated if
Eself (si)− Eself (ri) +N∑
j=1,j =i
minrj
[Epair(si, rj)− Epair(ri, rj)] > 0 (2.6)
where ri is the another rotamer. When the equation 2.6 is satisfied, the rotamer si
always has a higher energy than rotamer ri regardless of which rotamer is chosen for
the other side chains. Hence, the rotamer si can be eliminated from the search. This
process can be done efficiently by applying DEE from highest to lowest Eself . [2]
2.3.2 Monte Carlo
Monte Carlo method is a sampling method applying for calculating the integration,
finding the distribution, or optimization. It repeated random sampling to get the re-
sults. For side chain prediction, Metropolis Monte Carlo simulated annealing method
has been applied. In this method, at first, select a position randomly, then one ro-
tamer rnew at the position is selected at random. The new interaction energy Enew is
calculated and if it is lower than previous energy Eold, new rotamer rnew is accepted
as a rotamer at the position. If it is higher than Eold, then rnew is accepted with
the probability exp[(Eold − Enew)/T ] where T is called temperature. If rnew is not
accepted, the rotamer at the position does not change. This process is iterated many
times. When the temperature T is fixed, this method is called a Metropolis method.
Since this method sometimes falls into local minimum, a simulated annealing algo-
rithm is applied. Simulated annealing algorithm changes the value of the temperature
T . At first, T is set to be high, then, decrease it gradually. By doing this, samples
would avoid the local minimum more easily when the temperature is high, and get
samples around the global minimum.
7
2.3.3 Cyclic Searching
Cyclic searching method is one of the greedy method. For each residue, this method
calculates energies for all rotamers and set the rotamer which is lowest energy. Then
iterate this process until they converge. This method would fall into a local minimum
and does not guarantee an optimal solution.
2.3.4 Integer Linear Programming
Integer programming method converts a side chain conformation problem to an inte-
ger linear programming (ILP) problem. This method formulates side chain prediction
problems by the following.
Minimize
E =N∑i=1
Eself (ri)xriri +N−1∑i=1
N∑j>i
Epair(ri, rj)xrirj
subject to ∑ri∈Ri
xriri = 1 for i = 1, . . . , N∑ri∈Ri
xrirj = xriri for i = 1, . . . , N and i = j
xriri , xrirj ∈ [0, 1]
Then this can be solved as ILP.
2.3.5 Tree decomposition
The residue interaction can be seen as graph G by representing each residue by a
vertex and each interaction between two residues by an edge. A tree decomposition
8
maps the graph G into a tree T which is a another graph with no cycle. From
the tree T , searching for the optimal side chain assignment is possible based on the
decomposition.[20]
2.4 SCWRL
SCWRL is one of the best methods for a side chain prediction and is widely used
because it is fast, accurate, and easy to download and run locally. As energy functions,
a newest version of SCWRL, SCWRL4, is using a Lennard-Johns potential with some
modifications as steric energy, hydrogen bond energy and disulfide bond energy. By
utilizing DEE and a tree decomposition for search, it achieves fast and accurate
prediction and guarantees an optimal solution.
For SCWRL4, there are two models for prediction: Rigid Rotamer Model (RRM) and
Flexible Rotamer Model (FRM). RRM only considers the center position of rotamers
(mean of χ angles) , while FRM takes into account not only the center but subrotamers
which are the center of rotamers plus and minus standard deviation of rotamers. By
considering the subrotamers, the prediction accuracies improved because FRM allows
more various positions of side chains. Since the search space for FRM is larger than
that of RRM, the speed of FRM is slower than that of RRM. Hence, FRM is more
accurate but slower while RRM is faster but less accurate model.
9
Chapter 3
Methods
SIDEpro is a new algorithm to predict side chain conformations. SIDEpro is using
trained energy function by three layer percepron and new search method. This chapter
describes the architecture of the artificial neural network for energy functions, how
to train them, how to predict rotamers and the details of the algorithm.
3.1 Training Energy Function
The artificial neural network (ANN) architecture utilized by SIDEpro is a three layer
perceptron. The output of an ANN is a function of input when the weights are given.
The inputs to the ANN are the distances between pairs of atoms, and the output of
the ANN is used as the energy between the atom pairs. The dimension of both input
and output is one. Let d be the distance between two atoms. When the distance
between two atoms is large enough they are not considered to be interacting in our
model, thus we set the energy to zero when the distance is larger than 7 A. Then, the
10
w11
Bias Bias
w1H
w12
w13
w21
w22
w23
w2H
w31
w41
w42
w43
w4H
e=f(d;w) d
Figure 3.1: Network Architecture
energy e is given by
e = f(d;w) =
H∑
h=1
w4hσ(w2hd+ w1h) + w31, d < 7
0, otherwise
(3.1)
where w contains the weights of the ANN, σ is a sigmoid function, and H is the
number of hidden states. The ANN architecture is shown by Figure (3.1). Although
a sigmoid function is usually applied to the output layer, it is not applied in this case
since the output is not bounded from 0 to 1. A separate ANN was trained for each
amino acid type and for each pair of atom types. The set of training proteins was
used to set the weights of the model with the Markov Chain Monte Carlo method. By
using the energy output from the ANN, the total energy for each rotamer is calculated
as follows.
eij =
Ni∑k=1
L∑l=1,l =i
Nl∑n=1
faibikbln(||Xijk −Xlrln||;w) (3.2)
11
where
eij : total energy for rotamer j at residue i
Ni : the number of atoms of residue i
L : the length of amino acids sequence
Xijk : coordinate of k-th atom of rotamer j of residue i
ai : amino acid type of residue i
bik : atom type of k-th atom of residue i
fabc(·) : energy function or output of neural network for amino acid a, atom b
and atom c
Before training, we set all rotamers at the center position (mean of χ angles) of the
correct rotamer which is defined by the following:
ri = argmaxj
p(χi|µij ,σij)pij (3.3)
where χi is a vector of χ angles of residue i from a crystal structure, µij and σij are
vectors of mean and standard deviation of χ angles of rotamer j for residue i from
the rotamer library and pij is a probability of the rotamer from the library. This is
done because this new structure is the one we want to predict.
The probability of rotamer at residue i is defined by the following using energy from
equation (3.2) when the weights of neural network w are given.
P (ri|w) =exp(−Keiri)piri∑Ri
j=1 exp(−Keij)pij(3.4)
12
where Ri is the number of rotamers for residue i and K is constant value. This
equation can be seen as a likelihood function. Hence posterior distribution of w is
given by the following.
P (w|r) =∏i
P (ri|w)P (w) (3.5)
where P (w) is a prior distribution of w. Applying Markov Chain Monte Carlo to
get samples of the weight parameter of three layer perceptrons was done.[13][14] We
assumed that prior distribution of w follows a normal distribution N(0, 1α). Hence,
P (w|α) =4∏
i=1
√αi
2πexp(−αi
2
∑j
w2ij) (3.6)
where α is a vector of hyper parameters and each αi controls the distribution of
wi. Since hyper parameter α should be non-negative value, we assumed each α are
independent and follows gamma distribution such that:
P (α) =4∏
i=1
αk−1i
exp(−αi/θ)
θkΓ(k)(3.7)
k and θ are constants and parameters for gamma distribution. Hence, posterior
distribution of w and α is given by the followings.
P (w|r,α) =
(4∏
i=1
√αi
2π
)exp
(−∑i
αi
2
∑j
w2ij +
∑i
log p(ri|w)
)(3.8)
P (αi|r,w) ∝ αk−1i exp
(−αi
(θ−1 +
1
2
∑j
w2ij
))∝ Gamma
k,(θ−1 +1
2
∑j
w2ij
)−1
(3.9)
13
The distributions of weights w and hyper parameter α are complex and it is impos-
sible to calculate it analytically. To solve this problem, we applied Markov Chain
Monte Carlo method to get samples of posterior distributions of w and α, that is,
p(w,α|r). We used the following two types of sampling methods to get those samples.
1 Sampling α(j) by Gibbs sampling method
2 Sampling w(j) by Metropolis method
By iterating this process, we can get samples of the posterior distributions. Hence,
we can get samples of α(j) by
α(j+1)1 ∼ P (α1|α(j)
2 , α(j)3 , α
(j)4 ,w(j)) (3.10)
α(j+1)2 ∼ P (α2|α(j+1)
1 , α(j)3 , α
(j)4 ,w(j)) (3.11)
α(j+1)3 ∼ P (α3|α(j+1)
1 , α(j+1)2 , α
(j)4 ,w(j)) (3.12)
α(j+1)4 ∼ P (α4|α(j+1)
1 , α(j+1)2 , α
(j+1)3 ,w(j)) (3.13)
Samples of w(j) are given by the following.
1 get candidate sample by
w∗ ∼ N(wj, σ2) (3.14)
2 accept sample w∗ as w(j+1) with probability min[
P (w∗)
P (w(j)), 1]
14
We used the following optimal weight w∗ as a weight parameter for prediction.
w∗ = argmaxw(j)
P (w(j)|r,α) (3.15)
3.2 Setting Neighbors
Before predicting rotamers, SIDEpro finds neighbor’s residue for each residue to re-
duce calculation cost. Since SIDEpro set energy to zero when the distance is larger
than seven as defined by equation (3.1), we can ignore residue pairs if they are far
enough. We defined two types of neighbors for each residue: backbone Nghbb and
side-chain Nghsc. For backbone neighbors, the backbone of residues must be close
enough to cause energy with the residue. On the same way, at least one pair of side
chain of the rotamers must be close enough for side chain neighbors.
At first, we set three types of boxes; boxbb(i) and boxsc(i) for residue i, and boxatom(i, k)
for atom k at residue i. They are defined by the following.
boxbb(i) =
{X| min
k∈BackboneXi1k − 3.5 ≤ X ≤ max
k∈BackboneXi1k + 3.5
}(3.16)
boxsc(i) =
{X| min
j,k∈SidechainXijk − 3.5 ≤ X ≤ max
j,k∈SidechainXijk + 3.5
}(3.17)
boxatom(i, k) =
{X|min
jXijk − 3.5 ≤ X ≤ max
jXijk + 3.5
}(3.18)
From this, we can say boxbb(i), boxsc(i) and boxatom(i, j) contain all backbone atoms
at residue i, all side chain atom coordinates for all rotamers at residue i, and all
15
positions of atom k for all rotamers at residue i respectively. Furthermore, if two
boxes does not intersect, a distance between two atoms from each of them is greater
than 7. Now, we find neighbors by the following. For backbone neighbors Nghbb, if
boxsc(i) and boxbb(j) have intersections for i = j, add {j} to Nghbb(i). For side chain
neighbors Nghbb, check if boxsc(i) and boxsc(j) have intersections. If they do and if
there are at least one intersection between boxatom(i, k) and boxatom(j, l) for all k and
l, add {j} to Nghsc(i). We can get all boxes in O(∑NiR) and define neighbors in
O(L2N2).
3.3 Rotamer Prediction
For prediction, SIDEpro tried to find probability of rotamers for each residue. Before
finding the probabilities, SIDEpro removes rotamers which clash with backbone. We
defined that rotamer clashes with backbone if
Evdw(i, j) =
∑k∈Sidechain
∑l∈Nghbb(i)
∑n∈Backbone
(rbik + rbln − ||Xijk −Xl1n||)2,
||Xijk −Xl1n||rbik + rbln
< 0.67
0, otherwise
(3.19)
is greater than 1.0. rb represents a radius of atom b. This process reduces the size
of search space. Then it sets neighbors as the way discussed above. As next step, it
calculates energy for side chain with backbone Esb and with side chain Ess. Finally,
it calculates probability of rotamers then set to the most probable one as prediction.
Algorithm 1 shows SIDEpro algorithm to calculate probability of rotamers.
16
Algorithm 1 SIDEpro algorithm
Set boxbb, boxsc and boxatom
Set Nghbb and Nghsc
//set qij to be pij and remove clashing rotamersfor all residue i do
for all rotamers at residue j doqij=pijcalculate Evdw(i, j) by equation (3.19)if Evdw(i, j) > 1.0 then
remove rotamer jend if
end forend for//calculate Esb
for all residue i dofor all rotamers j at residue i do
Esb(i, j) =∑
k∈Sidechain
∑l∈Nghbb(i)
∑n∈Backbone
faibikbln(||Xijk −Xl1n||;w∗)
end forend for//calculate Ess
for all residue i dofor all rotamers j at residue i do
for all l ∈ Nghsc(i) dofor all rotamers m at residue l do
Ess(i, j, l,m) =∑
k∈Sidechain
∑n∈Sidechain
faibikbln(||Xijk −Xlmn||;w∗)
end forend for
end forend for
17
//calculate qijwhile qij does not converge do
for all residue i do//update energy of rotamers at residue ifor all rotamers j at residue i do
eij = Esb(i, j) +∑
l∈Nghsc(i)
Rl∑m=1
qlmEss(i, j, l,m)
end for//update probability of rotamers at residue ifor all rotamers j at residue i do
qij =exp(−Keij)pij∑Ri
k=1 exp(−Keik)pikend for
end forend whilefor all residue i do
ri = argmaxj
qij
end for
where
qij : probability for rotamer j at residue i
pij : probability of rotamer j at residue i from the library
ri : predicted rotamer for residue i
18
3.4 Removing clashes
The initial models produced by SIDEpro use only the means of the rotamers, thus it
can be considered a rigid rotamer model. When a rigid rotamer model is used, some
clashes are almost always observed in a compact protein model. In fact, the structures
used for training contain some clashes because the rotamers were set to the center of
the closest rotamer in the library. In addition, since the SIDEpro energy function is
not based on physical or chemical energy, but rather is learned from training data,
the initial models produced by SIDEpro frequently contain relatively high numbers of
steric clashes. To resolve these clashes we developed a post-processing method which
allows some flexibility of the rotamers.
We considered atoms to be clashing if the distance between them was than 0.8 * (sum
of vDW radii). We significantly reduced the number of clashes using the following
protocol. First, each residue is checked for clashes. For each residue that has one or
more clashing atoms, change chi angles to mean of chi µ plus or minus the standard
deviation σ multiplied by n, measure distances of each atoms and find the minimum
distance. Then, set chi angles such that minimum distance is largest among the
combination of µ ± nσ. Repeat this operation until it converges. Then increase
the value of n and repeat the process again. We set n = 0.5, 1.0 and 1.5. This
method reduced the number of clashes with only a minimal effect on accuracy. The
time and accuracy results presented in the next section are calculated only from the
clash-reduced models.
19
3.5 Handling fixed atoms
SIDEpro can fix certain atoms by using an optional parameter in a same manner as
SCWRL. By using this option, not only side chains of proteins but also DNA, RNA
and other types of molecules can be considered to predict the side chain conformations.
Since SIDEpro can only handle C, O, N, H and S atoms, we treated other types of
atoms as C for fixed atoms.
All fixed atoms are treated in a similar way to how backbone atoms are treated.
Hence, SIDEpro builds boxes and neighbors for them, and calculates the energy Esb
in the same way as backbone atoms. The only difference is that they are not included
in the calculation of Evdw which defines clashes between side chains and backbones.
20
Chapter 4
Results
For a side chain prediction, accuracy and time are important. We evaluated and
compared SIDEpro and SCWRL using some evaluation methods. In this chapter, a
detail of training and test dataset, evaluation methods, accuracies, CPU time, and
the number of clashes will be shown.
4.1 Data
We used 252 pdb files as the training data to obtain the ANN weight parameters
and the 379 protein SCWRL4 test set to evaluate SIDEpro. For the training data,
we obtained the list of proteins using PISCES server, at maximum mutual sequence
identity 30%, ≤ 1.8 A resolution, and maximum R-factor of 20 %. Then we randomly
picked up 252 files from the list excluding proteins with more than 25% sequence
identity with one or protein in the test set.
The published SCWRL3 back-bone dependent rotamer library was used for prediction
[2][3][4][6][5].
21
4.2 Accuracy
Four methods were used to evaluate the results. The first two methods are χ1 and
χ1+2, which are widely used methods for evaluation of side-chain prediction algo-
rithms. χ1 within 40 degrees compared to the crystal structure is considered correct
for the χ1 method. For χ1+2, both χ1 and χ2 within 40 degrees is considered correct.
The third evaluation measure is RMSD. There are two ways to calculate RMSD; one
is calculate RMSD for each residue and take average, and another one is calculate
entire RMSD. We used former definition.
The last method is an evaluation method we defined. We called this method ”correct
rotamer”. We compared the correct rotamer defined by
ri = argmaxj
p(χi|µij ,σij) (4.1)
for crystal structure and predicted structure. We considered the side chain confor-
mation is predicted correctly when r are same.
Table 4.1 summarizes the accuracy results for SCWRL4 [11] and SIDEpro. SCWRL4
has options to choose Flexible Rotamer Model (FRM) and Rigid Rotamer Model
(RRM). We performed tests using both settings and found that FRM is consistenly
more accurate but slower than RRM, which agrees with [11]. Thus, the detailed
accuracies presented for SCWRL4 are those calculated using the FRM setting. The
SIDEpro results are slightly better according to all evaluation methods. The accura-
cies for correct rotamer, χ1 and χ1+2 are 66.41%, 85.92% and 74.53% respectively.
22
Table 4.1: Accuracies: χ1 and χ2
Residue typeχ1 χ1+2
SIDEpro SCWRL4 SIDEpro SCWRL4CYS 90.53 89.91 —– —–ASP 84.33 83.57 65.01 65.18GLU 73 72.8 57.58 56.83PHE 96.3 95.42 92.61 92.89HIS 88.96 88.31 77.79 77.31ILE 95.37 95.84 83.06 84.33LYS 79.87 77.3 67.41 63.73LEU 93.5 92.3 86.69 86.33MET 83.96 84.23 73.19 70.85ASN 85.7 84.17 76.76 74.17PRO 83.9 85.73 —– —–GLN 77.89 79.35 60.26 59.67ARG 79.13 78.33 64.78 65.57SER 72.38 70.52 —– —–THR 90.03 90.46 —– —–VAL 92.85 92.92 —– —–TRP 93.29 92.7 82.67 80.26TYR 93.09 95.19 89.13 92.55ALL 85.92 85.54 74.53 74.09
4.3 CPU Time
The speed is an important characteristic of a side chain prediction method. Table 4.3
shows mean, median, and maximum prediction times of both SIDEpro and SCWRL4
on the 379 protein test set. The timing results using both FRM and RRM are
shown for SCWRL4. SIDEpro is approximately five times faster than SCWRL4-
FRM according to the mean, and four times faster for the median. The maximum
time of SIDEpro is about 1/10 that of SCWRL4-FRM.
SIDEpro is approximately 40% faster than SCWRL4-RRM according to the mean,
although the SCWRL4-RRM median is less than that of SIDEpro. The maximum
time of SIDEpro is approximately 1/7 that of SCWRL-RRM.
23
Table 4.2: Accuracies: RMSD and Correct Rotamer
Residue typeRMSD Correct Rotamer
SIDEpro SCWRL4 SIDEpro SCWRL4CYS 0.347 0.354 90.78 90.09ASP 0.865 0.860 58.68 55.69GLU 1.459 1.454 30.96 30.28PHE 1.096 1.134 88.98 88.12HIS 1.198 1.217 52.78 51.94ILE 0.449 0.384 82.84 84.15LYS 1.423 1.507 36.58 32.16LEU 0.528 0.502 86.42 86.22MET 1.042 1.089 54.4 54.08ASN 0.912 0.937 46.94 43.06PRO 0.294 0.216 79.61 81.59GLN 1.472 1.472 31.3 31.08ARG 2.200 2.374 26.13 30.44SER 0.549 0.572 72.61 70.76THR 0.352 0.323 90.16 90.62VAL 0.287 0.268 92.92 92.96TRP 1.158 1.271 81.17 78.63TYR 1.346 1.217 85.42 87.88ALL 0.821 0.824 66.41 65.97
Figure 4.1 shows the relationship between the number of residues in a protein or
protein complex and the prediction time. For SIDEpro, the time increases linearly
according to the number of residues with an R squared value of 0.9205.
For both SCWRL methods, the time generally increases with the number of residues,
but the relationship is much less predictable. The Rsq values for SCWRL4-FRM is
0.5452 and for SCWRL4-RRM it is 0.2097.
To further analyze the prediction times of the methods, we teseted them on a small set
of large protein complexes curated from the Protein Data Bank. The table 4.4 shows
accuracies and time for the protein. From the table, SIDEpro is better for the accuracy
for correct rotamer and χ1 and SCWRL (FRM) is better for χ1+2 and RMSD. For
24
time, SIDEpro is one tenth of SCWRL (FRM) and one seventh of SCWRL (RRM).
Hence, SIDEpro predicted fast with almost same accuracy comparing to SCWRL
(FRM).
We also tested for the protein in complex with RNA. Since both SIDEpro and SCWRL
have an option to fix some atoms such as DNA or RNA, we compared the accura-
cies and time for the protein complex when predicted with and without fixed RNA
structure. The table 4.5 shows the accuracies and time for the protein. From the
table, it is evident that the accuracies are better for all algorithms when predicted
with the RNA structure; however, the prediction times increased. When comparing
the algorithms, the prediction of SIDEpro with the RNA shows the highest accura-
cies for both χ1 and χ1+2. The prediction time for SIDEpro only increased 17.7%
by including the RNA structure while there were 2136.1% and 680.3% increases for
SCWRL(FRM) and SCWRL(RRM) respectively.
The calculations were performed on a machine with an AMD Turion 64 X2 Mobile
Technology TL-60+ at 2.00 GHz, with 3.00 GB of RAM, and running by 32-bit
Microsoft Windows Vista Home Premium Service Pack 2.
Table 4.3: Time comparison
Program Mean(sec) Median(sec) Max(sec)SIDEpro 2.79 2.33 11.312
SCWRL (FRM) 13.23 9.19 108.66SCWRL (RRM) 4.54 1.55 77.63
Table 4.4: Accuracy and time for a large protein (1RYP)
Algorithm Correct Rotamer χ1 χ1+2 RMSD timeSIDEpro 62.61 81.96 71.2 0.921 72.9
SCWRL(FRM) 62.58 81.82 71.47 0.908 781.1SCWRL(RRM) 59.47 80.63 68.55 0.950 525.1
25
0 200 400 600 800 1000 1200
020
4060
8010
0
0 200 400 600 800 1000 1200
020
4060
8010
0
0 200 400 600 800 1000 1200
020
4060
8010
0
Length
Tim
e(se
c)
SIDEproSCWRL4 FRMSCWRL4 RRM
Figure 4.1: The relationship of a length of protein and a prediction time
26
Table 4.5: Accuracy and time for a protein with RNA (1FJG)
Algorithm condition χ1 χ1+2 timeSIDEpro isolated 0.71 0.51 11.42SIDEpro with RNA 0.73 0.53 13.44
SCWRL(FRM) isolated 0.71 0.5 179.78SCWRL(FRM) with RNA 0.72 0.52 4020SCWRL(RRM) isolated 0.7 0.48 102SCWRL(RRM) with RNA 0.71 0.51 795.89
4.4 The number of clashes
Due to using trained energy function, SIDEpro applied removing clashes algorithm
at the end of prediction. Table 4.6 shows the number of clashes for the final SIDEpro
models, the initial SIDEpro models before the clash reduction procedure, SCWRL4-
RRM, and SCWRL-FRM, and crystal structure. When the clash removing protocol
is applied, the SIDEpro models have significantly fewer steric clashes than SCWRL
models.
Table 4.6: The number of clashes
SIDEpro SIDEpro(before removing clahses) SCWRL Crystal12895 39558 18009 2031
27
Chapter 5
Conclusion
When compared to SCWRL4, the side-chain prediction algorithm presented here,
SIDEpro, shows slightly better results in terms of accuracy and significant improve-
ment in terms of speed. The ANN energy functions based on atom-atom distances
are a novel aspect of the overall algorithm presented here and we have shown that
they can be used to help improve the accuracy and speed. Although SIDEpro only
used distance of atoms to calculate energy, it shows better result than SCWRL which
uses more information such as angles among atoms to calculate hydrogen bonding
energy.
Regarding directions for improving the current prediction, it is possible to add more
information to the ANNs by increasing dimensions of input or constructing new net-
works for new information. Flexibility of side chains around rotamer dihedral angle
values and standard bond length and angles are important for this problem. The
more flexible model improved the accuracy. The flexible rotamer model (FRM) is
the most accurate SCWRL predictor. Another possible direction is to build a flexible
rotamer model version of SIDEpro by changing likelihood function. We expect this
28
would increase the accuracy at the expense of speed, as is observed with the SCWRL4
versions.
For training, we applied Markov Chain Monte Carlo. Since we used the weights
which maximize the likelihood, the parameters may have been overfitted to training
data. It is also possible that the the parameters would represent a local maximum.
These issues could be avoided, or at least reduced, by taking the average over many
samples or increasing the size of the training data set. We could also try different
prior distributions of for the parameters and hyper parameters.
29
Bibliography
[1] H. Berman, T. Battistuz, T. Bhat, W. Bluhm, P. Bourne, K. Burkhardt, Z. Feng,G. Gilliland, L. Iype, S. Jain, P. Fagan, J. Marvin, D. Padilla, V. Ravichandran,B. Schneider, N. Thanki, H. Weissig, J. Westbrook, and C. Zardecki. The proteindata bank. Acta Crystallographica. Section D, Biological Crystallography, 58,2002.
[2] A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack, Jr. A graph-theoryalgorithm for rapid protein side-chain prediction. Protein Sci, 12(9), 2003.
[3] R. L. Dunbrack, Jr. Rotamer libraries in the 21st century. Curr. Opin. Struct.Biol., 12, 2002.
[4] R. L. Dunbrack, Jr. and F. E. Cohen. Bayesian statistical analysis of proteinsidechain rotamer preferences. Protein Science, 6, 1997.
[5] R. L. Dunbrack, Jr. and M. Karplus. Backbone-dependent rotamer library forproteins: Application to side-chain prediction. J. Mol. Biol., 230, 1993.
[6] R. L. Dunbrack, Jr. and M. Karplus. Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nature Struct. Biol., 1,1994.
[7] R. Goldstein. Efficient rotamer elimination applied to protein side-chains andrelated spin glasses. Biophys J, 66, 1994.
[8] R. Huey, G. Morris, A. Olson, and D. Goodsell. A semiempirical free energyforce field with charge-based desolvation. Computational Chemistry, 28, 2007.
[9] C. Kingsford, B. Chazelle, and M. Singh. Solving and analyzing side-chain po-sitioning problems using linear and integer programming. Bioinformatics, 21,2005.
[10] P. Koehl and M. Delarue. Application of a self-consistent mean field theoryto predict protein side-chains conformation and estimate their conformationalentropy. J Mol Biol, 239, 1994.
[11] G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack, Jr. Improved predictionof protein side-chain conformations with scwrl4. Protein, 77(4), dec 2009.
30
[12] S. Liang and N. Grishin. Side-chain modeling with an optimized scoring function.Protein Sci, 11, 2002.
[13] D. Muramatsu, M. Kondo, M. Sasaki, S. Tachibana, and T. Matsumoto. Amarkov chain monte carlo algorithm for bayesian dynamic signature verification.IEEE Trans. Information Forensic and Security., 1, 2006.
[14] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.
[15] J. Ponder and D. Case. Force fields for protein simulations. Advances in ProteinChemistry, 66, 2003.
[16] C. A. Rohl, C. E. M. Strauss, D. Chivian, and D. Baker. Modeling structurallyvariable regions in homologous proteins with rosetta. Proteins, 55, 2004.
[17] C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker. Protein structureprediction using rosetta. In L. Brand and M. L. Johnson, editors, NumericalComputer Methods, Part D, volume 383 of Methods in Enzymology, pages 66 –93. Academic Press, 2004.
[18] K. Vanommeslaeghe, E. Hatcher, C. Acharya, S. Kundu, S. Zhong, J. Shim,E. Darian, O. Guvench, P. Lopes, I. Vorobyov, and A. D. Mackerell, Jr. Charmmgeneral force field: A force field for drug-like molecules compatible with thecharmm all-atom additive biological force fields. Journal of ComputationalChemistry, 2010.
[19] Z. Xiang and B. Honig. Extending the accuracy limits of prediction for side-chainconformations. J Mol Biol, 311, 2001.
[20] J. Xu. Rapid protein side-chain packing via tree decomposition. 9th AnnualInternational Conference on Research in Computational Molecular Biology (RE-COMB), 2005.
31