ann thesis

42
UNIVERSITY OF CALIFORNIA, IRVINE Protein side-chain prediction algorithm using three layer perceptrons DISSERTATION submitted in partial satisfaction of the requirements for the degree of MASTER in Computer Science by Ken Nagata Dissertation Committee: Professor Pierre Baldi, Chair Professor Eric Mjolsness Professor Xiaohui Xie 2010

Upload: m-shariful-bhuyan

Post on 08-Mar-2015

44 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ann Thesis

UNIVERSITY OF CALIFORNIA,IRVINE

Protein side-chain prediction algorithm using three layer perceptrons

DISSERTATION

submitted in partial satisfaction of the requirementsfor the degree of

MASTER

in Computer Science

by

Ken Nagata

Dissertation Committee:Professor Pierre Baldi, Chair

Professor Eric MjolsnessProfessor Xiaohui Xie

2010

Page 2: Ann Thesis

UMI Number: 1476683

All rights reserved

INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted.

In the unlikely event that the author did not send a complete manuscript

and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI 1476683

Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against

unauthorized copying under Title 17, United States Code.

ProQuest LLC 789 East Eisenhower Parkway

P.O. Box 1346 Ann Arbor, MI 48106-1346

Page 3: Ann Thesis

c⃝ 2010 Ken Nagata

Page 4: Ann Thesis

TABLE OF CONTENTS

Page

LIST OF FIGURES iv

LIST OF TABLES v

ACKNOWLEDGMENTS vi

CURRICULUM VITAE vii

ABSTRACT OF THE DISSERTATION viii

1 Introduction 1

2 Side Chain Prediction 32.1 Rotamer Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Energy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 Search methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3.1 Dead-end elimination . . . . . . . . . . . . . . . . . . . . . . . 62.3.2 Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3.3 Cyclic Searching . . . . . . . . . . . . . . . . . . . . . . . . . 82.3.4 Integer Linear Programming . . . . . . . . . . . . . . . . . . . 82.3.5 Tree decomposition . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 SCWRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Methods 103.1 Training Energy Function . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Setting Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3 Rotamer Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.4 Removing clashes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.5 Handling fixed atoms . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Results 214.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.3 CPU Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.4 The number of clashes . . . . . . . . . . . . . . . . . . . . . . . . . . 27

ii

Page 5: Ann Thesis

5 Conclusion 28

Bibliography 30

iii

Page 6: Ann Thesis

LIST OF FIGURES

Page

3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 The relationship of a length of protein and a prediction time . . . . . 26

iv

Page 7: Ann Thesis

LIST OF TABLES

Page

4.1 Accuracies: χ1 and χ2 . . . . . . . . . . . . . . . . . . . . . . . . . . 234.2 Accuracies: RMSD and Correct Rotamer . . . . . . . . . . . . . . . . 244.3 Time comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.4 Accuracy and time for a large protein (1RYP) . . . . . . . . . . . . . 254.5 Accuracy and time for a protein with RNA (1FJG) . . . . . . . . . . 274.6 The number of clashes . . . . . . . . . . . . . . . . . . . . . . . . . . 27

v

Page 8: Ann Thesis

ACKNOWLEDGMENTS

I would like to thank very sincerely Dr. Pierre Baldi for his guidance and assistancefor my research. I could learn a lot of things by working with him. I also wouldlike to thank Dr. Arlo Randall for his guidance and assistance. I gained much fromdiscussions with him.

I would like to thank my parents for taking care of me, their financial support andgiving me a chance to pursue my master degree at University of California Irvine.

vi

Page 9: Ann Thesis

CURRICULUM VITAE

Ken Nagata

EDUCATION

Master of Science in Computer Science 2010University of California, Irvine Irvine, California

Bachelor of Science in Electrical Engineering 2008Waseda University Tokyo, Japan

Exchange student 2007Arizona State University Tempe, Arizona

PROFESSIONAL MEMBERSHIPS

Institute of Electrical and Electronics Engineers (IEEE)

vii

Page 10: Ann Thesis

ABSTRACT OF THE DISSERTATION

Protein side-chain prediction algorithm using three layer perceptrons

By

Ken Nagata

Master in Computer Science

University of California, Irvine, 2010

Professor Pierre Baldi, Chair

Accurate protein side-chain conformation prediction is crucial for protein structure

prediction and protein design methods and a variety of algorithms have been devel-

oped for the task; however, faster and more accurate methods are still required. We

have developed a new prediction method, SIDEpro, which begins by setting the ro-

tamer probabilities for each residue from a backbone dependent rotamer library, then

the probabilities are iteratively updated using a three layer perceptron that takes the

distances to the surrounding atoms and their types as input. When the updated ro-

tamer probabilities all converge the method returns the most likely rotamer for each

residue as the prediction. The resulting models frequently contain steric clashes, thus

we developed a procedure to resolve clashes with minimal change to the prediciton.

SCWRL is widely recognized as the best method for rapid side-chain prediction, and

SCWRL4 is the most recent release. SIDEpro was evaluated using the 379 protein

test set used to evaluate SCWRL4. A training set of 252 proteins with less than 25%

sequence identity to any of the test set proteins was used for training. Using the

SCWRL4 test set, SIDEpro accuracy (χ1 85.92%, χ1+2 74.53%) was slightly better

than SCWRL4-FRM (χ1 85.54%, χ1+2 74.09%) and approximately 5 times faster.

SIDEpro was clearly more accurate than SCWRL4-RRM (χ1 84.35%, χ1+2 71.78%)

and approximately 2 times faster. In addition, on a test set of several very large

viii

Page 11: Ann Thesis

protein complexes, SIDEpro is approximately 10 times faster than SCWRL4-FRM

and 7 times faster than SCWRL-RRM.

ix

Page 12: Ann Thesis

Chapter 1

Introduction

Protein structure prediction is a fundamental problem in computational molecular

biology and predicting the side-chain conformations for a given fixed backbone is an

important subproblem. Side-chain prediction methods are also critical for a protein

engineering and protein design applications, and as a result many algorithms have

been presented for this problem. SCWRL is one of the best methods and is widely

used right now because it is fast, accurate, and convenient to download and run locally

[2][11].

SCWRL finds a combination of rotamers which minimizes an energy function based

on rotamer probabilities and physical/chemical energy terms [2][11]. There are two

types rotamer libraries, backbone-independent and backbone-dependent. Backbone-

independent rotamer libraries contain information on side-chain dihedral angles and

rotamer populations. Backbone-dependent libraries contain the same information as

a function of the backbone dihedral angles ϕ and ψ [3][4][6][5]. SCWRL, and other

most successful prediction methods, utilize the probabilities from backbone-dependent

rotamer libraries [2][11] combined with van der Waals or simple steric energy, elec-

1

Page 13: Ann Thesis

trotatics, and hydrogen bond potentials. In addition, many search methods have

been applied to the side-chain prediction problem, such as dead-end elimination[7],

Monte Carlo[12][16], cyclical search [19] or self-consistent mean field optimization [10],

integer programming [9], and graph decomposition [11][20].

This thesis presents the details of SIDEpro, a new method for side-chain prediction.

SIDEpro uses the output of an artificial neural network (ANN), with a three layer-

perceptron architecture, as the energy function to optimize. For prediction, it initially

sets the probability of rotamers for each residue using a backbone dependent rotamer

library [3][4][6][5], Then, it iteratively updates the rotamer probabilities using the

ANN, and sets the most probable rotamers as the predictions when the rotamer

probabilities all converge. SIDEpro also utilizes a post-processing procedure to reduce

steric clashes because the energy function allows severe clashes in some cases.

The benchmark results show that SIDEpro is more accurate than the latest version of

SCWRL, SCWRL4 [11] and the resulting models have fewer steric clashes than those

produced by SCWRL. In addition, SIDEpro is about five times as fast as SCWRL.

Since the speed of SIDEpro only increases linearly with the number of residues, it

can be applied to very large proteins and protein complexes.

In this thesis, Chapter 2 introduces some side chain prediction algorithms presented

before. Chapter 3 explains the details of training and prediction methods by SIDEpro.

Chapter 4 discusses the accuracies and speed of SIDEpro comparing to SCWRL. The

thesis is concluded in Chapter 5.

2

Page 14: Ann Thesis

Chapter 2

Side Chain Prediction

Side chain prediction problem is a subproblem of protein structure prediction and

predicts a structure of side chain conformation when backbone structure is given.

The side chain conformation is defined by the one to four dihedral angle(s) which is

called χ angles while the backbone is defined by the two dihedral angles which are

called ψ and ϕ angles. The number of χ angles depended on an amino acid’s type.

Thus, the side chain conformation problems can be seen as a prediction of χ angles

for each residue given ψ and ϕ angles. In this chapter, the previous methods of the

side chain prediction will be discussed.

2.1 Rotamer Library

Since the Protein Data Bank (PDB) has grown, a lot of analyzed protein structures

have been available. [1] From this large amount of data, analyzing side chain confor-

mation has been done to get the distribution of side chain conformation. These side

chain conformations called ”rotamer” and rotamer libraries have been established

3

Page 15: Ann Thesis

that show their distribution.

Since those side chain conformations are defined by the dihedral angles, or χ angles,

the rotamer library describes distributions of χ angles. There are two types rotamer li-

braries: backbone-independent and backbone-dependent. The backbone-independent

rotamer libraries contain information on side-chain dihedral angles and rotamer pop-

ulations. That is, this type of library does not contain information about backbone.

The backbone-dependent libraries contain the same information as a function of the

backbone dihedral angles, ϕ and ψ. In other words, the backbone-dependent libraries

show the conditional distribution of χ when ϕ and ψ are given (p(χ|ϕ, ψ)).

For the side chain conformation problem, since the backbone conformation is given,

backbone-dependent rotamer library is often used. To simplify the problem, most

of side chain conformation prediction algorithm assigns rotamer from the library for

each residues such that total energy becomes low. They utilize the probability of

rotamer as a self energy, or energy inside the residue.

2.2 Energy function

The energy calculation is important for the side chain conformation since side chains

are placed such that energy becomes small. Hence, the side chain prediction tries to

find a combination of rotamers which produce a small or minimum energy.

Various energy functions have been applied to improve side chain prediction. The

hydrophobic, electro-statistic or hydrogen bond’s energies are commonly used. Those

energy functions are applied not only for side chain prediction but for other appli-

cations such as molecular dynamics [15][18], docking [8] or other protein structure

prediction problems [17]. The parameters for their energies are sometimes optimized

4

Page 16: Ann Thesis

within their training data, or are set to be force fields parameters such as CHARMM

[18] or Amber [15].

To represent a hydrophobic energy or van der Waals energy, the Lennard-Johns po-

tential is sometimes applied[9][10][11]. In general, the Lennard-Johns potential is

given by

ELJ(r) = 4ϵ[(σr

)m−(σr

)n](2.1)

where ϵ is the depth of potential well, σ is the distance at which the potential is zero,

r is a distance between the atoms and m and n are parameters which are often set to

be 12 and 6 respectively.

The function proposed by [2] is sometimes used for the side chain prediction instead

of the Lennard-Johns potential. This function is given by

Esteric(r) =

0, r > Rij

10, r < 0.825Rij

57.273

(1− r

Rij

), 0.825Rij ≤ r ≤ Rij

(2.2)

where r is the distance between the atoms, and Rij is the sum of the hard-sphere

radii for atoms i and j. This function represents repulsive part of the van der Waals

energy in a linear form.

For electro static energy, Coulomb’s low has been applied[12]. From the Coulomb’s

low, potential between two charges q1 and q2 is given by,

Eelec(r) =1

4piϵ0

q1q2ϵr

, (2.3)

5

Page 17: Ann Thesis

where ϵ0 is electric constant, ϵ is dielectric constant, and r is a distance between the

charges. The ϵ sometimes set as a function of the distance r.

Other potentials has been applied for the side chain prediction such as hydrogen bond

[11] or disulfide bond [2].

2.3 Search methods

Most of the side chain prediction algorithms try to solve the following problem.

r = argminEtotal(r) (2.4)

where

Etotal(r) =N∑i=1

Eself (ri) +N−1∑i=1

N∑j>i

Epair(ri, rj), (2.5)

r = {r1, r2, . . . , rN}, r is optimal rotamers which produce lowest energy, ri is a

rotamer at residue i, Eself (r) is a self energy of the rotamer r, and Epair(ri, rj) is a

energy between rotamer ri and rj. To solve this, a lot of search methods has been

applied for the side chain prediction. In this section, dead-end elimination[7][9], Monte

Carlo[12][16], cyclical search [19], integer programming [9], and graph decomposition

[11][20] will be briefly introduced.

2.3.1 Dead-end elimination

A dead-end elimination (DEE) is a method to find a global minimum energy by

eliminating rotamers which always cause higher energy than one of the other rotamers.

6

Page 18: Ann Thesis

That is, the rotamer si is eliminated if

Eself (si)− Eself (ri) +N∑

j=1,j =i

minrj

[Epair(si, rj)− Epair(ri, rj)] > 0 (2.6)

where ri is the another rotamer. When the equation 2.6 is satisfied, the rotamer si

always has a higher energy than rotamer ri regardless of which rotamer is chosen for

the other side chains. Hence, the rotamer si can be eliminated from the search. This

process can be done efficiently by applying DEE from highest to lowest Eself . [2]

2.3.2 Monte Carlo

Monte Carlo method is a sampling method applying for calculating the integration,

finding the distribution, or optimization. It repeated random sampling to get the re-

sults. For side chain prediction, Metropolis Monte Carlo simulated annealing method

has been applied. In this method, at first, select a position randomly, then one ro-

tamer rnew at the position is selected at random. The new interaction energy Enew is

calculated and if it is lower than previous energy Eold, new rotamer rnew is accepted

as a rotamer at the position. If it is higher than Eold, then rnew is accepted with

the probability exp[(Eold − Enew)/T ] where T is called temperature. If rnew is not

accepted, the rotamer at the position does not change. This process is iterated many

times. When the temperature T is fixed, this method is called a Metropolis method.

Since this method sometimes falls into local minimum, a simulated annealing algo-

rithm is applied. Simulated annealing algorithm changes the value of the temperature

T . At first, T is set to be high, then, decrease it gradually. By doing this, samples

would avoid the local minimum more easily when the temperature is high, and get

samples around the global minimum.

7

Page 19: Ann Thesis

2.3.3 Cyclic Searching

Cyclic searching method is one of the greedy method. For each residue, this method

calculates energies for all rotamers and set the rotamer which is lowest energy. Then

iterate this process until they converge. This method would fall into a local minimum

and does not guarantee an optimal solution.

2.3.4 Integer Linear Programming

Integer programming method converts a side chain conformation problem to an inte-

ger linear programming (ILP) problem. This method formulates side chain prediction

problems by the following.

Minimize

E =N∑i=1

Eself (ri)xriri +N−1∑i=1

N∑j>i

Epair(ri, rj)xrirj

subject to ∑ri∈Ri

xriri = 1 for i = 1, . . . , N∑ri∈Ri

xrirj = xriri for i = 1, . . . , N and i = j

xriri , xrirj ∈ [0, 1]

Then this can be solved as ILP.

2.3.5 Tree decomposition

The residue interaction can be seen as graph G by representing each residue by a

vertex and each interaction between two residues by an edge. A tree decomposition

8

Page 20: Ann Thesis

maps the graph G into a tree T which is a another graph with no cycle. From

the tree T , searching for the optimal side chain assignment is possible based on the

decomposition.[20]

2.4 SCWRL

SCWRL is one of the best methods for a side chain prediction and is widely used

because it is fast, accurate, and easy to download and run locally. As energy functions,

a newest version of SCWRL, SCWRL4, is using a Lennard-Johns potential with some

modifications as steric energy, hydrogen bond energy and disulfide bond energy. By

utilizing DEE and a tree decomposition for search, it achieves fast and accurate

prediction and guarantees an optimal solution.

For SCWRL4, there are two models for prediction: Rigid Rotamer Model (RRM) and

Flexible Rotamer Model (FRM). RRM only considers the center position of rotamers

(mean of χ angles) , while FRM takes into account not only the center but subrotamers

which are the center of rotamers plus and minus standard deviation of rotamers. By

considering the subrotamers, the prediction accuracies improved because FRM allows

more various positions of side chains. Since the search space for FRM is larger than

that of RRM, the speed of FRM is slower than that of RRM. Hence, FRM is more

accurate but slower while RRM is faster but less accurate model.

9

Page 21: Ann Thesis

Chapter 3

Methods

SIDEpro is a new algorithm to predict side chain conformations. SIDEpro is using

trained energy function by three layer percepron and new search method. This chapter

describes the architecture of the artificial neural network for energy functions, how

to train them, how to predict rotamers and the details of the algorithm.

3.1 Training Energy Function

The artificial neural network (ANN) architecture utilized by SIDEpro is a three layer

perceptron. The output of an ANN is a function of input when the weights are given.

The inputs to the ANN are the distances between pairs of atoms, and the output of

the ANN is used as the energy between the atom pairs. The dimension of both input

and output is one. Let d be the distance between two atoms. When the distance

between two atoms is large enough they are not considered to be interacting in our

model, thus we set the energy to zero when the distance is larger than 7 A. Then, the

10

Page 22: Ann Thesis

w11

Bias Bias

w1H

w12

w13

w21

w22

w23

w2H

w31

w41

w42

w43

w4H

e=f(d;w) d

Figure 3.1: Network Architecture

energy e is given by

e = f(d;w) =

H∑

h=1

w4hσ(w2hd+ w1h) + w31, d < 7

0, otherwise

(3.1)

where w contains the weights of the ANN, σ is a sigmoid function, and H is the

number of hidden states. The ANN architecture is shown by Figure (3.1). Although

a sigmoid function is usually applied to the output layer, it is not applied in this case

since the output is not bounded from 0 to 1. A separate ANN was trained for each

amino acid type and for each pair of atom types. The set of training proteins was

used to set the weights of the model with the Markov Chain Monte Carlo method. By

using the energy output from the ANN, the total energy for each rotamer is calculated

as follows.

eij =

Ni∑k=1

L∑l=1,l =i

Nl∑n=1

faibikbln(||Xijk −Xlrln||;w) (3.2)

11

Page 23: Ann Thesis

where

eij : total energy for rotamer j at residue i

Ni : the number of atoms of residue i

L : the length of amino acids sequence

Xijk : coordinate of k-th atom of rotamer j of residue i

ai : amino acid type of residue i

bik : atom type of k-th atom of residue i

fabc(·) : energy function or output of neural network for amino acid a, atom b

and atom c

Before training, we set all rotamers at the center position (mean of χ angles) of the

correct rotamer which is defined by the following:

ri = argmaxj

p(χi|µij ,σij)pij (3.3)

where χi is a vector of χ angles of residue i from a crystal structure, µij and σij are

vectors of mean and standard deviation of χ angles of rotamer j for residue i from

the rotamer library and pij is a probability of the rotamer from the library. This is

done because this new structure is the one we want to predict.

The probability of rotamer at residue i is defined by the following using energy from

equation (3.2) when the weights of neural network w are given.

P (ri|w) =exp(−Keiri)piri∑Ri

j=1 exp(−Keij)pij(3.4)

12

Page 24: Ann Thesis

where Ri is the number of rotamers for residue i and K is constant value. This

equation can be seen as a likelihood function. Hence posterior distribution of w is

given by the following.

P (w|r) =∏i

P (ri|w)P (w) (3.5)

where P (w) is a prior distribution of w. Applying Markov Chain Monte Carlo to

get samples of the weight parameter of three layer perceptrons was done.[13][14] We

assumed that prior distribution of w follows a normal distribution N(0, 1α). Hence,

P (w|α) =4∏

i=1

√αi

2πexp(−αi

2

∑j

w2ij) (3.6)

where α is a vector of hyper parameters and each αi controls the distribution of

wi. Since hyper parameter α should be non-negative value, we assumed each α are

independent and follows gamma distribution such that:

P (α) =4∏

i=1

αk−1i

exp(−αi/θ)

θkΓ(k)(3.7)

k and θ are constants and parameters for gamma distribution. Hence, posterior

distribution of w and α is given by the followings.

P (w|r,α) =

(4∏

i=1

√αi

)exp

(−∑i

αi

2

∑j

w2ij +

∑i

log p(ri|w)

)(3.8)

P (αi|r,w) ∝ αk−1i exp

(−αi

(θ−1 +

1

2

∑j

w2ij

))∝ Gamma

k,(θ−1 +1

2

∑j

w2ij

)−1

(3.9)

13

Page 25: Ann Thesis

The distributions of weights w and hyper parameter α are complex and it is impos-

sible to calculate it analytically. To solve this problem, we applied Markov Chain

Monte Carlo method to get samples of posterior distributions of w and α, that is,

p(w,α|r). We used the following two types of sampling methods to get those samples.

1 Sampling α(j) by Gibbs sampling method

2 Sampling w(j) by Metropolis method

By iterating this process, we can get samples of the posterior distributions. Hence,

we can get samples of α(j) by

α(j+1)1 ∼ P (α1|α(j)

2 , α(j)3 , α

(j)4 ,w(j)) (3.10)

α(j+1)2 ∼ P (α2|α(j+1)

1 , α(j)3 , α

(j)4 ,w(j)) (3.11)

α(j+1)3 ∼ P (α3|α(j+1)

1 , α(j+1)2 , α

(j)4 ,w(j)) (3.12)

α(j+1)4 ∼ P (α4|α(j+1)

1 , α(j+1)2 , α

(j+1)3 ,w(j)) (3.13)

Samples of w(j) are given by the following.

1 get candidate sample by

w∗ ∼ N(wj, σ2) (3.14)

2 accept sample w∗ as w(j+1) with probability min[

P (w∗)

P (w(j)), 1]

14

Page 26: Ann Thesis

We used the following optimal weight w∗ as a weight parameter for prediction.

w∗ = argmaxw(j)

P (w(j)|r,α) (3.15)

3.2 Setting Neighbors

Before predicting rotamers, SIDEpro finds neighbor’s residue for each residue to re-

duce calculation cost. Since SIDEpro set energy to zero when the distance is larger

than seven as defined by equation (3.1), we can ignore residue pairs if they are far

enough. We defined two types of neighbors for each residue: backbone Nghbb and

side-chain Nghsc. For backbone neighbors, the backbone of residues must be close

enough to cause energy with the residue. On the same way, at least one pair of side

chain of the rotamers must be close enough for side chain neighbors.

At first, we set three types of boxes; boxbb(i) and boxsc(i) for residue i, and boxatom(i, k)

for atom k at residue i. They are defined by the following.

boxbb(i) =

{X| min

k∈BackboneXi1k − 3.5 ≤ X ≤ max

k∈BackboneXi1k + 3.5

}(3.16)

boxsc(i) =

{X| min

j,k∈SidechainXijk − 3.5 ≤ X ≤ max

j,k∈SidechainXijk + 3.5

}(3.17)

boxatom(i, k) =

{X|min

jXijk − 3.5 ≤ X ≤ max

jXijk + 3.5

}(3.18)

From this, we can say boxbb(i), boxsc(i) and boxatom(i, j) contain all backbone atoms

at residue i, all side chain atom coordinates for all rotamers at residue i, and all

15

Page 27: Ann Thesis

positions of atom k for all rotamers at residue i respectively. Furthermore, if two

boxes does not intersect, a distance between two atoms from each of them is greater

than 7. Now, we find neighbors by the following. For backbone neighbors Nghbb, if

boxsc(i) and boxbb(j) have intersections for i = j, add {j} to Nghbb(i). For side chain

neighbors Nghbb, check if boxsc(i) and boxsc(j) have intersections. If they do and if

there are at least one intersection between boxatom(i, k) and boxatom(j, l) for all k and

l, add {j} to Nghsc(i). We can get all boxes in O(∑NiR) and define neighbors in

O(L2N2).

3.3 Rotamer Prediction

For prediction, SIDEpro tried to find probability of rotamers for each residue. Before

finding the probabilities, SIDEpro removes rotamers which clash with backbone. We

defined that rotamer clashes with backbone if

Evdw(i, j) =

∑k∈Sidechain

∑l∈Nghbb(i)

∑n∈Backbone

(rbik + rbln − ||Xijk −Xl1n||)2,

||Xijk −Xl1n||rbik + rbln

< 0.67

0, otherwise

(3.19)

is greater than 1.0. rb represents a radius of atom b. This process reduces the size

of search space. Then it sets neighbors as the way discussed above. As next step, it

calculates energy for side chain with backbone Esb and with side chain Ess. Finally,

it calculates probability of rotamers then set to the most probable one as prediction.

Algorithm 1 shows SIDEpro algorithm to calculate probability of rotamers.

16

Page 28: Ann Thesis

Algorithm 1 SIDEpro algorithm

Set boxbb, boxsc and boxatom

Set Nghbb and Nghsc

//set qij to be pij and remove clashing rotamersfor all residue i do

for all rotamers at residue j doqij=pijcalculate Evdw(i, j) by equation (3.19)if Evdw(i, j) > 1.0 then

remove rotamer jend if

end forend for//calculate Esb

for all residue i dofor all rotamers j at residue i do

Esb(i, j) =∑

k∈Sidechain

∑l∈Nghbb(i)

∑n∈Backbone

faibikbln(||Xijk −Xl1n||;w∗)

end forend for//calculate Ess

for all residue i dofor all rotamers j at residue i do

for all l ∈ Nghsc(i) dofor all rotamers m at residue l do

Ess(i, j, l,m) =∑

k∈Sidechain

∑n∈Sidechain

faibikbln(||Xijk −Xlmn||;w∗)

end forend for

end forend for

17

Page 29: Ann Thesis

//calculate qijwhile qij does not converge do

for all residue i do//update energy of rotamers at residue ifor all rotamers j at residue i do

eij = Esb(i, j) +∑

l∈Nghsc(i)

Rl∑m=1

qlmEss(i, j, l,m)

end for//update probability of rotamers at residue ifor all rotamers j at residue i do

qij =exp(−Keij)pij∑Ri

k=1 exp(−Keik)pikend for

end forend whilefor all residue i do

ri = argmaxj

qij

end for

where

qij : probability for rotamer j at residue i

pij : probability of rotamer j at residue i from the library

ri : predicted rotamer for residue i

18

Page 30: Ann Thesis

3.4 Removing clashes

The initial models produced by SIDEpro use only the means of the rotamers, thus it

can be considered a rigid rotamer model. When a rigid rotamer model is used, some

clashes are almost always observed in a compact protein model. In fact, the structures

used for training contain some clashes because the rotamers were set to the center of

the closest rotamer in the library. In addition, since the SIDEpro energy function is

not based on physical or chemical energy, but rather is learned from training data,

the initial models produced by SIDEpro frequently contain relatively high numbers of

steric clashes. To resolve these clashes we developed a post-processing method which

allows some flexibility of the rotamers.

We considered atoms to be clashing if the distance between them was than 0.8 * (sum

of vDW radii). We significantly reduced the number of clashes using the following

protocol. First, each residue is checked for clashes. For each residue that has one or

more clashing atoms, change chi angles to mean of chi µ plus or minus the standard

deviation σ multiplied by n, measure distances of each atoms and find the minimum

distance. Then, set chi angles such that minimum distance is largest among the

combination of µ ± nσ. Repeat this operation until it converges. Then increase

the value of n and repeat the process again. We set n = 0.5, 1.0 and 1.5. This

method reduced the number of clashes with only a minimal effect on accuracy. The

time and accuracy results presented in the next section are calculated only from the

clash-reduced models.

19

Page 31: Ann Thesis

3.5 Handling fixed atoms

SIDEpro can fix certain atoms by using an optional parameter in a same manner as

SCWRL. By using this option, not only side chains of proteins but also DNA, RNA

and other types of molecules can be considered to predict the side chain conformations.

Since SIDEpro can only handle C, O, N, H and S atoms, we treated other types of

atoms as C for fixed atoms.

All fixed atoms are treated in a similar way to how backbone atoms are treated.

Hence, SIDEpro builds boxes and neighbors for them, and calculates the energy Esb

in the same way as backbone atoms. The only difference is that they are not included

in the calculation of Evdw which defines clashes between side chains and backbones.

20

Page 32: Ann Thesis

Chapter 4

Results

For a side chain prediction, accuracy and time are important. We evaluated and

compared SIDEpro and SCWRL using some evaluation methods. In this chapter, a

detail of training and test dataset, evaluation methods, accuracies, CPU time, and

the number of clashes will be shown.

4.1 Data

We used 252 pdb files as the training data to obtain the ANN weight parameters

and the 379 protein SCWRL4 test set to evaluate SIDEpro. For the training data,

we obtained the list of proteins using PISCES server, at maximum mutual sequence

identity 30%, ≤ 1.8 A resolution, and maximum R-factor of 20 %. Then we randomly

picked up 252 files from the list excluding proteins with more than 25% sequence

identity with one or protein in the test set.

The published SCWRL3 back-bone dependent rotamer library was used for prediction

[2][3][4][6][5].

21

Page 33: Ann Thesis

4.2 Accuracy

Four methods were used to evaluate the results. The first two methods are χ1 and

χ1+2, which are widely used methods for evaluation of side-chain prediction algo-

rithms. χ1 within 40 degrees compared to the crystal structure is considered correct

for the χ1 method. For χ1+2, both χ1 and χ2 within 40 degrees is considered correct.

The third evaluation measure is RMSD. There are two ways to calculate RMSD; one

is calculate RMSD for each residue and take average, and another one is calculate

entire RMSD. We used former definition.

The last method is an evaluation method we defined. We called this method ”correct

rotamer”. We compared the correct rotamer defined by

ri = argmaxj

p(χi|µij ,σij) (4.1)

for crystal structure and predicted structure. We considered the side chain confor-

mation is predicted correctly when r are same.

Table 4.1 summarizes the accuracy results for SCWRL4 [11] and SIDEpro. SCWRL4

has options to choose Flexible Rotamer Model (FRM) and Rigid Rotamer Model

(RRM). We performed tests using both settings and found that FRM is consistenly

more accurate but slower than RRM, which agrees with [11]. Thus, the detailed

accuracies presented for SCWRL4 are those calculated using the FRM setting. The

SIDEpro results are slightly better according to all evaluation methods. The accura-

cies for correct rotamer, χ1 and χ1+2 are 66.41%, 85.92% and 74.53% respectively.

22

Page 34: Ann Thesis

Table 4.1: Accuracies: χ1 and χ2

Residue typeχ1 χ1+2

SIDEpro SCWRL4 SIDEpro SCWRL4CYS 90.53 89.91 —– —–ASP 84.33 83.57 65.01 65.18GLU 73 72.8 57.58 56.83PHE 96.3 95.42 92.61 92.89HIS 88.96 88.31 77.79 77.31ILE 95.37 95.84 83.06 84.33LYS 79.87 77.3 67.41 63.73LEU 93.5 92.3 86.69 86.33MET 83.96 84.23 73.19 70.85ASN 85.7 84.17 76.76 74.17PRO 83.9 85.73 —– —–GLN 77.89 79.35 60.26 59.67ARG 79.13 78.33 64.78 65.57SER 72.38 70.52 —– —–THR 90.03 90.46 —– —–VAL 92.85 92.92 —– —–TRP 93.29 92.7 82.67 80.26TYR 93.09 95.19 89.13 92.55ALL 85.92 85.54 74.53 74.09

4.3 CPU Time

The speed is an important characteristic of a side chain prediction method. Table 4.3

shows mean, median, and maximum prediction times of both SIDEpro and SCWRL4

on the 379 protein test set. The timing results using both FRM and RRM are

shown for SCWRL4. SIDEpro is approximately five times faster than SCWRL4-

FRM according to the mean, and four times faster for the median. The maximum

time of SIDEpro is about 1/10 that of SCWRL4-FRM.

SIDEpro is approximately 40% faster than SCWRL4-RRM according to the mean,

although the SCWRL4-RRM median is less than that of SIDEpro. The maximum

time of SIDEpro is approximately 1/7 that of SCWRL-RRM.

23

Page 35: Ann Thesis

Table 4.2: Accuracies: RMSD and Correct Rotamer

Residue typeRMSD Correct Rotamer

SIDEpro SCWRL4 SIDEpro SCWRL4CYS 0.347 0.354 90.78 90.09ASP 0.865 0.860 58.68 55.69GLU 1.459 1.454 30.96 30.28PHE 1.096 1.134 88.98 88.12HIS 1.198 1.217 52.78 51.94ILE 0.449 0.384 82.84 84.15LYS 1.423 1.507 36.58 32.16LEU 0.528 0.502 86.42 86.22MET 1.042 1.089 54.4 54.08ASN 0.912 0.937 46.94 43.06PRO 0.294 0.216 79.61 81.59GLN 1.472 1.472 31.3 31.08ARG 2.200 2.374 26.13 30.44SER 0.549 0.572 72.61 70.76THR 0.352 0.323 90.16 90.62VAL 0.287 0.268 92.92 92.96TRP 1.158 1.271 81.17 78.63TYR 1.346 1.217 85.42 87.88ALL 0.821 0.824 66.41 65.97

Figure 4.1 shows the relationship between the number of residues in a protein or

protein complex and the prediction time. For SIDEpro, the time increases linearly

according to the number of residues with an R squared value of 0.9205.

For both SCWRL methods, the time generally increases with the number of residues,

but the relationship is much less predictable. The Rsq values for SCWRL4-FRM is

0.5452 and for SCWRL4-RRM it is 0.2097.

To further analyze the prediction times of the methods, we teseted them on a small set

of large protein complexes curated from the Protein Data Bank. The table 4.4 shows

accuracies and time for the protein. From the table, SIDEpro is better for the accuracy

for correct rotamer and χ1 and SCWRL (FRM) is better for χ1+2 and RMSD. For

24

Page 36: Ann Thesis

time, SIDEpro is one tenth of SCWRL (FRM) and one seventh of SCWRL (RRM).

Hence, SIDEpro predicted fast with almost same accuracy comparing to SCWRL

(FRM).

We also tested for the protein in complex with RNA. Since both SIDEpro and SCWRL

have an option to fix some atoms such as DNA or RNA, we compared the accura-

cies and time for the protein complex when predicted with and without fixed RNA

structure. The table 4.5 shows the accuracies and time for the protein. From the

table, it is evident that the accuracies are better for all algorithms when predicted

with the RNA structure; however, the prediction times increased. When comparing

the algorithms, the prediction of SIDEpro with the RNA shows the highest accura-

cies for both χ1 and χ1+2. The prediction time for SIDEpro only increased 17.7%

by including the RNA structure while there were 2136.1% and 680.3% increases for

SCWRL(FRM) and SCWRL(RRM) respectively.

The calculations were performed on a machine with an AMD Turion 64 X2 Mobile

Technology TL-60+ at 2.00 GHz, with 3.00 GB of RAM, and running by 32-bit

Microsoft Windows Vista Home Premium Service Pack 2.

Table 4.3: Time comparison

Program Mean(sec) Median(sec) Max(sec)SIDEpro 2.79 2.33 11.312

SCWRL (FRM) 13.23 9.19 108.66SCWRL (RRM) 4.54 1.55 77.63

Table 4.4: Accuracy and time for a large protein (1RYP)

Algorithm Correct Rotamer χ1 χ1+2 RMSD timeSIDEpro 62.61 81.96 71.2 0.921 72.9

SCWRL(FRM) 62.58 81.82 71.47 0.908 781.1SCWRL(RRM) 59.47 80.63 68.55 0.950 525.1

25

Page 37: Ann Thesis

0 200 400 600 800 1000 1200

020

4060

8010

0

0 200 400 600 800 1000 1200

020

4060

8010

0

0 200 400 600 800 1000 1200

020

4060

8010

0

Length

Tim

e(se

c)

SIDEproSCWRL4 FRMSCWRL4 RRM

Figure 4.1: The relationship of a length of protein and a prediction time

26

Page 38: Ann Thesis

Table 4.5: Accuracy and time for a protein with RNA (1FJG)

Algorithm condition χ1 χ1+2 timeSIDEpro isolated 0.71 0.51 11.42SIDEpro with RNA 0.73 0.53 13.44

SCWRL(FRM) isolated 0.71 0.5 179.78SCWRL(FRM) with RNA 0.72 0.52 4020SCWRL(RRM) isolated 0.7 0.48 102SCWRL(RRM) with RNA 0.71 0.51 795.89

4.4 The number of clashes

Due to using trained energy function, SIDEpro applied removing clashes algorithm

at the end of prediction. Table 4.6 shows the number of clashes for the final SIDEpro

models, the initial SIDEpro models before the clash reduction procedure, SCWRL4-

RRM, and SCWRL-FRM, and crystal structure. When the clash removing protocol

is applied, the SIDEpro models have significantly fewer steric clashes than SCWRL

models.

Table 4.6: The number of clashes

SIDEpro SIDEpro(before removing clahses) SCWRL Crystal12895 39558 18009 2031

27

Page 39: Ann Thesis

Chapter 5

Conclusion

When compared to SCWRL4, the side-chain prediction algorithm presented here,

SIDEpro, shows slightly better results in terms of accuracy and significant improve-

ment in terms of speed. The ANN energy functions based on atom-atom distances

are a novel aspect of the overall algorithm presented here and we have shown that

they can be used to help improve the accuracy and speed. Although SIDEpro only

used distance of atoms to calculate energy, it shows better result than SCWRL which

uses more information such as angles among atoms to calculate hydrogen bonding

energy.

Regarding directions for improving the current prediction, it is possible to add more

information to the ANNs by increasing dimensions of input or constructing new net-

works for new information. Flexibility of side chains around rotamer dihedral angle

values and standard bond length and angles are important for this problem. The

more flexible model improved the accuracy. The flexible rotamer model (FRM) is

the most accurate SCWRL predictor. Another possible direction is to build a flexible

rotamer model version of SIDEpro by changing likelihood function. We expect this

28

Page 40: Ann Thesis

would increase the accuracy at the expense of speed, as is observed with the SCWRL4

versions.

For training, we applied Markov Chain Monte Carlo. Since we used the weights

which maximize the likelihood, the parameters may have been overfitted to training

data. It is also possible that the the parameters would represent a local maximum.

These issues could be avoided, or at least reduced, by taking the average over many

samples or increasing the size of the training data set. We could also try different

prior distributions of for the parameters and hyper parameters.

29

Page 41: Ann Thesis

Bibliography

[1] H. Berman, T. Battistuz, T. Bhat, W. Bluhm, P. Bourne, K. Burkhardt, Z. Feng,G. Gilliland, L. Iype, S. Jain, P. Fagan, J. Marvin, D. Padilla, V. Ravichandran,B. Schneider, N. Thanki, H. Weissig, J. Westbrook, and C. Zardecki. The proteindata bank. Acta Crystallographica. Section D, Biological Crystallography, 58,2002.

[2] A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack, Jr. A graph-theoryalgorithm for rapid protein side-chain prediction. Protein Sci, 12(9), 2003.

[3] R. L. Dunbrack, Jr. Rotamer libraries in the 21st century. Curr. Opin. Struct.Biol., 12, 2002.

[4] R. L. Dunbrack, Jr. and F. E. Cohen. Bayesian statistical analysis of proteinsidechain rotamer preferences. Protein Science, 6, 1997.

[5] R. L. Dunbrack, Jr. and M. Karplus. Backbone-dependent rotamer library forproteins: Application to side-chain prediction. J. Mol. Biol., 230, 1993.

[6] R. L. Dunbrack, Jr. and M. Karplus. Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nature Struct. Biol., 1,1994.

[7] R. Goldstein. Efficient rotamer elimination applied to protein side-chains andrelated spin glasses. Biophys J, 66, 1994.

[8] R. Huey, G. Morris, A. Olson, and D. Goodsell. A semiempirical free energyforce field with charge-based desolvation. Computational Chemistry, 28, 2007.

[9] C. Kingsford, B. Chazelle, and M. Singh. Solving and analyzing side-chain po-sitioning problems using linear and integer programming. Bioinformatics, 21,2005.

[10] P. Koehl and M. Delarue. Application of a self-consistent mean field theoryto predict protein side-chains conformation and estimate their conformationalentropy. J Mol Biol, 239, 1994.

[11] G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack, Jr. Improved predictionof protein side-chain conformations with scwrl4. Protein, 77(4), dec 2009.

30

Page 42: Ann Thesis

[12] S. Liang and N. Grishin. Side-chain modeling with an optimized scoring function.Protein Sci, 11, 2002.

[13] D. Muramatsu, M. Kondo, M. Sasaki, S. Tachibana, and T. Matsumoto. Amarkov chain monte carlo algorithm for bayesian dynamic signature verification.IEEE Trans. Information Forensic and Security., 1, 2006.

[14] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.

[15] J. Ponder and D. Case. Force fields for protein simulations. Advances in ProteinChemistry, 66, 2003.

[16] C. A. Rohl, C. E. M. Strauss, D. Chivian, and D. Baker. Modeling structurallyvariable regions in homologous proteins with rosetta. Proteins, 55, 2004.

[17] C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker. Protein structureprediction using rosetta. In L. Brand and M. L. Johnson, editors, NumericalComputer Methods, Part D, volume 383 of Methods in Enzymology, pages 66 –93. Academic Press, 2004.

[18] K. Vanommeslaeghe, E. Hatcher, C. Acharya, S. Kundu, S. Zhong, J. Shim,E. Darian, O. Guvench, P. Lopes, I. Vorobyov, and A. D. Mackerell, Jr. Charmmgeneral force field: A force field for drug-like molecules compatible with thecharmm all-atom additive biological force fields. Journal of ComputationalChemistry, 2010.

[19] Z. Xiang and B. Honig. Extending the accuracy limits of prediction for side-chainconformations. J Mol Biol, 311, 2001.

[20] J. Xu. Rapid protein side-chain packing via tree decomposition. 9th AnnualInternational Conference on Research in Computational Molecular Biology (RE-COMB), 2005.

31