Cybernetics and Systems Analysis, Vol. 49, No. 4, July, 2013
PROTEIN STRUCTURE PREDICTION PROBLEM:
FORMALIZATION USING QUATERNIONS
L. F. Hulianytskyi (a) and V. O. Rudyk (a)
UDC 519.8
Abstract. The authors discuss the formalization of the protein tertiary structure prediction problem based
on Dill’s HP-model. Three-dimensional discrete lattices and different approaches to representing paths
on them are investigated. Two path encodings are proposed and formalized, one
of which is based on quaternions.
Keywords: protein folding, tertiary structure of a protein molecule, discrete lattices, quaternions.
INTRODUCTION
The problem of predicting the tertiary structure of a protein molecule, also called the protein folding problem, has
recently become one of the most important and most studied problems in computational biology. This structure plays a key role in
determining the functional properties of proteins and is a source of information needed to obtain fundamental and
applied results in various fields of science and technology such as bioinformatics, medicine, pharmaceutics, computational
geometry, and nanotechnology.
The essence of the protein tertiary structure prediction problem is as follows: given a linear sequence of elements
constituting a molecule, determine its three-dimensional configuration. Well-known experimental
approaches (such as X-ray crystallography and magnetic-resonance spectroscopy) are used to solve this problem; however, they are
expensive and time-consuming and do not always give satisfactory results in practice. Therefore, mathematical methods
have been widely applied in recent years to analyze the structure of molecules.
The biophysical protein folding HP-model proposed by K. Dill in 1985 is the most extensively studied: it seeks the
molecule structure that minimizes the energy potential [1–3]. The mathematical solution of the problem is a continuous
two-dimensional or three-dimensional curve without self-intersections. The overwhelming majority of the well-known
protein structure models are discrete: the shape of the molecule is represented by a path in a discrete lattice. Notably,
even when the chemical and biological properties of the protein are significantly simplified, the mathematical modeling
involves NP-hard optimization problems [4].
To simplify the model, two-dimensional or three-dimensional cubic lattices are considered most often [5–11], though
they have shortcomings in the context of this problem. Passing to more complex lattices requires encoding a path as
a mathematical object that best reflects its characteristics; the various optimization algorithms then search
among these objects.
An important condition for the successful application of methods for modeling protein spatial structure is the choice
of adequate mathematical tools for the formal description of the problem. In the present paper we briefly outline the
principles that underlie Dill’s model and present the properties of lattices as mathematical objects, in particular,
invariance with respect to translations and revolutions and the neighboring of nodes.
We propose two approaches to path encoding in three-dimensional lattices, which differ in
their construction and in their use in protein structure prediction algorithms. Special attention is paid to the application of
quaternion tools for the design of q-encoding.
1060-0396/13/4904-0597 © 2013 Springer Science+Business Media New York
(a) V. M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, Ukraine.
Translated from Kibernetika i Sistemnyi Analiz, No. 4, July–August, 2013, pp. 130–136. Original article submitted January 24, 2013.
DILL’S MODEL
Protein molecules consist of amino acids sequentially connected by peptide bonds. A chain of amino acids is called the
primary protein structure. Hydrogen bonds form through interactions inside the molecule; as a result, parts of the chain
fold into spatial structures: α-helices, β-sheets, and other fragments that constitute the protein secondary structure [12, 13].
These elements then form a spatial configuration, which is the tertiary protein structure. Depending on their properties, all
amino acids are divided into two classes: hydrophobic and polar. When a molecule folds in a polar medium (water), polar
amino acids move to the molecule surface, and hydrophobic ones move inside it. Hydrophobic bonds occur between closely
spaced hydrophobic amino acids and play a key role in the formation of the tertiary protein structure. The bonds inside the
molecule determine its energy. According to the thermodynamic hypothesis, a protein takes the structure that minimizes its
free energy. This structure is called the native conformation.
The study of proteins relies on the dogma “sequence–structure–functionality”: protein functions directly
depend on the spatial configuration, which in turn is uniquely determined by the primary structure [12]. However, there also
exist conformer proteins with two native conformations. As a rule, the energies of the two conformations differ insignificantly in such cases.
The protein tertiary structure prediction problem is: given a sequence of amino acids, determine the protein spatial
configuration. The methods of its solution are divided into two classes: statistical and de novo. Statistical methods are based
on the fact that similar parts of the primary sequence fold into similar three-dimensional forms. A database of known
proteins is searched for similarity with the given protein, and the tertiary structure is constructed from the matches. This
approach provides satisfactory accuracy; however, similar sequences cannot be found for every primary sequence.
In such cases, de novo methods are applied, which use no information other than the primary structure
and state the problem as energy minimization.
The hydrophobic polar Dill’s model pertains to de novo methods. In this model, the primary protein structure is
specified by a sequence $S = \sigma_1 \sigma_2 \ldots \sigma_n$, $\sigma_i \in \{H, P\}$, $i = 1, \ldots, n$, where H and P are hydrophobic and polar amino acids,
respectively, and n is the number of amino acids in the molecule. Tertiary structure is a non-self-intersecting path of the
corresponding length in a certain discrete lattice. Amino-acid residues are sequentially located at each node of the path. It is
assumed that hydrophobic bonds occur among hydrophobic residues in neighboring nodes (but not in those neighboring in
the primary sequence). Energy of the structure is the number of bonds in it with the minus sign. The problem is to find the
structure with minimum energy. More formally, if an amino acid $\sigma_i$ in the structure S is associated with a node $U(\sigma_i)$, the
energy is defined by the formula
$$E(S) = -\sum_{\substack{1 \le i \le n-2 \\ i+2 \le j \le n}} I(U(\sigma_i), U(\sigma_j))\, h(\sigma_i)\, h(\sigma_j),$$
where
$$I(U_1, U_2) = \begin{cases} 1 & \text{if nodes } U_1 \text{ and } U_2 \text{ are neighboring}, \\ 0 & \text{otherwise}; \end{cases} \qquad h(\sigma) = \begin{cases} 1 & \text{if } \sigma = H, \\ 0 & \text{if } \sigma = P. \end{cases}$$
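As an illustration of the energy definition, the following minimal Python sketch counts hydrophobic contacts for a path on the cubic lattice. The names `hp_energy` and `CUBIC_NEIGHBORS`, and the choice of the cubic lattice itself, are assumptions made for this example and are not part of the original formalization.

```python
from itertools import combinations

# Assumed example lattice: the six unit steps of the 3D cubic lattice.
CUBIC_NEIGHBORS = frozenset({
    (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1),
})


def hp_energy(sequence, nodes, neighbors=CUBIC_NEIGHBORS):
    """Return minus the number of H-H contacts between residues i, j with j >= i + 2.

    sequence -- string over the alphabet {"H", "P"}
    nodes    -- list of lattice nodes (integer 3-tuples), one per residue
    """
    if len(sequence) != len(nodes):
        raise ValueError("sequence and path must have the same length")
    contacts = 0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i < 2:
            continue  # residues adjacent in the chain do not form a bond
        if sequence[i] == "H" and sequence[j] == "H":
            diff = tuple(a - b for a, b in zip(nodes[i], nodes[j]))
            if diff in neighbors:
                contacts += 1
    return -contacts


# A four-residue chain folded into a unit square: the two end residues touch.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(hp_energy("HPPH", square))  # -1
```

The sketch does not check that the path is connected or free of self-intersections; it only evaluates the sum in the energy formula.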
LATTICES
The term “lattice” is used in various fields of mathematics. In group theory, a lattice L in a Euclidean space $\mathbb{R}^n$ is
defined as a discrete subgroup of $\mathbb{R}^n$ [14]. It can be represented as a set of vectors $L = \left\{ \sum_{i=1}^{m} a_i e_i \mid a_i \in \mathbb{Z} \right\}$, where
$B = \{e_1, e_2, \ldots, e_m\} \subset \mathbb{R}^n$ is some basis and $\mathbb{Z}$ is the set of integers. Different bases can generate identical lattices. In what
follows, we will consider lattices in a three-dimensional Euclidean space and call their elements nodes.
A lattice is said to be invariant with respect to the mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ if $f(v) \in L$ for any $v \in L$. Since a lattice is
a subgroup of $\mathbb{R}^n$, it is invariant with respect to translation by any vector $v \in L$.
Neighboring in a Lattice. To formalize the concept of a path in a lattice, it is necessary to introduce the concept of
neighboring nodes. To this end, we specify the binary neighboring relation $R \subseteq L \times L$: a node $v \in L$ is neighboring for the
node $u \in L$ if and only if $(u, v) \in R$. Neighboring in a lattice is invariant with respect to the mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ if the
lattice is invariant with respect to f and for two neighboring nodes $u, v \in L$ the nodes $f(u)$ and $f(v)$ are also neighboring. If
the neighboring relation is invariant with respect to translation by any vector $v \in L$, then it can be defined using some set of
neighborhood vectors $V = \{v_1, \ldots, v_s\} \subset L$: a node $c_1 \in L$ is neighboring to a node $c_2 \in L$ if and only if $c_1 - c_2 \in V$. And vice
versa, if neighboring is specified as described above, it is invariant with respect to translation by any vector $v \in L$. For
a neighboring relation to be symmetric, the condition $v \in V \Rightarrow -v \in V$ should be satisfied.
By a path of length m in a lattice L with neighboring R is meant a sequence $c(m) = c_1 c_2 \ldots c_m$ such that $c_i \in L$,
$i = 1, \ldots, m$, and the condition $(c_i, c_{i+1}) \in R$, $i = 1, \ldots, m-1$ (connectivity condition) is satisfied. Note that a path does not contain
self-intersections if the condition $c_i = c_j$ yields $i = j$, $i, j = 1, \ldots, m$.
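The definitions of neighborhood vectors, connectivity, and self-intersection can be sketched in Python as follows. The face-centered cubic neighborhood set `FCC_NEIGHBORS` is only an assumed example of a non-cubic lattice, and the function names are hypothetical.

```python
from itertools import product

# Assumed example: the 12 neighborhood vectors of the face-centered cubic
# lattice, i.e. all vectors with exactly two coordinates equal to +1 or -1.
FCC_NEIGHBORS = frozenset(
    v for v in product((-1, 0, 1), repeat=3) if sum(abs(x) for x in v) == 2
)


def is_symmetric(neighbors):
    """Check the symmetry condition: v in V implies -v in V."""
    return all(tuple(-x for x in v) in neighbors for v in neighbors)


def is_connected(path, neighbors):
    """Check the connectivity condition: each step c_{i+1} - c_i lies in V."""
    return all(
        tuple(b - a for a, b in zip(c1, c2)) in neighbors
        for c1, c2 in zip(path, path[1:])
    )


def is_self_avoiding(path):
    """Check that no lattice node is visited twice."""
    return len(set(path)) == len(path)


path = [(0, 0, 0), (1, 1, 0), (1, 2, 1), (0, 2, 0)]
print(is_connected(path, FCC_NEIGHBORS), is_self_avoiding(path))  # True True
```

Representing nodes as integer tuples keeps them hashable, so the self-intersection test reduces to comparing the number of distinct nodes with the path length.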
ENCODING
In solving combinatorial optimization problems whose solution space is paths in some discrete lattice, it often turns
out to be inefficient to represent a path directly as a sequence of coordinates of lattice nodes, since not every such sequence
satisfies the path connectivity condition. An alternative representation can be other ways of path encoding.
Let $\mathbb{N}$ be the set of natural numbers. In what follows, by an encoding of a path $c_1 c_2 \ldots c_m$, denoted by $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_k$, we will mean a sequence $s_1 s_2 \ldots s_{m+p}$, $s_i \in S$, $i = 1, \ldots, m+p$, where S is the encoding alphabet and $p \in \mathbb{Z}$ is the
encoding parameter, if the following conditions are satisfied:
— for any $m \in \mathbb{N}$ and $s_1, s_2, \ldots, s_{m+p} \in S$ there exist $c_1, c_2, \ldots, c_m \in L$ such that the equality $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_{m+p}$ is true;
— if $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_{m+p}$, then $\mathrm{Enc}(c_1 c_2 \ldots c_m c_{m+1}) = s_1 s_2 \ldots s_{m+p} s_{m+p+1}$.
An encoding is invariant with respect to a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ if the lattice and the neighboring relation in it are
invariant with respect to f and for any $m \in \mathbb{N}$, $c_1, c_2, \ldots, c_m \in L$ the condition $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_{m+p}$ yields
$\mathrm{Enc}(f(c_1) f(c_2) \ldots f(c_m)) = s_1 s_2 \ldots s_{m+p}$.
Encoding divides the set of paths in a lattice into equivalence classes: paths $c_1 c_2 \ldots c_m$ and $c'_1 c'_2 \ldots c'_m$ belong to one
class if and only if $\mathrm{Enc}(c_1 c_2 \ldots c_m) = \mathrm{Enc}(c'_1 c'_2 \ldots c'_m)$.
Absolute Encoding. An encoding defined by the formula
$$\mathrm{Enc}_{abs}(c_1 c_2 \ldots c_m) = (c_2 - c_1)(c_3 - c_2) \ldots (c_m - c_{m-1}) \qquad (1)$$
is called absolute. Its alphabet is the whole set V, and the value of the parameter is $p = -1$. It can be shown that absolute
encoding is invariant with respect to translation by any vector $v \in L$ and divides the set of paths into equivalence
classes inside which elements are equal to within translation.
It is convenient to use absolute encoding to solve problems where the coordinates of the initial point in paths are not
important. An advantage of absolute encoding over a sequence of node coordinates is that the path connectivity condition is
satisfied automatically.
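Absolute encoding by formula (1) and its inverse can be sketched in a few lines of Python. The function names `encode_abs` and `decode_abs` and the default start node are assumptions for this illustration.

```python
def encode_abs(path):
    """Absolute encoding: the sequence of differences c_{i+1} - c_i (formula (1))."""
    return [tuple(b - a for a, b in zip(c1, c2)) for c1, c2 in zip(path, path[1:])]


def decode_abs(steps, start=(0, 0, 0)):
    """Rebuild a path from its absolute encoding, starting at an assumed node."""
    path = [tuple(start)]
    for step in steps:
        path.append(tuple(p + d for p, d in zip(path[-1], step)))
    return path


path = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
shifted = [(x + 5, y - 2, z + 7) for x, y, z in path]

# Translating the path does not change its absolute encoding.
print(encode_abs(path) == encode_abs(shifted))  # True
```

A path of m nodes yields m - 1 steps, in agreement with the parameter value p = -1. Decoding recovers the path only up to the choice of the start node, which is exactly the translation equivalence described above.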
Applying Quaternions in Protein Structure Prediction Problems. In what follows, we will use
quaternions to describe revolutions. Let us present the main information [15] necessary to formalize the
problem under study.
The skew field of quaternions $\mathbb{H}$ is the set of pairs of the form $(a, \vec{u})$, where $a \in \mathbb{R}$ and $\vec{u} \in \mathbb{R}^3$, with addition and multiplication
operations defined as follows:
$$(a, \vec{u}) + (b, \vec{v}) = (a + b,\; \vec{u} + \vec{v}),$$
$$(a, \vec{u})(b, \vec{v}) = (ab - \vec{u} \cdot \vec{v},\; a\vec{v} + b\vec{u} + \vec{u} \times \vec{v}).$$
The dot and the symbol $\times$ denote the scalar and vector products, respectively.
The norm of a quaternion $q = (\omega, (x, y, z))$ is naturally defined as $\|q\| = \sqrt{\omega^2 + x^2 + y^2 + z^2}$. The quaternion
conjugate to the given quaternion $q = (\omega, (x, y, z))$ is the quaternion $q^* = (\omega, (-x, -y, -z))$.
The multiplicative inverse of a quaternion is calculated by the formula
$$q^{-1} = \frac{q^*}{\|q\|^2}.$$
A three-dimensional vector $\vec{v} \in \mathbb{R}^3$ can be considered as a quaternion $(0, \vec{v})$. Let $q = (\omega, \vec{u})$ be a quaternion with unit
norm and $\vec{v} \in \mathbb{R}^3$. Then the result of the revolution of the vector $\vec{v}$ about the axis $\vec{u}$ by the angle $\theta = 2 \arccos \omega$ can be represented
as the product $q v q^{-1}$.
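The quaternion operations and the revolution formula can be checked numerically with a short Python sketch. Quaternions are stored as pairs `(scalar, (x, y, z))`, mirroring the notation above; the helper names `q_mul`, `q_inv`, `rotate`, and `axis_angle` are assumptions for this illustration, and floating-point arithmetic is used, so results are approximate.

```python
import math


def q_mul(p, q):
    """Product (a, u)(b, v) = (ab - u.v, av + bu + u x v)."""
    a, (ux, uy, uz) = p
    b, (vx, vy, vz) = q
    return (
        a * b - (ux * vx + uy * vy + uz * vz),
        (
            a * vx + b * ux + uy * vz - uz * vy,
            a * vy + b * uy + uz * vx - ux * vz,
            a * vz + b * uz + ux * vy - uy * vx,
        ),
    )


def q_inv(q):
    """Inverse q^{-1} = q* / ||q||^2."""
    w, (x, y, z) = q
    n2 = w * w + x * x + y * y + z * z
    return (w / n2, (-x / n2, -y / n2, -z / n2))


def rotate(q, v):
    """Return the vector part of q v q^{-1}, with v embedded as (0, v)."""
    _, rotated = q_mul(q_mul(q, (0.0, tuple(v))), q_inv(q))
    return rotated


def axis_angle(axis, theta):
    """Unit quaternion for a turn by theta about the given axis."""
    norm = math.sqrt(sum(x * x for x in axis))
    s = math.sin(theta / 2) / norm
    return (math.cos(theta / 2), tuple(s * x for x in axis))


q = axis_angle((0, 0, 1), math.pi / 2)   # quarter turn about the z axis
print(rotate(q, (1, 0, 0)))              # approximately (0, 1, 0)
print(2 * math.acos(q[0]))               # the angle 2 arccos(omega), about pi/2
```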
Constructing a q-Encoding. Let us consider a three-dimensional lattice L with the set of neighborhood vectors V, for
which the following property holds: if $v \in V$ is a fixed vector, then
$$\forall v' \in V \;\; \exists q_{v'} \in \mathbb{H}: \; \left( \forall v'' \in V: \; q_{v'} v'' q_{v'}^{-1} \in V \right) \wedge \left( q_{v'} v\, q_{v'}^{-1} = v' \right). \qquad (2)$$
If we denote $Q_V = \{Q(v) \mid v \in V\}$, then it can be shown that such lattices are invariant with respect to revolutions
specified by quaternions $q \in Q_V$.
In the two-dimensional case, along with absolute encoding, relative encoding is used, where the position of the next amino
acid is specified with respect to the previous one [7]. We will employ quaternions and design a procedure
that uses the absolute encoding of a path to construct its encoding $\mathrm{Enc}_q$, and will call it q-encoding. Note that q-encoding is the
three-dimensional analog of the relative encoding used in lattices on a plane.
Let us fix some neighborhood vector $a_0 \in V$. From condition (2) it follows that there exists a function $Q: V \to Q_V$ such
that for all $a \in V$ the equality $Q(a)\, a_0\, Q(a)^{-1} = a$ holds; as $Q(a_0)$ we take the identity quaternion, i.e., the one describing the zero turn,
$Q(a_0) = (1, (0, 0, 0))$. Let us prove the following statement.
Statement. If $q \in Q_V$, then $Q(q a_0 q^{-1}) = q$.
Proof. From the condition $q \in Q_V$ it follows that there exists a vector $v \in V$ such that $Q(v) = q$ and simultaneously
$q a_0 q^{-1} = v$. Then substituting the second equality into the first one yields the required statement. The proof is completed.
Let now the absolute path encoding $a_1 a_2 \ldots a_{m-1} = \mathrm{Enc}_{abs}(c_1 c_2 \ldots c_m)$ calculated by formula (1) be given. We will
construct a q-encoding $r_1 r_2 \ldots r_{m-2} = \mathrm{Enc}_q(c_1 c_2 \ldots c_m)$, $r_i \in Q_V$, by the following rules:
$$r_0 = Q(a_1); \qquad r_k = Q\left(r_{k-1}^{-1} r_{k-2}^{-1} \ldots r_0^{-1}\, a_{k+1}\, r_0 r_1 \ldots r_{k-1}\right), \quad k = 1, \ldots, m-2.$$
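The construction rules can be sketched in Python for the cubic lattice. Everything concrete here is an assumption made for illustration: the choice $a_0 = (1, 0, 0)$, the table `Q_TABLE` playing the role of the function Q, and the use of unnormalized integer quaternions. The latter define the same revolutions as their unit-norm multiples, because $q v q^{-1}$ does not change when q is scaled, and they keep the arithmetic exact.

```python
def q_mul(p, q):
    """Quaternion product for quaternions stored as (scalar, (x, y, z))."""
    a, (ux, uy, uz) = p
    b, (vx, vy, vz) = q
    return (a * b - (ux * vx + uy * vy + uz * vz),
            (a * vx + b * ux + uy * vz - uz * vy,
             a * vy + b * uy + uz * vx - ux * vz,
             a * vz + b * uz + ux * vy - uy * vx))


def q_conj(q):
    w, (x, y, z) = q
    return (w, (-x, -y, -z))


def rotate(q, v):
    """Return q v q^{-1} = (q v q*) / ||q||^2 for an integer quaternion q."""
    w, (x, y, z) = q
    n2 = w * w + x * x + y * y + z * z
    _, r = q_mul(q_mul(q, (0, tuple(v))), q_conj(q))
    return tuple(c // n2 for c in r)  # division is exact for the table below


A0 = (1, 0, 0)  # assumed reference neighborhood vector a_0

# Assumed table for Q: each entry turns A0 into its key, Q(a) A0 Q(a)^{-1} = a.
Q_TABLE = {
    (1, 0, 0): (1, (0, 0, 0)),    # identity
    (-1, 0, 0): (0, (0, 0, 1)),   # half turn about z
    (0, 1, 0): (1, (0, 0, 1)),    # quarter turn about z
    (0, -1, 0): (1, (0, 0, -1)),  # quarter turn about z, opposite sense
    (0, 0, 1): (1, (0, -1, 0)),   # quarter turn about y
    (0, 0, -1): (1, (0, 1, 0)),   # quarter turn about y, opposite sense
}


def encode_q(abs_steps):
    """Build r_1 ... r_{m-2} from the absolute encoding a_1 ... a_{m-1}."""
    frames = [Q_TABLE[abs_steps[0]]]        # r_0 = Q(a_1)
    for a in abs_steps[1:]:
        local = a
        for r in frames:                    # conjugate by r_0^{-1}, then r_1^{-1}, ...
            local = rotate(q_conj(r), local)
        frames.append(Q_TABLE[local])       # r_k = Q(r_{k-1}^{-1} ... a_{k+1} ... r_{k-1})
    return frames[1:]                       # r_0 is not part of the q-encoding


bend = [(1, 0, 0), (0, 1, 0)]               # a step along x, then a step along y
turned = [(0, 0, 1), (0, 1, 0)]             # the same shape, turned about the y axis
print(encode_q(bend) == encode_q(turned))   # True
```

The nested conjugations are applied one quaternion at a time instead of forming the product $r_0 r_1 \ldots r_{k-1}$ first; by associativity the result is the same. In this sketch two paths that differ by a revolution from the table receive the same q-encoding, in line with the rotation-invariance property proved in Theorem 2.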
Let us also derive the inverse scheme: given a q-encoding $r_1 r_2 \ldots r_{m-2}$, construct a path $c_1 c_2 \ldots c_m$ such that
$r_1 r_2 \ldots r_{m-2} = \mathrm{Enc}_q(c_1 c_2 \ldots c_m)$.
THEOREM 1. If $a_1 = a_0$ and $a_k = r_1 r_2 \ldots r_{k-1}\, a_0\, r_{k-1}^{-1} \ldots r_2^{-1} r_1^{-1}$, $k = 2, \ldots, m-1$, then the encoding $r'_1 r'_2 \ldots r'_{m-2}$ that
corresponds to the absolute encoding $a_1 a_2 \ldots a_{m-1}$ equals $r_1 r_2 \ldots r_{m-2}$.
Proof. Indeed, we have
$$r'_0 = Q(a_1) = Q(a_0) = (1, (0, 0, 0));$$
$$r'_1 = Q\left((r'_0)^{-1} a_2\, r'_0\right) = Q\left(r_0^{-1}\, r_1 a_0 r_1^{-1}\, r_0\right) = Q\left(r_1 a_0 r_1^{-1}\right) = r_1.$$
Then we use the mathematical induction method. Let $r'_t = r_t$ for all $t = 1, \ldots, k$. Let us show that $r'_{k+1} = r_{k+1}$:
$$r'_{k+1} = Q\left((r'_k)^{-1} (r'_{k-1})^{-1} \ldots (r'_0)^{-1}\, a_{k+2}\, r'_0 r'_1 \ldots r'_k\right)$$
$$= Q\left(r_k^{-1} r_{k-1}^{-1} \ldots r_0^{-1}\; r_1 r_2 \ldots r_{k+1}\, a_0\, r_{k+1}^{-1} \ldots r_1^{-1}\; r_0 r_1 \ldots r_k\right) = Q\left(r_{k+1} a_0 r_{k+1}^{-1}\right) = r_{k+1}.$$
Thus we have determined how q-encoding can be used to obtain the absolute one from which, in turn, one can
obtain the sequence of path nodes by formula (1).
The theorem is proved.
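The inverse scheme of Theorem 1 can be sketched in Python as follows. As assumptions for this illustration, the reference vector is $a_0 = (1, 0, 0)$ on the cubic lattice, quaternions are stored as `(scalar, (x, y, z))`, and unnormalized integer quaternions are used so that the arithmetic stays exact (scaling q does not change $q v q^{-1}$). The names `decode_q` and `steps_to_path` are hypothetical.

```python
def q_mul(p, q):
    """Quaternion product for quaternions stored as (scalar, (x, y, z))."""
    a, (ux, uy, uz) = p
    b, (vx, vy, vz) = q
    return (a * b - (ux * vx + uy * vy + uz * vz),
            (a * vx + b * ux + uy * vz - uz * vy,
             a * vy + b * uy + uz * vx - ux * vz,
             a * vz + b * uz + ux * vy - uy * vx))


def rotate(q, v):
    """Return q v q^{-1} = (q v q*) / ||q||^2 for an integer quaternion q."""
    w, (x, y, z) = q
    n2 = w * w + x * x + y * y + z * z
    _, r = q_mul(q_mul(q, (0, tuple(v))), (w, (-x, -y, -z)))
    return tuple(c // n2 for c in r)  # exact when q maps the lattice to itself


def decode_q(q_code, a0=(1, 0, 0)):
    """Return a_1 ... a_{m-1} with a_1 = a0 and a_k = (r_1...r_{k-1}) a0 (r_1...r_{k-1})^{-1}."""
    steps = [tuple(a0)]
    prod = (1, (0, 0, 0))                 # running product r_1 r_2 ... r_{k-1}
    for r in q_code:
        prod = q_mul(prod, r)
        steps.append(rotate(prod, a0))
    return steps


def steps_to_path(steps, start=(0, 0, 0)):
    """Recover path nodes from an absolute encoding, as in formula (1)."""
    path = [tuple(start)]
    for step in steps:
        path.append(tuple(p + d for p, d in zip(path[-1], step)))
    return path


left = (1, (0, 0, 1))                      # quarter turn about z (unnormalized)
print(decode_q([left, left]))              # [(1, 0, 0), (0, 1, 0), (-1, 0, 0)]
print(steps_to_path(decode_q([left, left])))
```

A q-encoding of length m - 2 yields m - 1 steps and hence m nodes. The start node and the direction of the first step are fixed by convention, since the q-encoding itself does not store them.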
Let us prove one of the important properties of relative encoding.
THEOREM 2. A turn of the path specified by the absolute encoding $a_1 a_2 \ldots a_{m-1}$, provided that $a_1 = a_0$, described by
the quaternion $q \in Q_V$, does not change its encoding.
Proof. Let $r_1 r_2 \ldots r_{m-2}$ be the q-encoding that corresponds to the absolute one $a_1 a_2 \ldots a_{m-1}$ such that $a_1 = a_0$,
$a_k = r_1 r_2 \ldots r_{k-1}\, a_0\, r_{k-1}^{-1} \ldots r_2^{-1} r_1^{-1}$, $k = 2, \ldots, m-1$. Let us consider a quaternion $q \in Q_V$ and determine the q-encoding for the path
obtained by the turn corresponding to the quaternion q. Denote
$$a'_1 = q a_1 q^{-1} = q a_0 q^{-1},$$
$$a'_k = q a_k q^{-1} = q\, r_1 r_2 \ldots r_{k-1}\, a_0\, r_{k-1}^{-1} \ldots r_2^{-1} r_1^{-1}\, q^{-1}, \quad k = 2, \ldots, m-1.$$
Construct a relative encoding for $a'_1 a'_2 \ldots a'_{m-1}$:
$$r'_0 = Q(a'_1) = Q(q a_0 q^{-1}) = q,$$
$$r'_1 = Q\left((r'_0)^{-1} a'_2\, r'_0\right) = Q\left(q^{-1} a'_2\, q\right) = Q\left(q^{-1} q\, a_2\, q^{-1} q\right) = Q(a_2) = Q\left(r_0^{-1} a_2 r_0\right) = r_1.$$
Then assume that $r'_t = r_t$ for all $t = 1, \ldots, k$ and show by the mathematical induction method that $r'_{k+1} = r_{k+1}$:
$$r'_{k+1} = Q\left((r'_k)^{-1} (r'_{k-1})^{-1} \ldots (r'_0)^{-1}\, a'_{k+2}\, r'_0 r'_1 \ldots r'_k\right)$$
$$= Q\left(r_k^{-1} \ldots r_1^{-1}\, q^{-1}\; q\, a_{k+2}\, q^{-1}\; q\, r_1 \ldots r_k\right) = Q\left(r_k^{-1} \ldots r_1^{-1}\, a_{k+2}\, r_1 \ldots r_k\right)$$
$$= Q\left(r_k^{-1} \ldots r_1^{-1} r_0^{-1}\, a_{k+2}\, r_0 r_1 \ldots r_k\right) = r_{k+1}.$$
Thus, the q-encoding that corresponds to the absolute encoding $a'_1 a'_2 \ldots a'_{m-1}$ is $r'_1 r'_2 \ldots r'_{m-2} = r_1 r_2 \ldots r_{m-2}$.
The theorem is proved.
As a consequence, in combinatorial optimization problems where the path shape rather than location is important,
using q-encoding narrows the space of alternative solutions, which reduces the search time.
CONCLUSIONS
In combinatorial optimization problems where the solution space consists of paths in some discrete lattice, different
representations of a path as a mathematical object can be used. The expediency of choosing one representation or another
depends on the features of the problem. Absolute encoding is computationally simpler and distinguishes paths to within
translation. The proposed q-encoding stores information about the path shape but not about its position in space; as a
result, it has advantages when used in algorithms for solving the tertiary protein structure prediction problem such as immune
algorithms or ant colony optimization algorithms [6–8, 16, 17]. Using quaternions allows constructing a q-encoding in
three-dimensional lattices and substantially reduces the computational cost of the procedures used earlier [18], which also
reduces the computational cost of problem solution algorithms.
The proposed encoding can be used not only in algorithms to model and predict the structure of protein molecules but
also to solve other problems where it becomes necessary to analyze the space of three-dimensional curves specified in a
discrete lattice.
REFERENCES
1. K. A. Dill, “Theory for the folding and stability of globular proteins,” Biochemistry, No. 24(6), 1501–1509 (1985).
2. K. Dill, S. Bromberg, K. Yue, et al., “Principles of protein folding — A perspective from simple exact models,”
Protein Sci., No. 4, 561–602 (1995).
3. K. A. Dill, S. Banu Ozkan, M. Scott Shell, and T. R. Weikl, “The protein folding problem,” Ann. Rev. Biophysics.,
No. 37, 289–316 (2008).
4. P. Crescenzi, D. Goldman, C. Papadimitriou, et al., “On the complexity of protein folding,” J. Comput. Biology,
No. 5(3), 423–465 (1998).
5. S. Istrail and F. Lam, “Combinatorial algorithms for protein folding in lattice models: A survey of mathematical
results,” Commun. Inf. Syst., No. 9(4), 303–345 (2009).
6. V. Cutello, G. Nicosia, M. Pavone, and J. Timmis, “An immune algorithm for protein structure prediction on lattice
models,” IEEE Trans. Evol. Comput., No. 11(1), 101–117 (2007).
7. A. Shmygelska and H. Hoos, “An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein
folding problem,” BMC Bioinformatics, No. 6(30), 30–52 (2005).
8. S. Fidanova and I. Lirkov, “Ant colony system approach for protein folding,” Int. Conf. Multiconf. Comput. Sci. and
Inform. Techn. (2008), pp. 887–891.
9. H. Greenberg, W. Hart, and G. Lancia, “Opportunities for combinatorial optimization in computational biology,”
INFORMS J. Comput., No. 16(3), 211–231 (2004).
10. W. Wei and T. Yanlin, “A new algorithm for 2D hydrophobic-polar model: An algorithm based on hydrophobic core
in square lattice,” Pak. J. Biol. Sci., No. 11, 1815–1819 (2008).
11. P. Festa, “Optimization problems in molecular biology: A survey and critical review,” Int. Math. Forum, 3, No. 6,
269–289 (2008).
12. C. B. Anfinsen, J. T. Edsall, and F. M. Richards, Advances in Protein Chemistry, Acad. Press, London (1965).
13. A. M. Gupal and I. V. Sergienko, Optimal Recognition Procedures [in Russian], Naukova Dumka, Kyiv (2008).
14. J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, Springer-Verlag, New York (1998).
15. J. B. Kuipers, Quaternions and Rotation Sequences, Princeton Univ. Press, Princeton (1999).
16. V. Cutello, G. Morelli, G. Nicosia, et al., “On discrete models and immunological algorithms for protein structure
prediction,” Nat. Comput., No. 10, 91–102 (2011).
17. L. F. Hulianytskyi and V. O. Rudyk, “Analysis of the algorithms predicting protein tertiary structure based on the ant
colony optimization method,” in: V. Velichko, O. Voloshin, and K. Markov (eds.), Problems of Computer
Intellectualization, V. M. Glushkov Inst. of Cybernetics, ITHEA, Kyiv–Sofia (2012), pp. 152–159.
18. V. O. Rudyk, “Representing the protein structure in three-dimensional discrete lattices of arbitrary type,” Teoriya
Optym. Rishen’, No. 10, 38–47 (2011).