

Cybernetics and Systems Analysis, Vol. 49, No. 4, July, 2013

PROTEIN STRUCTURE PREDICTION PROBLEM:

FORMALIZATION USING QUATERNIONS

L. F. Hulianytskyi^a and V. O. Rudyk^a

UDC 519.8

Abstract. The authors discuss the formalization of the protein tertiary structure prediction problem based on Dill’s HP-model. Three-dimensional discrete lattices and different approaches to representing paths on them are the subjects of investigation. Two ways of path encoding are proposed and formalized, one of which is based on quaternions.

Keywords: protein folding, tertiary structure of a protein molecule, discrete lattices, quaternions.

INTRODUCTION

The problem of predicting the tertiary structure of a protein molecule, also called the protein folding problem, has recently become one of the most important and most studied problems in computational biology. This structure plays a key role in determining the functional properties of proteins and is a source of information needed to obtain fundamental and applied results in various fields of science and technology, such as bioinformatics, medicine, pharmaceutics, computational geometry, and nanotechnology.

The essence of the protein tertiary structure prediction problem is as follows: given a linear sequence of elements constituting a molecule, determine its three-dimensional configuration. Well-known experimental approaches (such as X-ray crystallography and magnetic-resonance spectroscopy) are used to solve this problem; however, they are expensive and time-consuming, and they do not always give satisfactory results in practice. Therefore, mathematical methods have been widely applied in recent years to analyze the structure of molecules.

The biophysical HP-model of protein folding proposed by K. Dill in 1985 is the most extensively studied: it seeks the molecule structure that minimizes the energy potential [1–3]. The mathematical solution of the problem is a continuous two-dimensional or three-dimensional curve without self-intersections. The overwhelming majority of well-known protein structure models are discrete: the shape of the molecule is represented by a path in a discrete lattice. It is noteworthy that even when the chemical and biological properties of the protein are significantly simplified, the mathematical modeling involves NP-hard optimization problems [4].

To simplify the model, two-dimensional or three-dimensional cubic lattices are considered most often [5–11], though they have shortcomings in the context of this problem. Passing to more complex lattices requires encoding a path as a mathematical object that best reflects its characteristics; various optimization algorithms then search among these objects.

An important condition for the successful application of methods for modeling protein spatial structure is the choice of adequate mathematical tools for the formal description of the problem. In the present paper we briefly outline the principles that underlie Dill’s model and present the properties of lattices as mathematical objects, in particular, invariance with respect to translations and rotations and the neighboring of nodes.

We propose two approaches to path encoding in three-dimensional lattices, which differ in their construction and in their use in protein structure prediction algorithms. Special attention is paid to the application of quaternions in the design of q-encoding.

1060-0396/13/4904-0597 © 2013 Springer Science+Business Media New York

^a V. M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, Ukraine. Translated from Kibernetika i Sistemnyi Analiz, No. 4, July–August, 2013, pp. 130–136. Original article submitted January 24, 2013.


DILL’S MODEL

Protein molecules consist of amino acids sequentially connected by peptide bonds. A chain of amino acids is called the primary protein structure. Hydrogen bonds are formed by interactions inside the molecule; as a result, parts of the chain fold into spatial structures: α-helices, β-sheets, and other fragments that constitute the protein secondary structure [12, 13]. These elements then form a spatial configuration, which is the tertiary protein structure. Depending on their properties, all amino acids are divided into two classes: hydrophobic and polar. When a molecule folds in a polar medium (water), polar amino acids move to the molecule surface, and hydrophobic ones move inside it. Hydrophobic bonds occur between closely spaced hydrophobic amino acids and play a key role in the formation of the tertiary protein structure. The bonds inside the molecule determine its energy. According to the thermodynamic hypothesis, a protein takes the structure that minimizes its free energy. This structure is called the native conformation.

In studying proteins, the dogma “sequence–structure–functionality” is used: the functions of a protein directly depend on its spatial configuration, which in turn is uniquely determined by its primary structure [12]. However, there also exist conformer proteins with two native conformations. As a rule, the energies of the two conformations differ insignificantly in such cases.

The protein tertiary structure prediction problem is: given a sequence of amino acids, determine the protein spatial configuration. The methods of its solution are divided into two classes: statistical and de novo. Statistical methods are based on the fact that similar parts of the primary sequence fold into similar three-dimensional forms. The database of known proteins is searched for similarity with the given protein, and the tertiary structure is constructed from the result. This approach provides satisfactory accuracy; however, similar sequences cannot be found for every primary sequence. In such cases, de novo methods are applied, which employ no information other than the primary structure and state the problem as energy minimization.

Dill’s hydrophobic-polar (HP) model belongs to the de novo methods. In this model, the primary protein structure is specified by a sequence $S = \sigma_1 \sigma_2 \ldots \sigma_n$, $\sigma_i \in \{H, P\}$, $i = \overline{1, n}$, where $H$ and $P$ denote a hydrophobic and a polar amino acid, respectively, and $n$ is the number of amino acids in the molecule. The tertiary structure is a non-self-intersecting path of the corresponding length in a certain discrete lattice. Amino-acid residues are located sequentially at the nodes of the path. It is assumed that hydrophobic bonds occur between hydrophobic residues in neighboring nodes (but not between residues that are neighbors in the primary sequence). The energy of the structure is the number of bonds in it taken with the minus sign. The problem is to find the structure with minimum energy. More formally, if an amino acid $\sigma_i$ in the structure $S$ is associated with a node $U(\sigma_i)$, the energy is defined by the formula

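$$E(S) = -\sum_{1 \le i \le j-2 \le n-2} I(U(\sigma_i), U(\sigma_j))\, h(\sigma_i)\, h(\sigma_j),$$

where

$$I(U_1, U_2) = \begin{cases} 1 & \text{if nodes } U_1 \text{ and } U_2 \text{ are neighboring}, \\ 0 & \text{otherwise}; \end{cases} \qquad h(\sigma) = \begin{cases} 1 & \text{if } \sigma = H, \\ 0 & \text{if } \sigma = P. \end{cases}$$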
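The energy formula can be evaluated directly. The following is a minimal Python sketch, assuming a cubic lattice in which two nodes are neighboring when they differ by one unit step along a single axis; the names `are_neighbors` and `hp_energy` are illustrative and do not come from the paper.

```python
from itertools import combinations

def are_neighbors(u, v):
    # Assumption: cubic lattice, so neighbors differ by one unit step.
    return sum(abs(a - b) for a, b in zip(u, v)) == 1

def hp_energy(sequence, nodes):
    """Minus the number of H-H contacts between residues that are
    neighboring on the lattice but at least two apart in the chain."""
    assert len(sequence) == len(nodes)
    contacts = 0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i >= 2 and sequence[i] == "H" and sequence[j] == "H":
            if are_neighbors(nodes[i], nodes[j]):
                contacts += 1
    return -contacts

# A four-residue chain folded into a unit square in the plane z = 0.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(hp_energy("HPPH", square))  # first and last residues are in contact
print(hp_energy("HPPP", square))  # no pair of hydrophobic residues
```

The condition `j - i >= 2` mirrors the summation range, which excludes residues that are adjacent in the primary sequence.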

LATTICES

The term “lattice” is used in various fields of mathematics. In group theory, a lattice $L$ in a Euclidean space $\mathbb{R}^n$ is defined as a discrete subgroup of $\mathbb{R}^n$ [14]. It can be represented as a set of vectors $L = \left\{ \sum_{i=1}^{m} a_i e_i \mid a_i \in \mathbb{Z} \right\}$, where $B = \{e_1, e_2, \ldots, e_m\} \subset \mathbb{R}^n$ is some basis and $\mathbb{Z}$ is the set of integers. Different bases can generate identical lattices. In what follows, we will consider lattices in a three-dimensional Euclidean space and call their elements nodes.

A lattice is said to be invariant with respect to a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ if $f(v) \in L$ for any $v \in L$. Since a lattice is a subgroup of $\mathbb{R}^n$, it is invariant with respect to translation by any vector $v \in L$.

Neighboring in a Lattice. To formalize the concept of a path in a lattice, it is necessary to introduce the concept of neighboring nodes. To this end, we specify a binary neighboring relation $R \subseteq L \times L$: a node $v \in L$ is neighboring to the node $u \in L$ if and only if $(u, v) \in R$. Neighboring in a lattice is invariant with respect to a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ if the lattice is invariant with respect to $f$ and for any two neighboring nodes $u, v \in L$ the nodes $f(u)$ and $f(v)$ are also neighboring. If the neighboring relation is invariant with respect to translation by any vector $v \in L$, then it can be defined using some set of


neighborhood vectors $V = \{v_1, \ldots, v_s\} \subset L$: a node $c_1 \in L$ is neighboring to a node $c_2 \in L$ if and only if $c_1 - c_2 \in V$. Conversely, if neighboring is specified as described above, it is invariant with respect to translation by any vector $v \in L$. For a neighboring relation to be symmetric, the condition $v \in V \Rightarrow -v \in V$ should be satisfied.

By a path of length $m$ in a lattice $L$ with neighboring relation $R$ we mean a sequence $c^{(m)} = c_1 c_2 \ldots c_m$ such that $c_i \in L$, $i = \overline{1, m}$, and the condition $(c_i, c_{i+1}) \in R$, $i = \overline{1, m-1}$ (connectivity condition) is satisfied. Note that a path does not contain self-intersections if $c_i = c_j$ implies $i = j$ for $i, j = \overline{1, m}$.
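The definitions of neighborhood vectors, connectivity, and self-avoidance translate into short checks. The sketch below assumes the six neighborhood vectors of the cubic lattice as the set $V$; the function names are hypothetical.

```python
# Assumption: neighborhood vectors of the three-dimensional cubic lattice.
V = {(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)}

def is_symmetric(vectors):
    # The neighboring relation is symmetric when v in V implies -v in V.
    return all(tuple(-x for x in v) in vectors for v in vectors)

def is_connected(nodes, vectors=V):
    # Connectivity condition: c_i and c_{i+1} are neighboring,
    # i.e. their difference c_i - c_{i+1} belongs to V.
    return all(
        tuple(a - b for a, b in zip(c, d)) in vectors
        for c, d in zip(nodes, nodes[1:])
    )

def is_self_avoiding(nodes):
    # c_i = c_j must imply i = j, so no node may repeat.
    return len(set(nodes)) == len(nodes)

good = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
broken = [(0, 0, 0), (1, 1, 0)]                    # diagonal step is not in V
looped = [(0, 0, 0), (1, 0, 0), (0, 0, 0)]         # revisits the first node
print(is_connected(good), is_self_avoiding(good))
print(is_connected(broken), is_self_avoiding(looped))
```

A sequence of coordinates such as `broken` shows why a direct coordinate representation is awkward for search: most coordinate sequences are not paths at all.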

ENCODING

In solving combinatorial optimization problems whose solution space consists of paths in some discrete lattice, it often turns out to be inefficient to represent a path directly as a sequence of coordinates of lattice nodes, since not every such sequence satisfies the path connectivity condition. Other ways of encoding a path can serve as an alternative representation.

Let $\mathbb{N}$ be the set of natural numbers. In what follows, by an encoding of a path $c_1 c_2 \ldots c_m$, denoted by $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_k$, we will mean a sequence $s_1 s_2 \ldots s_{m+p}$, $s_i \in S$, $i = \overline{1, m+p}$, where $S$ is the encoding alphabet and $p \in \mathbb{Z}$ is the encoding parameter, if the following conditions are satisfied:

- for any $m \in \mathbb{N}$ and $s_1, s_2, \ldots, s_{m+p} \in S$ there exist $c_1, c_2, \ldots, c_m \in L$ such that the equality $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_{m+p}$ is true;

- if $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_{m+p}$, then $\mathrm{Enc}(c_1 c_2 \ldots c_m c_{m+1}) = s_1 s_2 \ldots s_{m+p} s_{m+p+1}$.

An encoding is invariant with respect to a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ if the lattice and the neighboring relation in it are invariant with respect to $f$ and, for any $m \in \mathbb{N}$ and $c_1, c_2, \ldots, c_m \in L$, the condition $\mathrm{Enc}(c_1 c_2 \ldots c_m) = s_1 s_2 \ldots s_{m+p}$ yields $\mathrm{Enc}(f(c_1) f(c_2) \ldots f(c_m)) = s_1 s_2 \ldots s_{m+p}$.

Encoding divides the set of paths in a lattice into equivalence classes: paths $c_1 c_2 \ldots c_m$ and $c'_1 c'_2 \ldots c'_m$ belong to one class if and only if $\mathrm{Enc}(c_1 c_2 \ldots c_m) = \mathrm{Enc}(c'_1 c'_2 \ldots c'_m)$.

Absolute Encoding. An encoding defined by the formula

$$\mathrm{Enc}_{abs}(c_1 c_2 \ldots c_m) = (c_2 - c_1)(c_3 - c_2) \ldots (c_m - c_{m-1}) \qquad (1)$$

is called absolute. Its alphabet is the whole set $V$, and the value of the parameter is $p = -1$. It can be shown that absolute encoding is invariant with respect to translation by any vector $v \in L$ and divides the set of paths into equivalence classes inside which elements are equal up to translation.

It is convenient to use absolute encoding to solve problems where the coordinates of the initial point in paths are not

important. An advantage of absolute encoding over a sequence of node coordinates is that the path connectivity condition is

satisfied automatically.
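As an illustration of formula (1), the following Python sketch computes the absolute encoding of a path and rebuilds the nodes from a starting node. The names `enc_abs` and `dec_abs` are hypothetical, and integer coordinates on a cubic lattice are assumed for the example data.

```python
def enc_abs(nodes):
    """Formula (1): the sequence of differences c_{i+1} - c_i."""
    return [tuple(b - a for a, b in zip(c, d)) for c, d in zip(nodes, nodes[1:])]

def dec_abs(start, steps):
    """Rebuild the nodes from a starting node and an absolute encoding."""
    nodes = [tuple(start)]
    for step in steps:
        nodes.append(tuple(a + b for a, b in zip(nodes[-1], step)))
    return nodes

path = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
shifted = [(x + 5, y - 2, z + 3) for x, y, z in path]

steps = enc_abs(path)
print(steps)                        # m - 1 = 3 neighborhood vectors
print(enc_abs(shifted) == steps)    # translation does not change the encoding
print(dec_abs((0, 0, 0), steps) == path)
```

Any sequence of neighborhood vectors decodes to a connected sequence of nodes, although self-intersections still have to be checked separately.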

Applying Quaternions in Protein Structure Prediction Problems. In what follows, we will use quaternions to describe rotations. Let us present the main facts [15] necessary to formalize the problem under study.

The skew field of quaternions $\mathbb{H}$ is the set of pairs of the form $(a, \vec{u})$, where $a \in \mathbb{R}$ and $\vec{u} \in \mathbb{R}^3$, with addition and multiplication operations defined as follows:

$$(a, \vec{u}) + (b, \vec{v}) = (a + b, \vec{u} + \vec{v}),$$

$$(a, \vec{u})(b, \vec{v}) = (ab - \vec{u} \cdot \vec{v},\ a\vec{v} + b\vec{u} + \vec{u} \times \vec{v}).$$

The dot and the symbol $\times$ denote the scalar and vector products, respectively.

The norm of a quaternion $q = (\alpha, (x, y, z))$ is naturally defined as $\|q\| = \sqrt{\alpha^2 + x^2 + y^2 + z^2}$. The quaternion conjugate to a given quaternion $q = (\alpha, (x, y, z))$ is the quaternion $q^* = (\alpha, -(x, y, z))$.

The formula for the multiplicative inverse of a quaternion has the form

$$q^{-1} = \frac{q^*}{\|q\|^2}.$$
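A short worked example may help to fix the conventions. The two quaternions below are chosen purely for illustration: the first product multiplies two pure-vector quaternions, and the second part checks the inverse formula on $q = (1, (0, 0, 1))$.

```latex
\begin{align*}
(0,(1,0,0))\,(0,(0,1,0))
  &= \bigl(0 \cdot 0 - (1,0,0)\cdot(0,1,0),\;
     0\,(0,1,0) + 0\,(1,0,0) + (1,0,0)\times(0,1,0)\bigr) \\
  &= (0,(0,0,1)), \\[4pt]
q = (1,(0,0,1)): \quad
  \|q\|^2 &= 1^2 + 0^2 + 0^2 + 1^2 = 2, \qquad
  q^* = (1,(0,0,-1)), \qquad
  q^{-1} = \tfrac{1}{2}\,(1,(0,0,-1)), \\[4pt]
q\,q^{-1}
  &= \tfrac{1}{2}\bigl(1 \cdot 1 - (0,0,1)\cdot(0,0,-1),\;
     (0,0,-1) + (0,0,1) + (0,0,1)\times(0,0,-1)\bigr) \\
  &= (1,(0,0,0)).
\end{align*}
```

The last line confirms that $q q^{-1}$ is the quaternion $(1, (0, 0, 0))$, which acts as the unit of multiplication.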


A three-dimensional vector $\vec{v} \in \mathbb{R}^3$ can be considered as a quaternion $(0, \vec{v})$. Let $q = (\alpha, \vec{u})$ be a quaternion with unit norm and $\vec{v} \in \mathbb{R}^3$. Then the result of rotating the vector $\vec{v}$ about the axis $\vec{u}$ by the angle $\varphi = 2\arccos\alpha$ can be represented as the product $q\vec{v}q^{-1}$.
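The rotation formula can be tried out numerically. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: quaternions are stored as `(scalar, (x, y, z))` tuples, and `fractions.Fraction` keeps the arithmetic exact. Because the product uses $q^{-1}$ rather than $q^*$, any nonzero real multiple of a unit quaternion describes the same rotation, so the example uses $q = (1, (0, 0, 1))$, which is proportional to the unit quaternion for a quarter turn about the z axis.

```python
from fractions import Fraction

def q_mul(p, q):
    """Quaternion product (a, u)(b, v) = (ab - u.v, av + bu + u x v)."""
    (a, (x1, y1, z1)), (b, (x2, y2, z2)) = p, q
    return (a * b - x1 * x2 - y1 * y2 - z1 * z2,
            (a * x2 + b * x1 + y1 * z2 - z1 * y2,
             a * y2 + b * y1 + z1 * x2 - x1 * z2,
             a * z2 + b * z1 + x1 * y2 - y1 * x2))

def q_inv(q):
    """Inverse q* / ||q||^2, exact for integer or rational components."""
    a, (x, y, z) = q
    n2 = Fraction(a * a + x * x + y * y + z * z)
    return (a / n2, (-x / n2, -y / n2, -z / n2))

def rotate(q, v):
    """Vector part of the product q (0, v) q^{-1}."""
    return q_mul(q_mul(q, (0, tuple(v))), q_inv(q))[1]

quarter_turn_z = (1, (0, 0, 1))
print(rotate(quarter_turn_z, (1, 0, 0)) == (0, 1, 0))   # x axis goes to y axis
print(rotate(quarter_turn_z, (0, 0, 1)) == (0, 0, 1))   # the axis itself is fixed
```

Exact rational arithmetic is a design choice for lattice work: rotated neighborhood vectors can be compared with `==` without rounding tolerances.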

Constructing a q-Encoding. Let us consider a three-dimensional lattice $L$ with the set of neighborhood vectors $V$ for which the following property holds: if $v \in V$ is a fixed vector, then

$$\forall v' \in V\ \ \exists q_{v'} \in \mathbb{H}:\ \ (\forall v'' \in V\ \ q_{v'} v'' q_{v'}^{-1} \in V) \wedge (q_{v'} v q_{v'}^{-1} = v'). \qquad (2)$$

If we denote $Q_V = \{Q(v') \mid v' \in V\}$, where $Q(v')$ stands for the quaternion $q_{v'}$ from (2), then it can be shown that such lattices are invariant with respect to rotations specified by quaternions $q \in Q_V$.
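Property (2) can be checked mechanically for a concrete lattice. The sketch below does this for the cubic lattice with the fixed vector $(1, 0, 0)$; the table `Q` of quarter and half turns about coordinate axes is an assumption made for this example (the paper does not list specific quaternions), and the quaternions are left unnormalized so that the arithmetic stays exact.

```python
from fractions import Fraction

def q_mul(p, q):
    (a, (x1, y1, z1)), (b, (x2, y2, z2)) = p, q
    return (a * b - x1 * x2 - y1 * y2 - z1 * z2,
            (a * x2 + b * x1 + y1 * z2 - z1 * y2,
             a * y2 + b * y1 + z1 * x2 - x1 * z2,
             a * z2 + b * z1 + x1 * y2 - y1 * x2))

def q_inv(q):
    a, (x, y, z) = q
    n2 = Fraction(a * a + x * x + y * y + z * z)
    return (a / n2, (-x / n2, -y / n2, -z / n2))

def rotate(q, v):
    return q_mul(q_mul(q, (0, tuple(v))), q_inv(q))[1]

# Neighborhood vectors of the cubic lattice and the fixed vector of (2).
V = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
A0 = (1, 0, 0)

# Hypothetical choice of one quaternion per target vector.
Q = {
    (1, 0, 0): (1, (0, 0, 0)),     # zero rotation
    (-1, 0, 0): (0, (0, 0, 1)),    # half turn about z
    (0, 1, 0): (1, (0, 0, 1)),     # quarter turn about z
    (0, -1, 0): (1, (0, 0, -1)),   # quarter turn about z, opposite sense
    (0, 0, 1): (1, (0, -1, 0)),    # quarter turn about y
    (0, 0, -1): (1, (0, 1, 0)),    # quarter turn about y, opposite sense
}

def satisfies_property_2(vectors, a0, table):
    for target in vectors:
        q = table[target]
        if rotate(q, a0) != target:                        # q a0 q^{-1} = v'
            return False
        if any(rotate(q, w) not in vectors for w in vectors):  # q V q^{-1} in V
            return False
    return True

print(satisfies_property_2(V, A0, Q))
```

Other lattices would need their own table; the check itself does not depend on the cubic case.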

In the two-dimensional case, relative encoding is used along with absolute encoding: the position of the next amino acid is specified with respect to the previous one [7]. Using quaternions, we design a procedure that constructs from the absolute encoding of a path its encoding $\mathrm{Enc}_q$, which we call q-encoding. Note that q-encoding is the three-dimensional analog of the relative encoding used in lattices on a plane.

Let us fix some neighborhood vector $a_0 \in V$. From condition (2) it follows that there exists a function $Q: V \to Q_V$ such that for all $a \in V$ the equality $Q(a)\, a_0\, Q(a)^{-1} = a$ holds; as $Q(a_0)$ we take the identity quaternion, i.e., the one describing the zero rotation: $Q(a_0) = (1, (0, 0, 0))$. Let us prove the following statement.

Statement. If $q \in Q_V$, then $Q(q a_0 q^{-1}) = q$.

Proof. From the condition $q \in Q_V$ it follows that there exists a vector $v \in V$ such that $Q(v) = q$ and simultaneously $q a_0 q^{-1} = v$. Substituting the second equality into the first one yields the required statement. The proof is completed.

Let now the absolute path encoding $a_1 a_2 \ldots a_{m-1} = \mathrm{Enc}_{abs}(c_1 c_2 \ldots c_m)$ calculated by formula (1) be given. We will construct a q-encoding $r_1 r_2 \ldots r_{m-2} = \mathrm{Enc}_q(c_1 c_2 \ldots c_m)$, $r_i \in Q_V$, by the following rules:

$$r_0 = Q(a_1); \qquad r_k = Q(r_{k-1}^{-1} r_{k-2}^{-1} \ldots r_0^{-1}\, a_{k+1}\, r_0 r_1 \ldots r_{k-1}), \quad k = \overline{1, m-2}.$$
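The construction rules can be sketched in Python as follows. This is a minimal illustration for the cubic lattice, assuming $a_0 = (1, 0, 0)$ and a hypothetical table `Q` of unnormalized quaternions (one per neighborhood vector); `enc_q` and the helper names are not taken from the paper.

```python
from fractions import Fraction

def q_mul(p, q):
    (a, (x1, y1, z1)), (b, (x2, y2, z2)) = p, q
    return (a * b - x1 * x2 - y1 * y2 - z1 * z2,
            (a * x2 + b * x1 + y1 * z2 - z1 * y2,
             a * y2 + b * y1 + z1 * x2 - x1 * z2,
             a * z2 + b * z1 + x1 * y2 - y1 * x2))

def q_inv(q):
    a, (x, y, z) = q
    n2 = Fraction(a * a + x * x + y * y + z * z)
    return (a / n2, (-x / n2, -y / n2, -z / n2))

def rotate(q, v):
    """Vector part of q (0, v) q^{-1}."""
    return q_mul(q_mul(q, (0, tuple(v))), q_inv(q))[1]

# Hypothetical function Q for the cubic lattice with a_0 = (1, 0, 0).
Q = {
    (1, 0, 0): (1, (0, 0, 0)),
    (-1, 0, 0): (0, (0, 0, 1)),
    (0, 1, 0): (1, (0, 0, 1)),
    (0, -1, 0): (1, (0, 0, -1)),
    (0, 0, 1): (1, (0, -1, 0)),
    (0, 0, -1): (1, (0, 1, 0)),
}

def enc_q(steps):
    """q-encoding r_1 ... r_{m-2} of an absolute encoding a_1 ... a_{m-1}."""
    acc = Q[tuple(steps[0])]                # r_0 = Q(a_1)
    code = []
    for a in steps[1:]:
        # (r_0 ... r_{k-1})^{-1} a_{k+1} (r_0 ... r_{k-1})
        local = rotate(q_inv(acc), a)
        r = Q[local]
        code.append(r)
        acc = q_mul(acc, r)                 # extend the running product by r_k
    return code

# Straight step, then a turn in the plane, then a step out of the plane.
steps = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
for r in enc_q(steps):
    print(r)
```

Keeping the running product `acc` avoids recomputing the whole chain of inverses at every step, so the encoding takes a number of quaternion products linear in the path length.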

Let us also derive the inverse scheme: given a q-encoding $r_1 r_2 \ldots r_{m-2}$, construct a path $c_1 c_2 \ldots c_m$ such that $r_1 r_2 \ldots r_{m-2} = \mathrm{Enc}_q(c_1 c_2 \ldots c_m)$.

THEOREM 1. If $a_1 = a_0$ and $a_k = r_1 r_2 \ldots r_{k-1}\, a_0\, r_{k-1}^{-1} \ldots r_2^{-1} r_1^{-1}$, $k = \overline{2, m-1}$, then the encoding $r'_1 r'_2 \ldots r'_{m-2}$ that corresponds to the absolute encoding $a_1 a_2 \ldots a_{m-1}$ equals $r_1 r_2 \ldots r_{m-2}$.

Proof. Indeed, we have

$$r'_0 = Q(a_1) = Q(a_0) = (1, (0, 0, 0));$$

$$r'_1 = Q(r_0'^{-1} a_2 r'_0) = Q(r_0^{-1} r_1 a_0 r_1^{-1} r_0) = Q(r_1 a_0 r_1^{-1}) = r_1.$$

Then we use mathematical induction. Let $r_t = r'_t$ for all $t = \overline{1, k}$. Let us show that $r_{k+1} = r'_{k+1}$:

$$r'_{k+1} = Q(r_k'^{-1} r_{k-1}'^{-1} \ldots r_0'^{-1}\, a_{k+2}\, r'_0 r'_1 \ldots r'_k)$$

$$= Q(r_k^{-1} r_{k-1}^{-1} \ldots r_0^{-1}\, r_1 r_2 \ldots r_{k+1}\, a_0\, r_{k+1}^{-1} \ldots r_1^{-1}\, r_0 r_1 \ldots r_k) = Q(r_{k+1} a_0 r_{k+1}^{-1}) = r_{k+1}.$$

Thus we have determined how a q-encoding can be used to obtain the absolute encoding, from which, in turn, the sequence of path nodes can be obtained by formula (1).

The theorem is proved.
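The inverse scheme of Theorem 1 can be sketched in the same style. The code below assumes the cubic lattice with $a_0 = (1, 0, 0)$ and a q-encoding whose elements map the neighborhood vectors onto themselves (so every rotated step has integer components); `dec_q` and `nodes_from` are hypothetical names.

```python
from fractions import Fraction

def q_mul(p, q):
    (a, (x1, y1, z1)), (b, (x2, y2, z2)) = p, q
    return (a * b - x1 * x2 - y1 * y2 - z1 * z2,
            (a * x2 + b * x1 + y1 * z2 - z1 * y2,
             a * y2 + b * y1 + z1 * x2 - x1 * z2,
             a * z2 + b * z1 + x1 * y2 - y1 * x2))

def q_inv(q):
    a, (x, y, z) = q
    n2 = Fraction(a * a + x * x + y * y + z * z)
    return (a / n2, (-x / n2, -y / n2, -z / n2))

def rotate(q, v):
    """Vector part of q (0, v) q^{-1}."""
    return q_mul(q_mul(q, (0, tuple(v))), q_inv(q))[1]

def dec_q(code, a0=(1, 0, 0)):
    """Absolute encoding with a_1 = a_0 and
    a_k = (r_1 ... r_{k-1}) a_0 (r_1 ... r_{k-1})^{-1}."""
    steps = [tuple(a0)]
    acc = (1, (0, 0, 0))                       # empty product
    for r in code:
        acc = q_mul(acc, r)                    # r_1 r_2 ... r_{k-1}
        # Components are integral under the assumption stated above.
        steps.append(tuple(int(c) for c in rotate(acc, a0)))
    return steps

def nodes_from(start, steps):
    """Recover the path nodes from the absolute encoding, as in formula (1)."""
    nodes = [tuple(start)]
    for s in steps:
        nodes.append(tuple(a + b for a, b in zip(nodes[-1], s)))
    return nodes

# Zero rotation, quarter turn about z, quarter turn about y (unnormalized).
code = [(1, (0, 0, 0)), (1, (0, 0, 1)), (1, (0, -1, 0))]
steps = dec_q(code)
print(steps)
print(nodes_from((0, 0, 0), steps))
```

A q-encoding of length $m - 2$ yields $m - 1$ steps and therefore $m$ nodes, which matches the encoding parameter $p = -2$.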

Let us prove an important property of q-encoding.

THEOREM 2. A rotation described by a quaternion $q \in Q_V$, applied to the path specified by the absolute encoding $a_1 a_2 \ldots a_{m-1}$ with $a_1 = a_0$, does not change its q-encoding.

Proof. Let $r_1 r_2 \ldots r_{m-2}$ be the q-encoding that corresponds to the absolute encoding $a_1 a_2 \ldots a_{m-1}$ such that $a_1 = a_0$, $a_k = r_1 r_2 \ldots r_{k-1}\, a_0\, r_{k-1}^{-1} \ldots r_2^{-1} r_1^{-1}$, $k = \overline{2, m-1}$. Let us consider a quaternion $q \in Q_V$ and determine the q-encoding for the path


obtained by the rotation corresponding to the quaternion $q$. Denote

$$a'_1 = q a_1 q^{-1} = q a_0 q^{-1},$$

$$a'_k = q a_k q^{-1} = q\, r_1 r_2 \ldots r_{k-1}\, a_0\, r_{k-1}^{-1} \ldots r_2^{-1} r_1^{-1}\, q^{-1}, \quad k = \overline{2, m-1}.$$

Construct a relative encoding for $a'_1 a'_2 \ldots a'_{m-1}$:

$$r'_0 = Q(a'_1) = Q(q a_0 q^{-1}) = q,$$

$$r'_1 = Q(r_0'^{-1} a'_2 r'_0) = Q(q^{-1} a'_2 q) = Q(q^{-1} q a_2 q^{-1} q) = Q(a_2) = Q(r_0^{-1} a_2 r_0) = r_1.$$

Then assume that $r_t = r'_t$ for all $t = \overline{1, k}$ and show by mathematical induction that $r'_{k+1} = r_{k+1}$:

$$r'_{k+1} = Q(r_k'^{-1} r_{k-1}'^{-1} \ldots r_0'^{-1}\, a'_{k+2}\, r'_0 r'_1 \ldots r'_k)$$

$$= Q(r_k^{-1} r_{k-1}^{-1} \ldots r_1^{-1} q^{-1}\, q a_{k+2} q^{-1}\, q r_1 \ldots r_k) = Q(r_k^{-1} r_{k-1}^{-1} \ldots r_1^{-1}\, a_{k+2}\, r_1 \ldots r_k)$$

$$= Q(r_k^{-1} r_{k-1}^{-1} \ldots r_1^{-1} r_0^{-1}\, a_{k+2}\, r_0 r_1 \ldots r_k) = r_{k+1}.$$

Thus, the q-encoding that corresponds to the absolute encoding $a'_1 a'_2 \ldots a'_{m-1}$ is $r'_1 r'_2 \ldots r'_{m-2} = r_1 r_2 \ldots r_{m-2}$.

The theorem is proved.
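The statement of Theorem 2 can be checked on a small example. The sketch below again assumes the cubic lattice, $a_0 = (1, 0, 0)$, and a hypothetical table `Q` of unnormalized quaternions; it rotates an absolute encoding that starts with $a_0$ by every element of the table and compares the resulting q-encodings.

```python
from fractions import Fraction

def q_mul(p, q):
    (a, (x1, y1, z1)), (b, (x2, y2, z2)) = p, q
    return (a * b - x1 * x2 - y1 * y2 - z1 * z2,
            (a * x2 + b * x1 + y1 * z2 - z1 * y2,
             a * y2 + b * y1 + z1 * x2 - x1 * z2,
             a * z2 + b * z1 + x1 * y2 - y1 * x2))

def q_inv(q):
    a, (x, y, z) = q
    n2 = Fraction(a * a + x * x + y * y + z * z)
    return (a / n2, (-x / n2, -y / n2, -z / n2))

def rotate(q, v):
    """Vector part of q (0, v) q^{-1}."""
    return q_mul(q_mul(q, (0, tuple(v))), q_inv(q))[1]

# Hypothetical function Q for the cubic lattice with a_0 = (1, 0, 0).
Q = {
    (1, 0, 0): (1, (0, 0, 0)),
    (-1, 0, 0): (0, (0, 0, 1)),
    (0, 1, 0): (1, (0, 0, 1)),
    (0, -1, 0): (1, (0, 0, -1)),
    (0, 0, 1): (1, (0, -1, 0)),
    (0, 0, -1): (1, (0, 1, 0)),
}

def enc_q(steps):
    """q-encoding r_1 ... r_{m-2}; r_0 is used only to start the product."""
    acc = Q[tuple(steps[0])]
    code = []
    for a in steps[1:]:
        r = Q[rotate(q_inv(acc), a)]
        code.append(r)
        acc = q_mul(acc, r)
    return code

def rotate_steps(q, steps):
    """Absolute encoding a'_k = q a_k q^{-1} of the rotated path."""
    return [rotate(q, a) for a in steps]

steps = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]   # a_1 = a_0
base = enc_q(steps)
unchanged = all(enc_q(rotate_steps(q, steps)) == base for q in Q.values())
print(unchanged)
```

Only $r_0$ differs between the rotated and the original path, and $r_0$ is not part of the q-encoding, which is why all rotated copies collapse to one code.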

As a consequence, in combinatorial optimization problems where the path shape rather than location is important,

using q-encoding narrows the space of alternative solutions, which reduces the search time.

CONCLUSIONS

In combinatorial optimization problems where the solution space consists of paths in some discrete lattice, different representations of a path as a mathematical object can be used. The expediency of choosing one representation or another depends on the features of the problem. Absolute encoding is computationally simpler and distinguishes paths up to translation. The proposed q-encoding stores information about the path shape but not about its position in space; as a result, it has advantages when used in algorithms for solving the tertiary protein structure prediction problem, such as immune algorithms or ant colony optimization algorithms [6–8, 16, 17]. Using quaternions makes it possible to construct q-encoding in three-dimensional lattices and substantially reduces the computational cost of the procedures used earlier [18], which in turn reduces the cost of the algorithms that solve the problem.

The proposed encoding can be used not only in algorithms that model and predict the structure of protein molecules but also in other problems that require analyzing the space of three-dimensional curves specified in a discrete lattice.

REFERENCES

1. K. A. Dill, “Theory for the folding and stability of globular proteins,” Biochemistry, No. 24(6), 1501–1509 (1985).

2. K. Dill, S. Bromberg, K. Yue, et al., “Principles of protein folding — A perspective from simple exact models,”

Protein Sci., No. 4, 561–602 (1995).

3. K. A. Dill, S. Banu Ozkan, M. Scott Shell, and T. R. Weikl, “The protein folding problem,” Ann. Rev. Biophysics.,

No. 37, 289–316 (2008).

4. P. Crescenzi, D. Goldman, C. Papadimitriou, et al., “On the complexity of protein folding,” J. Comput. Biology,

No. 5(3), 423–465 (1998).

5. S. Istrail and F. Lam, “Combinatorial algorithms for protein folding in lattice models: A survey of mathematical

results,” Commun. Inf. Syst., No. 9(4), 303–345 (2009).


6. V. Cutello, G. Nicosia, M. Pavone, and J. Timmis, “An immune algorithm for protein structure prediction on lattice

models,” IEEE Trans. Evol. Comput., No. 11(1), 101–117 (2007).

7. A. Shmygelska and H. Hoos, “An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein

folding problem,” BMC Bioinformatics, No. 6(30), 30–52 (2005).

8. S. Fidanova and I. Lirkov, “Ant colony system approach for protein folding,” Int. Conf. Multiconf. Comput. Sci. and

Inform. Techn. (2008), pp. 887–891.

9. H. Greenberg, W. Hart, and G. Lancia, “Opportunities for combinatorial optimization in computational biology,”

INFORMS J. Comput., No. 16(3), 211–231 (2004).

10. W. Wei and T. Yanlin, “A new algorithm for 2D hydrophobic-polar model: An algorithm based on hydrophobic core

in square lattice,” Pak. J. Biol. Sci., No. 11, 1815–1819 (2008).

11. P. Festa, “Optimization problems in molecular biology: A survey and critical review,” Int. Math. Forum, 3, No. 6,

269–289 (2008).

12. C. B. Anfinsen, J. T. Edsall, and F. M. Richards, Advances in Protein Chemistry, Acad. Press, London (1965).

13. A. M. Gupal and I. V. Sergienko, Optimal Recognition Procedures [in Russian], Naukova Dumka, Kyiv (2008).

14. J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, Springer-Verlag, New York (1998).

15. J. B. Kuipers, Quaternions and Rotation Sequences, Princeton Univ. Press, Princeton (1999).

16. V. Cutello, G. Morelli, G. Nicosia, et al., “On discrete models and immunological algorithms for protein structure

prediction,” Nat. Comput., No. 10, 91–102 (2011).

17. L. F. Hulianytskyi and V. O. Rudyk, “Analysis of the algorithms predicting protein tertiary structure based on the ant

colony optimization method,” in: V. Velichko, O. Voloshin, and K. Markov (eds.), Problems of Computer

Intellectualization, V. M. Glushkov Inst. of Cybernetics, ITHEA, Kyiv–Sofia (2012), pp. 152–159.

18. V. O. Rudyk, “Representing the protein structure in three-dimensional discrete lattices of arbitrary type,” Teoriya

Optym. Rishen’, No. 10, 38–47 (2011).
