[ieee 2006 ieee symposium on computational intelligence and bioinformatics and computational biology...

A Differential Evolution Approach for Protein Folding

R. Bitello and H.S. Lopes

Bioinformatics Laboratory/CPGEI, Federal Technical University - Paraná

Av. 7 de Setembro, 3165 - 80230-901 Curitiba, Brazil

Abstract- This work presents differential evolution (DE) applied to the protein folding problem, using the hydrophobic-polar model. Protein folding is a relevant problem in bioinformatics for which many heuristic algorithms have been proposed. DE is a relatively recent evolutionary algorithm and has been applied successfully to several engineering optimization problems, usually with continuous variables. We introduce the concept of genotype-phenotype mapping in DE in order to map a given folding into a real-valued vector. The methodology is detailed and several experiments with benchmarks are reported. We compared the results with other implementations, and DE proved to be competitive, robust and very promising.

I. INTRODUCTION

In biological systems, proteins are the most abundant and functionally diverse molecules, and almost all vital processes depend on these macromolecules, which are composed of amino acid chains. The 20 common types of amino acids can be combined in a linear sequence that carries the information necessary for the generation of a unique three-dimensional structure. The exact way proteins fold just after being synthesized in the ribosome is still unknown. As a consequence, many computational approaches have been proposed to simulate protein folding [1]. However, even simple models, such as the one used in this work, are still computationally expensive. In recent years, several methods have been proposed in the quest to unravel the protein folding problem (PFP), such as Monte Carlo simulation [2], genetic algorithms [3] and ant colony optimization [4]. Nevertheless, the PFP is still an important issue in bioinformatics.

The objective of this work is to verify to what extent the Differential Evolution algorithm is applicable to the PFP using the 2D-HP model, and to compare its performance with other similar algorithms published recently.

II. THE 2D-HP MODEL

Amongst the several models used to simulate how a protein folds, the hydrophobic-polar (HP) model is possibly the simplest and the most widely used. The HP model was proposed by Dill [5], who demonstrated that some behavioral properties of real-world proteins could be inferred by using this model. In this model, a protein is represented as a heteropolymer with predefined hydrophobicity patterns. That is, the amino acids of a protein are considered either hydrophobic (aversion to water) or polar (affinity to water, the same as hydrophilic) [6]. Put simply, a protein is a string of characters defined over the alphabet {H, P}. Every amino acid of the chain is allowed to occupy a given position in a square lattice. Therefore, the angle between consecutive amino acids is a multiple of 90 degrees. In a valid conformation, amino acids that are adjacent in the sequence are also adjacent in the two-dimensional (2D) or three-dimensional (3D) lattice, and every point of the lattice can be occupied by at most one amino acid. Despite the simplicity of the model, the problem of finding the optimal conformation is NP-hard.

This work was supported by the Brazilian National Research Council - CNPq, under grants 305720/2004-0 and 506479/04-8. Corresponding author: hslopes@pesquisador.cnpq.br

The HP model is based on the concept that the largest contribution to the free energy of a native conformation of a protein is due to the interaction between hydrophobic amino acids. These amino acids tend to group in the core of the folded protein, leaving the polar (hydrophilic) amino acids towards the outer part of the protein, in contact with the environment. The free energy of a conformation decreases as the number of non-local hydrophobic-hydrophobic bonds (or H-H contacts) increases. A non-local H-H contact is a pair of hydrophobic amino acids that are not adjacent in the sequence, but occupy adjacent positions in the (2D or 3D) lattice. Since the number of H-H contacts that occur in successive positions in the amino acid chain is fixed, the free energy depends only on the number of non-consecutive H-H pairs that are adjacent in the lattice. Therefore, minimizing the free energy is equivalent to maximizing the number of non-local H-H contacts. Figure 1 presents the 2D-HP model for an 18-amino-acid-long protein folded in such a way that 6 non-local H-H contacts occur. In this figure, black dots are hydrophobic amino acids and white dots are the polar ones. The H-H contacts are represented by dotted lines.

The free energy function suggested by [7] is represented in equation 1:

$$E = \sum_{i<j} e_{v_i v_j} \, \Delta(r_i - r_j) \tag{1}$$

where $\Delta(r_i - r_j) = 1$ if amino acids $r_i$ and $r_j$ constitute a non-local contact, and $\Delta(r_i - r_j) = 0$ otherwise. Depending on the type of contact between amino acids, the energy $e_{v_i v_j}$ will be $e_{HH}$, $e_{HP}$ or $e_{PP}$, corresponding to H-H, H-P and P-P contacts, respectively.
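As an illustration of equation 1, the following minimal sketch sums the pair energies over all non-local lattice contacts of a 2D conformation. The function name `conformation_energy`, the coordinate-list input and the dictionary `LI_ENERGIES` are assumptions made for this example only; the values -2.3, -1 and 0 are taken as $e_{HH}$, $e_{HP}$ and $e_{PP}$, in that order, which is also an assumption.

```python
def conformation_energy(sequence, coords, pair_energy):
    """Energy of equation 1 for a 2D lattice conformation.

    sequence    -- string over {H, P}
    coords      -- list of (x, y) lattice positions, one per amino acid
    pair_energy -- dict with keys "HH", "HP" and "PP"
    """
    position_of = {pos: i for i, pos in enumerate(coords)}
    energy = 0.0
    for i, (x, y) in enumerate(coords):
        # Look only right and up so that every lattice edge is visited once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position_of.get(neighbour)
            if j is None or abs(i - j) == 1:
                continue  # empty site or chain neighbour: Delta = 0
            key = "".join(sorted(sequence[i] + sequence[j]))
            energy += pair_energy[key]
    return energy


# Assumed assignment of the example values to e_HH, e_HP and e_PP.
LI_ENERGIES = {"HH": -2.3, "HP": -1.0, "PP": 0.0}

# Four amino acids folded into a unit square: the first and last ones touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(conformation_energy("HPPH", square, LI_ENERGIES))
```

In the square example only one non-local contact exists (between the first and the last amino acid), so the energy equals the single pair energy selected by their types.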

Fig. 1. Example of H-H contacts in the 2D-HP model.

According to [7], this model satisfies the following physical constraints:

1) Compact conformations have a smaller energy value than any non-compact conformation.

2) Hydrophobic amino acids are buried inside the conformation as much as possible. This idea is expressed by the relationship $e_{PP} > e_{HP} > e_{HH}$, which decreases the energy of conformations in which the Hs are hidden inside.

3) Different types of amino acids tend to segregate. This is expressed by the relationship $2e_{HP} > e_{PP} + e_{HH}$.

Possible values for $e_{HH}$, $e_{HP}$ and $e_{PP}$ are -2.3, -1 and 0, respectively, since they satisfy conditions 2 and 3 above. Results are not sensitive to the specific values of these energies, provided these two conditions are met.
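As a quick check, and assuming that the three example values are assigned to $e_{HH}$, $e_{HP}$ and $e_{PP}$ in that order and that condition 3 is read in the additive form used in [7], the two conditions can be verified by direct substitution:

```latex
\begin{align*}
e_{PP} = 0 \;&>\; e_{HP} = -1 \;>\; e_{HH} = -2.3 && \text{(condition 2)} \\
2e_{HP} = -2 \;&>\; e_{PP} + e_{HH} = -2.3 && \text{(condition 3)}
\end{align*}
```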

III. DIFFERENTIAL EVOLUTION

Differential Evolution (DE) is an evolutionary computation method proposed by Storn and Price around a decade ago [8], [9] for solving the Chebyshev polynomial fitting problem. The central idea of this algorithm is the use of difference vectors for generating perturbations in a population of vectors. It is especially suited for optimization problems where possible solutions are defined by a real-valued vector. The algorithm usually offers fast convergence, robustness, conceptual simplicity, few parameters and easy implementation. As a consequence, it has drawn the attention of researchers who have studied its utility for complex optimization problems [10].

The first step in using DE for a real-world problem is to set its control parameters, as follows:

* Population size: the number of possible solutions that the algorithm handles at the same time. For most problems, a small population is enough.

* Dimension of individuals - nDim: this parameter defines the length of the vectors that represent individuals. Each element of a vector represents a variable of the problem.

* Range of variables: for each variable of the problem, its upper and lower bounds should be defined.

* Weighting factor F: this weight is applied to the vector resulting from the difference between pairs of vectors ($X_2$ and $X_3$). Typically, F is a real-valued parameter in the range [0, 2].

* Maximum crossover probability CR: it is defined in the range [0, 1]. It is the probability of crossing over a given vector of the population ($X_i$) and a vector created from the weighted difference of two vectors ($F \cdot (X_2 - X_3)$) applied to another vector. This latter vector can be randomly chosen ($X_1$) or the one with the best fitness found so far ($X_{best}$). The final result of the operation is a candidate vector ($X_{candidate}$).

* Strategy for vector operations: ten different vector operations were originally defined by [9], [11]. The strategy is chosen by trial-and-error. Good results obtained by using a given strategy in one problem usually cannot be extrapolated to another problem.

* Stop criterion: usually, the time-out criterion is the most widely used, that is, the algorithm stops after a fixed number of iterations.

After defining these control parameters, the initial population is randomly created. In order to cover the search space as evenly as possible, a random number generator with uniform probability distribution should be used. Next, the fitness of each individual of the population has to be evaluated, according to the specific meaning of the elements of the vector. While the previously set stop criterion is not met, the following loop is repeated:

* For each individual $X_i$ of the population, choose at random three other vectors of the population, namely $X_1$, $X_2$ and $X_3$. Alternatively, vector $X_1$ can be replaced by the one with the best fitness value found so far ($X_{best}$). This may lead to faster convergence.

* For each element of vector $X_i$, generate a random number rnd in the range [0, 1]. If rnd < CR, the current element of the vector substitutes the corresponding element of the same index in vector $X_{modified}$ (described in the next item), generating, at the end, a new individual, $X_{candidate}$. This operation is somewhat equivalent to the crossover operator commonly used in genetic algorithms. There are several variations of this procedure.

* For each element of vector $X_1$, apply the selected strategy for vector operation. The result of this operation is a vector named $X_{modified}$, for instance, $X_{modified} = X_1 + F \cdot (X_2 - X_3)$. Usually, when the value of an element of the vector exceeds its predefined range, the closest bound (upper or lower) is assigned to it. The vector operations over $X_1$ are equivalent to the mutation operator in genetic algorithms. For large populations, such vector operations can include the differences among two to four vectors. This procedure makes convergence faster.

* Evaluate the fitness of vector $X_{candidate}$ according to the fitness function specific to the problem at hand.

* If this fitness is smaller than the fitness of $X_i$, that is, $f(X_{candidate}) < f(X_i)$, vector $X_i$ is replaced by $X_{candidate}$¹. This operation is equivalent to the selection procedure in genetic algorithms.

* If $X_{candidate}$, just included in the population, has a fitness smaller than that of $X_{best}$, then the new $X_{best}$ will be $X_{candidate}$.

Figure 2 shows graphically how the operations described above take place in a given generation, for a 2-dimensional space.
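The generation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: it uses one randomly chosen base vector and one weighted difference, it follows the convention of [9] in which an element of the modified vector is copied into the candidate when rnd < CR (the element of $X_i$ is kept otherwise), and it forces one randomly chosen element to come from the modified vector. The name `de_generation` and its arguments are hypothetical.

```python
import random


def de_generation(population, fitness, f_weight, cr, lower, upper, rng):
    """Run one DE generation in place (minimization); return the best fitness."""
    n_dim = len(population[0])
    scores = [fitness(x) for x in population]
    for i, x_i in enumerate(population):
        # Choose three other vectors X1, X2, X3 at random.
        others = [k for k in range(len(population)) if k != i]
        r1, r2, r3 = rng.sample(others, 3)
        x1, x2, x3 = population[r1], population[r2], population[r3]
        # Mutation: X1 + F * (X2 - X3), clamped to the closest bound.
        modified = [min(upper, max(lower, x1[j] + f_weight * (x2[j] - x3[j])))
                    for j in range(n_dim)]
        # Crossover (assumed convention): take the modified element when
        # rnd < CR; one forced index guarantees the candidate differs from X_i.
        forced = rng.randrange(n_dim)
        candidate = [modified[j] if (j == forced or rng.random() < cr) else x_i[j]
                     for j in range(n_dim)]
        # Selection: the candidate replaces X_i only if it is strictly better.
        score = fitness(candidate)
        if score < scores[i]:
            population[i], scores[i] = candidate, score
    return min(scores)


# Usage on a toy problem (sphere function), with F = 0.85 and CR = 0.8.
rng = random.Random(0)
pop = [[rng.uniform(-3.0, 3.0) for _ in range(5)] for _ in range(20)]
sphere = lambda x: sum(v * v for v in x)
start = min(sphere(x) for x in pop)
for _ in range(50):
    best = de_generation(pop, sphere, 0.85, 0.8, -3.0, 3.0, rng)
print(start, best)
```

Because selection is greedy, the best fitness in the population can never get worse from one generation to the next.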


Basically, there are three ways of representing an amino acid chain in a lattice using the HP model [12]:

* Cartesian coordinates: this representation is straightforward, but it is sensitive to translation and rotation, in such a way that identical conformations can have completely different sets of coordinates.

* Internal coordinates: in this representation the position of an amino acid is defined in relation to the previous one, and there are two possibilities: absolute reference or relative reference. When using absolute internal coordinates, a given movement is defined according to the axes of the lattice; when using relative internal coordinates, a movement is defined according to the previous one.

* Geometrical distance: this representation describes a structure using a matrix with the distances between all pairs of points.

The proposed implementation with DE uses relative coordinates, and so, in a two-dimensional space there are only three possible movements: (F)orward, (R)ight and (L)eft. Therefore, the phenotypical representation of a solution is defined over the alphabet {F, R, L}. The genotypical representation is still a real-valued vector. Considering $x_{ij}$ the j-th element of vector $X_i$, $P$ the string representing the sequence of movements of the folding, and $\alpha < \beta < \delta < \gamma$ arbitrary constants in $\mathbb{R}$, the genotype-phenotype mapping is defined by equation 2:

$$P_j = \begin{cases} L, & \text{if } \alpha < x_{ij} < \beta \\ F, & \text{if } \beta < x_{ij} < \delta \\ R, & \text{if } \delta < x_{ij} < \gamma \end{cases} \tag{2}$$
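A minimal sketch of the mapping of equation 2 is shown below. The function name `decode` is hypothetical, the default thresholds are the values of $\beta$ and $\delta$ reported in the experiments (-1 and +1), and the treatment of values that fall exactly on a threshold (assigned to the range above it) is an assumption, since equation 2 uses strict inequalities. Genes are assumed to be already restricted to the range between $\alpha$ and $\gamma$.

```python
def decode(vector, beta=-1.0, delta=1.0):
    """Map a real-valued genotype to a string of movements over {F, R, L}."""
    moves = []
    for x in vector:
        if x < beta:
            moves.append("L")   # alpha < x < beta
        elif x < delta:
            moves.append("F")   # beta < x < delta (boundary x == beta assumed F)
        else:
            moves.append("R")   # delta < x < gamma (boundary x == delta assumed R)
    return "".join(moves)


print(decode([-2.5, 0.0, 2.9]))  # prints "LFR"
```

Many different genotypes decode to the same string; for instance, any vector whose elements all lie between $\beta$ and $\delta$ decodes to a sequence of F movements.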

Fig. 2. Representation (in 2D), in a given generation, of a candidate vector obtained by means of vector operations.

IV. METHODOLOGY

In this section we present in detail how the DE algorithm was adapted to deal with the PFP, using the 2D-HP model. The DE algorithm was originally developed to deal with vectors of continuous variables. However, the PFP using the 2D-HP model is inherently discrete, and so some adaptations of the algorithm were necessary.

A. Vector encoding

In DE, individuals encode the variables of the problem in a vector. Usually, the real-world meaning of the elements of such a vector is straightforward. Consequently, there is no concept of genotype. On the other hand, for the specific problem dealt with in this work, the adaptation devised to represent possible solutions to the PFP in a real-valued vector led to the establishment of a genotype-phenotype mapping. Individuals in DE are real-valued vectors which, in turn, are decoded into a specific fold of an amino acid chain in a square lattice. The reason for this approach was to use the original DE algorithm without significant changes.

¹ Here we consider a minimization problem.

Notice that the proposed mapping makes it possible to favor some movements by enlarging the corresponding range in which they are defined (or by narrowing the other ranges). This strategy can be useful during evolution to better adapt to specific characteristics of the folding. Furthermore, this mapping allows several genotypes to represent a single phenotype. A given folding of N amino acids represented in an N-dimensional vector is defined by a string with N - 1 movements.

When applying DE to a constrained problem, such as the PFP, unfeasible solutions may appear during evolution. There are three basic strategies for dealing with this problem: discard the solution; fix the solution; or accept the solution with a penalty proportional to the extent of the violations of the constraints. The last alternative is especially interesting when there is a chance for the violations to be fixed by themselves along the evolution. Fixing solutions is frequently computationally intensive or too complex to be done. Here, we adopted the first alternative: when an individual represents an invalid folding (that is, there is more than one amino acid in a given position in the lattice), it is discarded. Further work will evaluate other strategies.

B. Initial population

The simplest way to generate the initial population without invalid individuals at the phenotypical level is to create "stretched" individuals, that is, all elements of the string are F. Although the initial population is exactly the same at the phenotypical level, its individuals are quite different at the genotypical level, thanks to the genotype-phenotype mapping defined before. Recall that every element of the vectors is a random number in the predefined range. This procedure guarantees an initial population with valid individuals and with reasonable diversity, a necessary condition for evolution.
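The initialization described above can be sketched as follows, assuming that every gene is drawn uniformly from the range that decodes to F (between $\beta$ and $\delta$). The function name `stretched_population`, its arguments and the use of one gene per movement (N - 1 genes for N amino acids) are assumptions made for this illustration.

```python
import random


def stretched_population(pop_size, n_moves, beta=-1.0, delta=1.0, seed=None):
    """Create individuals that all decode to F...F but differ as genotypes."""
    rng = random.Random(seed)
    return [[rng.uniform(beta, delta) for _ in range(n_moves)]
            for _ in range(pop_size)]


# A chain of 20 amino acids: 19 movements, population size 20 x 15.
pop = stretched_population(20 * 15, 19, seed=1)
print(len(pop), len(pop[0]))
```

All individuals describe the same straight chain, so none of them can be invalid, while the difference vectors between them are non-zero and DE can start perturbing the population.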

C. Fitness function

Before evaluating an individual, the real-valued vector is decoded to a string over the alphabet {F, R, L}. Next, the string is converted into a set of Cartesian coordinates, assuming that the folding always starts at the center of the lattice (coordinates {0,0}). This set of coordinates effectively represents how the protein folds in the two-dimensional lattice. Next, this folding is evaluated by counting the number of non-local H-H contacts, as defined in section II and equation 1. This fitness is based on the assumption that the non-local H-H contacts are the main force driving the folding of a protein. Therefore, considering the maximization of the number of non-local H-H contacts, for each contact, the free energy in equation 1 is increased by 1. For this study, we defined $e_{HH}$, $e_{HP}$ and $e_{PP}$ equal to 1, 0 and 0, respectively.
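The evaluation steps above can be sketched as follows. The names `fold` and `hh_contacts`, the initial heading along the positive x axis, and the use of one movement per bond (N - 1 movements for N amino acids) are assumptions made for this illustration; an invalid folding is signalled with `None` so that the caller can discard the individual.

```python
def fold(moves):
    """Convert a string over {F, R, L} to lattice coordinates.

    Returns None when two amino acids would occupy the same position.
    """
    x, y = 0, 0          # the folding starts at the center of the lattice
    dx, dy = 1, 0        # assumed initial heading
    coords = [(x, y)]
    occupied = {(x, y)}
    for move in moves:
        if move == "L":
            dx, dy = -dy, dx     # turn 90 degrees counter-clockwise
        elif move == "R":
            dx, dy = dy, -dx     # turn 90 degrees clockwise
        x, y = x + dx, y + dy
        if (x, y) in occupied:
            return None          # invalid folding
        occupied.add((x, y))
        coords.append((x, y))
    return coords


def hh_contacts(sequence, moves):
    """Number of non-local H-H contacts, or None for an invalid folding."""
    coords = fold(moves)
    if coords is None:
        return None
    position_of = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position_of.get(neighbour)
            if j is not None and abs(i - j) > 1 and sequence[i] == sequence[j] == "H":
                contacts += 1
    return contacts


print(fold("FLL"))                 # [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hh_contacts("HPPH", "FLL"))  # 1
```

Since DE is described here as a minimization procedure, a caller would typically minimize the negative of this count; that sign convention is also an assumption of the sketch.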

D. Strategies

Storn and Price [11] developed a set of strategies that allow a large number of options, depending on the nature of the problem. Such strategies are classified as follows:

1) Vector to be disturbed: it can be a randomly chosen vector of the population (rand) or the vector with the best fitness value (best). Randomly chosen vectors lead to a richer diversity, whilst with the other option convergence is faster.

2) Number of weighted differences: for a small population, the weighted difference of only two vectors is more usual. For larger populations, authors have shown that four vectors are more effective regarding convergence.

3) Type of crossover: it can be binomial (bin), when all the elements of the vector have a probability CR of crossover; or exponential (exp), when crossover is done while a randomly chosen value is less than or equal to CR.
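The difference between the two crossover types can be illustrated by the set of element indices each one selects for crossover. The sketch below is an assumption-laden illustration: the function names are hypothetical, both variants force at least one element to be selected, and the exponential variant starts at a random index and wraps around the end of the vector, as is common in DE implementations.

```python
import random


def binomial_mask(n_dim, cr, rng):
    """Each element is selected independently with probability CR."""
    forced = rng.randrange(n_dim)  # assumed: at least one element is selected
    return [j == forced or rng.random() < cr for j in range(n_dim)]


def exponential_mask(n_dim, cr, rng):
    """A contiguous (circular) run of elements is selected.

    The run starts at a random index and keeps growing while a freshly
    drawn random value is less than or equal to CR.
    """
    mask = [False] * n_dim
    j = rng.randrange(n_dim)
    copied = 0
    while True:
        mask[j] = True
        j = (j + 1) % n_dim
        copied += 1
        if copied >= n_dim or rng.random() > cr:
            break
    return mask


rng = random.Random(7)
print(binomial_mask(10, 0.8, rng))
print(exponential_mask(10, 0.8, rng))
```

With the binomial type the selected elements are scattered over the vector, whereas the exponential type always exchanges a single block of neighbouring elements.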

The choice of the strategy is done by trial-and-error, since there is no well-established procedure for choosing the best strategy for a given problem.

An interesting approach for keeping diversity in the population along the search, while at the same time facilitating convergence, is to alternate strategies, as follows. Use the strategy Best2Exp [11] while some improvement is observed in the best fitness over the last N generations. This strategy aims at fast convergence. Next, when the number of generations without improvement in the best fitness is equal to or larger than N, change to strategy Rand2Exp, and keep it for up to M generations without improvement. This last strategy aims at improving diversity. If the best fitness is improved, or M generations without improvement have elapsed, turn back to the strategy Best2Exp, clear counters N and M, and repeat the cycle.
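The alternation rule can be sketched as a small state machine. The class name `StrategySwitcher`, its method and the use of a single stall counter are assumptions made for this illustration; the rule is read as "switch to Rand2Exp after N generations without improvement, and return to Best2Exp on any improvement or after M further generations without improvement". The default limits are the values N = 100 and M = 70 used in the experiments.

```python
class StrategySwitcher:
    """Alternate between Best2Exp and Rand2Exp according to stagnation."""

    def __init__(self, n_limit=100, m_limit=70):
        self.n_limit = n_limit
        self.m_limit = m_limit
        self.strategy = "Best2Exp"
        self.stalled = 0  # generations without improvement under this strategy

    def update(self, improved):
        """Report one finished generation; return the strategy to use next."""
        if improved:
            # Any improvement restarts the cycle with the fast strategy.
            self.strategy, self.stalled = "Best2Exp", 0
            return self.strategy
        self.stalled += 1
        if self.strategy == "Best2Exp" and self.stalled >= self.n_limit:
            self.strategy, self.stalled = "Rand2Exp", 0
        elif self.strategy == "Rand2Exp" and self.stalled >= self.m_limit:
            self.strategy, self.stalled = "Best2Exp", 0
        return self.strategy


# Small limits make the cycle visible in a few generations.
switcher = StrategySwitcher(n_limit=3, m_limit=2)
print([switcher.update(False) for _ in range(5)])
```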

V. COMPUTATIONAL EXPERIMENTS AND RESULTS

For testing the DE algorithm, we used a benchmark of 9 chains found in the literature. Table I shows the instances used, including the number of amino acids (#aa), the amino acid chain translated to the HP model and the known number of non-local H-H contacts (E).

TABLE I
BENCHMARKS USED IN THE EXPERIMENTS.

#aa  HP chain                                            E
20   HPHP2H2PHP2HPH2P2HPH                                9
24   H2P2HP2HP2HP2HP2HP2HP2H2                            9
25   P2HP2H2P4H2P4H2P4H2                                 8
36   P3H2P2H2P5H7P2H2P4H2P2HP2                           14
48   P2HP2H2P2H2P5H10P6H2P2H2P2HP2H5                     23
50   H2PHPHPHPH4PHP3HP3HP4HP3HP3HPH4PHPHPHPH2            21
60   P2H3PH8P3H10PHP3H12P4H6PH2PHP                       36
64   H12PHPHP2H2P2H2P2HP2H2P2H2P2HP2H2P2H2P2HPHPH12      42
85   H4P4H12P6H12P3H12P3H12P3HP2H2P2H2P2HPH              52

Due to the stochastic nature of DE, 100 runs were done for each test instance, using different random seeds. The results reported are the average values over these 100 runs. For all experiments, the following parameters were used: population size = number of amino acids x 15; crossover probability CR = 80%; weighting factor F = 0.85. Also, we used the alternating strategies Best2Exp and Rand2Exp, as explained before, with N = 100 and M = 70. The constants that define the ranges for mapping the genotype to the phenotype were: $\alpha = -3$, $\beta = -1$, $\delta = +1$, $\gamma = +3$. The software was developed in the C programming language and all experiments were run on a PC with a 64-bit Athlon X2 processor and 1 GB of RAM.

Table II presents the results obtained by our approach and the comparison with other approaches. In this table, the first and second columns represent, respectively, the number of amino acids and the maximum number of non-local H-H contacts known to date. The next two columns are the best results found by PERM, a Monte Carlo-based algorithm [2], and by an Ant Colony Optimization algorithm [4], respectively. The fifth and sixth columns are the results obtained by Lopes and Scapin [3], using a genetic algorithm with enhanced operators. The fifth column shows the maximum number of non-local H-H contacts found by the algorithm and, within parentheses, the number of times this maximum was found in 100 independent runs. The sixth column shows the average number of non-local H-H contacts in the 100 runs. The last two columns show the results obtained by the DE algorithm described in this paper. The meaning of these columns is the same as that of the fifth and sixth columns.

TABLE II
COMPARISON OF RESULTS USING DIFFERENT APPROACHES.

#aa  Emax  PERM  ACO   GA                DE
           max   max   max      avg      max        avg
20   9     9     9     9(74x)   8.74     9(100x)    9.00
24   9     9     9     -        -        9(100x)    9.00
25   8     8     8     -        -        8(100x)    8.00
36   14    14    14    14(6x)   12.44    14(96x)    13.96
48   23    23    23    23(2x)   20.06    23(100x)   23.00
50   21    21    21    -        -        21(100x)   21.00
60   36    36    36    -        -        35(79x)    34.79
64   42    42    42    40(1x)   33.58    42(88x)    41.87
85   53    52    53    51(2x)   45.74    52(50x)    51.38

VI. DISCUSSION AND CONCLUSIONS

For chains up to 50 amino acids, all algorithms have found the maximum number of non-local H-H contacts. However, our DE algorithm was much more consistent than the GA, since it achieved the maximum in all runs of all chains (except for the 36-amino-acid-long chain). There is no information for PERM and ACO to compare with our approach regarding this issue. For the chain with 64 amino acids, PERM, ACO and DE achieved the maximum, and DE performed much better than the GA on every measure. For the chains with 60 and 85 amino acids, our DE did not achieve the maximum found by the ACO, but neither did PERM for the 85-amino-acid chain. However, the consistency of the algorithm is remarkable, when observing not only the average but also the number of times the maximum was found for all instances. This fact is very important for a stochastic algorithm, and suggests that DE has good repeatability.

We have proposed a methodology for using the differential evolution algorithm for the protein folding problem with the 2D-HP model. The DE algorithm was kept as originally described by [8], and we introduced the concept of genotypical-phenotypical mapping. Thanks to this mapping, the DE algorithm, originally devised for real-valued vectors, could be used for evolving solutions to the PFP. Considering that the selection method used is based only on the fitness function (which is based on the phenotypical representation), it is possible that promising individuals (seen at the genotypical level) could be discarded along the generations. Other implications of the proposed genotypical-phenotypical mapping are still under study and will be the focus of future work.

It is important to note that no serious attempts were made to optimize the parameters of the algorithm, nor to adjust the range constants defined in equation 2. As a consequence, it is reasonable to think that better results than those shown in Table II could be achieved, or that the same results could be achieved with smaller computational effort. Besides, we re-emphasize that we used the basic DE, while the other results cited were obtained with much more elaborate and improved versions of the algorithms (PERM, ACO and GA).

As the length of the amino acid chain increases, the problem gets harder. In fact, the lattice model and the energy function (equation 1), based only on the number of non-local H-H contacts, lead to a strongly multimodal fitness landscape with many equal-sized plateaus. This fact, by itself, makes the problem even harder for any stochastic heuristic method. Even so, the DE approach seems to be promising.

Protein folding using the 2D-HP model is an important and still open problem in bioinformatics. We believe that the proposed algorithm is a useful contribution to this area of research. In the near future we intend to do more extensive experiments, as well as to modify the classical DE with special strategies specially suited for this problem.

REFERENCES

[1] A.R. Leach, Molecular Modelling: Principles and Applications, 2nd ed. Dorset, Prentice-Hall, 2001.

[2] G. Chikenji, M. Kikuchi and Y. Iba, "Multi-self-overlap ensemble for protein folding: ground state search and thermodynamics," Physical Review Letters, vol. 83, no. 9, pp. 1886-1889, 1999.

[3] H.S. Lopes and M.P. Scapin, "An enhanced genetic algorithm for protein structure prediction using the 2D hydrophobic polar model," Lecture Notes in Computer Science, vol. 3871, pp. 238-246, 2005.

[4] A. Shmygelska and H.H. Hoos, "An improved ant colony optimisation algorithm for the 2D HP protein folding problem," Lecture Notes in Computer Science, vol. 2671, pp. 400-417, 2003.

[5] K.A. Dill, "Theory for the folding and stability of globular proteins," Biochemistry, vol. 24, pp. 1501-1509, 1985.

[6] H. Li, C. Tang and N.S. Wingreen, "Nature of driving force for protein folding: A result from analyzing the statistical potential," Physical Review Letters, vol. 79, no. 4, pp. 765-768, 1997.

[7] H. Li, R. Helling, C. Tang and N.S. Wingreen, "Emergence of preferred structures in a simple model of protein folding," Science, vol. 273, no. 5275, pp. 666-669, 1996.

[8] R.M. Storn and K.V. Price, "Differential Evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces," Technical Report TR-95-012, International Computer Science Institute, Berkeley, USA, 1995.

[9] R.M. Storn and K.V. Price, "Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, no. 4, pp. 341-359, 1997.

[10] L.S. Coelho and H.S. Lopes, "Supply chain optimization using chaotic differential evolution method," in Proc. IEEE Systems, Man and Cybernetics Conference, Taipei, Taiwan, 2006.

[11] K.V. Price, R.M. Storn and J.A. Lampinen, Differential Evolution - A Practical Approach to Global Optimization. Berlin: Springer-Verlag, 2005.

[12] A. Piccolboni and G. Mauri, "Application of evolutionary algorithms to protein folding prediction," Lecture Notes in Computer Science, vol. 1363, pp. 123-136, 1998.
