
Expert Systems with Applications 40 (2013) 698–706


A molecular dynamics and knowledge-based computational strategy to predict native-like structures of polypeptides

Márcio Dorn ⇑, Luciana S. Buriol, Luis C. Lamb

Federal University of Rio Grande do Sul, Institute of Informatics, Av. Bento Gonçalves 9500, 91501-970 Porto Alegre, RS, Brazil

Keywords: Structural bioinformatics; Molecular dynamics simulation; Protein structure prediction; Ab initio structure prediction

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2012.08.003

⇑ Corresponding author. Tel.: +55 5184398410. E-mail address: [email protected] (M. Dorn).

Abstract

One of the main research problems in structural bioinformatics is the prediction of three-dimensional (3-D) structures of polypeptides or proteins. The rate at which amino acid sequences are identified is increasing much faster than the rate of 3-D protein structure determination by experimental methods such as X-ray diffraction and NMR techniques. The determination of protein structures is both experimentally expensive and time consuming. Predicting the correct 3-D structure of a protein molecule is an intricate and arduous task. The protein structure prediction (PSP) problem is, in computational complexity theory, an NP-complete problem. In order to reduce computing time, current efforts have targeted hybridizations between ab initio and knowledge-based methods, aiming at efficient prediction of the correct structure of polypeptides. In this article we present a hybrid method for the 3-D protein structure prediction problem. An artificial neural network knowledge-based method that predicts approximated 3-D protein structures is combined with an ab initio strategy. Molecular dynamics (MD) simulation is used for the refinement of the approximated 3-D protein structures. In the refinement step, global interactions between each pair of atoms in the molecule (including non-bond interactions) are evaluated. The developed MD protocol enables us to correct polypeptide torsion angle deviations in the predicted structures and improve their stereo-chemical quality. The obtained results show that the time to predict native-like 3-D structures is considerably reduced. We test our computational strategy with four mini proteins whose sizes vary from 19 to 34 amino acid residues. The structures obtained at the end of 32.0 nanoseconds (ns) of MD simulation were topologically comparable to their corresponding experimental structures.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Currently, one of the main research problems in structural bioinformatics is associated with the study and prediction of protein 3-D structures. Proteins are long sequences of 20 different amino acid residues that in physiological conditions adopt a unique 3-D structure (Anfinsen, Haber, Sela, & White, 1961). To understand the functions of proteins at a molecular level it is often necessary to determine their three-dimensional structure (Branden & Tooze, 1998). Knowledge of the protein structure allows the investigation of biological processes more directly, with higher resolution and finer detail.

The 1990s genome projects resulted in a large increase in the number of protein sequences. However, the number of identified 3-D protein structures has not followed the same growth trend. Currently, the number of protein sequences is much higher than the number of known 3-D structures. If one compares the number of non-redundant protein sequences stored in GenBank (Benson, Karsch-Mizrachi, Lipman, Ostell, & Wheeler, 2009) against the number of 3-D protein structures with distinct folds stored in the Protein Data Bank (PDB) (Berman et al., 2000), we observe a large discrepancy. Clearly, there is a large gap between the number of protein sequences one can generate and the number of new protein folds one can determine by experimental methods such as X-ray diffraction and nuclear magnetic resonance (NMR). The determination of protein structure is both experimentally expensive and time consuming, which explains current efforts towards the development of computational strategies to predict the correct 3-D protein structure from extended or full amino acid sequences.

The prediction of the 3-D structure of polypeptides based only on their amino acid sequence (primary structure) is a problem that has, over the past 40 years, challenged computer scientists, biochemists, mathematicians and biologists. A number of computational methodologies, systems and algorithms have been proposed to address the protein structure prediction (PSP) problem. However, the problem remains challenging because of the complexity and high dimensionality of a protein conformational search space (Levinthal, 1968). The main challenge is to understand how the information encoded in the linear sequence of amino acid residues is translated into the 3-D structure, and from this acquired knowledge, to develop computational methodologies that can correctly predict the native structure of a protein molecule. There are four classes of computational methods for the PSP problem (Floudas, Fung, McAllister, Moennigmann, & Rajgaria, 2006): (I) first principle methods without database information (Osguthorpe, 2000); (II) first principle methods with database information (Rohl, Strauss, Misura, & Baker, 2004; Srinivasan & Rose, 1995); (III) fold recognition methods (Bryant & Altschul, 1995; Jones, Taylor, & Thornton, 1992; Turcotte, Muggleton, & Sternberg, 1998); and (IV) comparative modelling methods (Martì-Renom et al., 2000; Sánchez & Sali, 1997). Despite the progress in recent decades, these methodologies have limitations. Group IV can only predict structures of protein sequences which are similar or nearly identical to protein sequences with known structures. Group III is limited to the fold library derived from the PDB. Group I can obtain novel structures with new folds. However, the complexity and high dimensionality of the search space, even for a small protein molecule, make the problem intractable (an NP-complete problem) (Crescenzi, Goldman, Papadimitriou, Piccolboni, & Yannakakis, 1998; Hart & Istrail, 1997; Ngo, Marks, & Karplus, 1997). Recently, the methods belonging to group II have achieved the best results in their predictions in the CASP¹ (Critical Assessment of protein Structure Prediction) experiment. Several efforts towards the development of hybrid methods combining ab initio with knowledge-based methods have been made over the last 10 years (Robustelli, Cavalli, & Vendruscolo, 2008).

In this article, we investigate the use of classical molecular dynamics simulations (Hansson, Oostenbrink, & van Gunsteren, 2002; Karplus & McCammon, 2002; van Gunsteren & Berendsen, 1990), performed in explicit water, for the refinement of approximated structures of proteins generated by knowledge-based methods. Approximated 3-D structures are built using an artificial neural network strategy described previously (CReF: Dorn & Norberto de Souza, 2008; A3N: Dorn & Norberto de Souza, 2010). In the refinement step, global interactions between all atoms in the molecule (including, e.g., non-bond interactions) are evaluated and deviations in the polypeptide torsion angles are corrected (Fan & Mark, 2004). This in turn reduces the total time of ab initio methods (Bonneau & Baker, 2001; Hardin, Pogorelov, & Luthey-Schulten, 2002; Sternberg, Bates, Kelley, & MacCallum, 1999), which usually start from a fully extended conformation (Breda, Santos, Basso, & Norberto de Souza, 2007) of a polypeptide to fold a sequence of unknown structure. In cases of high sequence homology, the basic framework of the protein can normally be predicted with high accuracy. Nevertheless, errors still occur in variable loops, the relative orientations of secondary structure elements, and in the details of atomic packing. Even small errors in critical regions, however, are sufficient to prevent the use of models in sensitive applications such as rational drug design and the prediction of protein–protein interactions. Ab initio methods, when used as refinement steps in protein structure prediction tasks, are capable of correcting these errors (Dorn, Breda, & Norberto de Souza, 2008; Fan & Mark, 2004; Karplus & McCammon, 2002).

The remainder of the paper is structured as follows. Section 2 describes: (a) the knowledge-based method employed to predict approximate 3-D structures; (b) the model and target proteins; (c) the MD simulation protocol; and (d) the strategy employed for the structural analysis. Section 3 presents and discusses the results. Section 4 concludes and points out directions for further research.

¹ www.predictioncenter.org.

2. Materials and methods

2.1. Peptides, representation and prediction of approximate 3-D structures

A peptide is a molecule composed of two or more amino acid residues linked by a chemical bond called the peptide bond. This peptide bond is formed when the carboxyl group of one residue reacts with the amino group of the other residue, thereby releasing a water molecule (H₂O). A polypeptide can be represented by a set $X$ of atoms in 3-D space ($\mathbb{R}^3$) (Eq. 1).

$$X = [a_1, a_2, \ldots, a_n], \tag{1}$$

where $n$ is the total number of atoms in the molecule. The geometry of a polypeptide structure is described by assigning to each $i$th atom a 3-dimensional coordinate vector $\vec{a}_i$ (Eq. 2).

$$\vec{a}_i = (a_{i,x}, a_{i,y}, a_{i,z}) \tag{2}$$

Two atoms $\vec{a}_m$ and $\vec{a}_n$ that are joined by a chemical bond can be represented as a bond vector $\vec{s}$ (Eq. 3). The length of the bond vector can be computed with the Euclidean norm (Eq. 4).

$$\vec{s} = \vec{a}_m - \vec{a}_n \tag{3}$$

$$\|\vec{s}\| = \sqrt{(\vec{s}, \vec{s})} = \sqrt{s_x^2 + s_y^2 + s_z^2} \tag{4}$$

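For two adjacent bonds $\vec{a}_b$–$\vec{a}_c$ and $\vec{a}_d$–$\vec{a}_e$, we have the three bond vectors of Eq. (5), and the bond angle formed by the bonds between the atoms $\vec{a}_b$–$\vec{a}_c$–$\vec{a}_d$ can be computed from Eq. (6).

$$\vec{u} = \vec{a}_c - \vec{a}_b, \quad \vec{v} = \vec{a}_e - \vec{a}_d, \quad \vec{r} = \vec{a}_d - \vec{a}_c, \tag{5}$$

$$\sin\phi = \frac{\|\vec{u} \times \vec{r}\|}{\|\vec{u}\|\,\|\vec{r}\|}, \tag{6}$$

where $\vec{u} \times \vec{r}$ is determined by Eq. (7) (cross product in $\mathbb{R}^3$).

$$\vec{u} \times \vec{r} = \begin{pmatrix} u_y r_z - u_z r_y \\ u_z r_x - u_x r_z \\ u_x r_y - u_y r_x \end{pmatrix} \tag{7}$$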

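Similarly, the bond angle formed by the bonds between atoms $\vec{a}_c$–$\vec{a}_d$–$\vec{a}_e$ can be computed by Eq. (8).

$$\sin\psi = \frac{\|\vec{v} \times \vec{r}\|}{\|\vec{v}\|\,\|\vec{r}\|}, \tag{8}$$

where $\vec{v} \times \vec{r}$ is determined by Eq. (9).

$$\vec{v} \times \vec{r} = \begin{pmatrix} v_y r_z - v_z r_y \\ v_z r_x - v_x r_z \\ v_x r_y - v_y r_x \end{pmatrix} \tag{9}$$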

The angles around the polypeptide backbone bonds N–Cα (φ, Eq. 6) and Cα–C (ψ, Eq. 8) are mainly responsible for the specific conformation adopted by the backbone (Branden & Tooze, 1998; Lesk, 2002). Consecutive φ and ψ angles represent the internal rotations of a protein main chain.
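As an illustration of Eqs. (3)–(9), the following minimal Python sketch computes bond vectors, the Euclidean norm, the cross product and the sine of the angle between two bond vectors. The atom coordinates and the function names (`sub`, `norm`, `cross`, `sine_between`) are hypothetical and chosen only for this example; they are not part of the A3N implementation.

```python
import math

def sub(p, q):
    """Component-wise difference p - q of two 3-D points (cf. Eqs. 3 and 5)."""
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

def norm(s):
    """Euclidean norm of a 3-D vector (cf. Eq. 4)."""
    return math.sqrt(s[0] ** 2 + s[1] ** 2 + s[2] ** 2)

def cross(u, r):
    """Cross product u x r in R^3 (cf. Eqs. 7 and 9)."""
    return (u[1] * r[2] - u[2] * r[1],
            u[2] * r[0] - u[0] * r[2],
            u[0] * r[1] - u[1] * r[0])

def sine_between(u, r):
    """Sine of the angle between two vectors (cf. Eqs. 6 and 8)."""
    return norm(cross(u, r)) / (norm(u) * norm(r))

# Hypothetical coordinates for four consecutive atoms a_b, a_c, a_d, a_e.
a_b = (0.0, 0.0, 0.0)
a_c = (1.0, 0.0, 0.0)
a_d = (1.0, 1.0, 0.0)
a_e = (1.0, 1.0, 1.0)

u = sub(a_c, a_b)  # Eq. 5
r = sub(a_d, a_c)
v = sub(a_e, a_d)

sin_ur = sine_between(u, r)  # Eq. 6
sin_vr = sine_between(v, r)  # Eq. 8
```

Since the sine of an angle equals the sine of its supplement, the value returned by `sine_between` does not by itself determine the angle uniquely.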

Fig. 1. Secondary structure organization. (A) The 34 amino acid residues of 1ZDD; (B) the 18 amino acid residues of 1ALE; (C) the 29 amino acid residues of 1ARE; and (D) the 25 amino acid residues of 1A11, from N-terminus (left) to C-terminus (right). Helices, coils and strands are represented as colored boxes: helices in red, strands in yellow, and coils, turns and undefined regions in green. Undefined regions are represented by the code "c". The secondary structure analysis was performed by PROMOTIF (Hutchinson & Thornton, 1996). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

² Structural classification of proteins. scop.mrc-lmb.cam.ac.uk/scop (accessed Mar 01, 2012).

In this work we implement and use the A3N method (Dorn & Norberto de Souza, 2010) to acquire structural information from experimentally determined proteins and predict the φ and ψ torsion angles of the target protein backbone. A3N is a fragment-based method to predict approximate native-like protein structures only from the amino acid sequence of the target protein. A very short description of the A3N method is provided below; a complete description can be found in Dorn and Norberto de Souza (2010). The A3N method is composed of 10 main steps:

1. Generating amino acid n-grams: the target amino acid sequence is fragmented into consecutive amino acid fragments. Fragments with five, seven, nine and eleven amino acid residues are generated. For each target sequence a set $S_n = \{s_i, s_{i+1}, \ldots, s_p\}$ of n-grams is built, where $s_i$ and $s_p$ are the first and the last n-gram, respectively.

2. Collecting protein templates from the experimental database: a search procedure in the PDB (Berman et al., 2000) is performed for each target amino acid n-gram in order to identify structural templates. The search is performed using the BLASTp algorithm with a BLOSUM62 substitution matrix and an E-value cutoff equal to 10.0. At the end, for each n-gram $s_i \in S$, a set $s_i = \{t_1, t_2, \ldots, t_n\}$ of candidate templates is identified. All identified template proteins are downloaded from the PDB.

3. Generating secondary structure information from templates: the secondary structure information of each template $t_i$ obtained from the PDB is calculated using PROMOTIF (Hutchinson & Thornton, 1996). At the end of this stage, a library $T_{ss} = \{t_{ss_1}, t_{ss_2}, \ldots, t_{ss_m}\}$ of secondary structure fragments is obtained for each n-gram $s_i$ generated in step 1.

4. Generating φ and ψ torsion angles: only the information from the central amino acid residue of each template fragment is considered for analysis. In this step the values of the torsion angles φ and ψ are calculated for each template fragment $t_i$.

5. Statistical analysis of secondary structure data and secondary structure prediction: the secondary structure information obtained in step 3 is analyzed through a statistical function in order to predict the secondary structure of the target protein sequence.

6. Clustering torsion angles: the torsion angles φ and ψ calculated for each n-gram $s_i$ in step 4 are clustered. A clustering algorithm is applied in order to identify similar, correlated templates in specific regions of the Ramachandran plot (Ramachandran & Sasisekharan, 1968). Each Ramachandran region represents a class of conformational states. A classical K-means algorithm is used to cluster the protein templates.

7. Building classes and conformational patterns for each template fragment: a mapping function is used to create training patterns for each amino acid residue of the target sequence. A training pattern has the form $t_{ss_i} : w_j$, where $t_{ss_i}$ is the secondary structure information of a template fragment $t_i \in s_i$ and $w_j$ is the cluster/class, identified during the clustering, to which the template $t_i$ belongs.

8. Building and training ANNs: for each amino acid residue of the target sequence an MLP (multi-layer perceptron) (Haykin, 1998) artificial neural network is built in order to predict its torsion angles φ and ψ. The architecture of each neural network is composed of 1 input layer, 4 hidden layers and 1 output layer. Each hidden layer has 10 neurons. For the training phase, a learning rate of 0.02 and a maximum of 50,000 epochs are used. Initial weights are randomly generated.

9. Predicting φ and ψ torsion angles for each amino acid residue: after training each artificial neural network, the torsion angles are predicted for each amino acid residue of the target sequence. At the end of this stage a set of main-chain torsion angles (φ and ψ) is obtained.

10. Building the 3-D structures of the target sequence: from the set of torsion angles obtained in step 9, rotations in the main-chain polypeptide structure are performed using Eq. (6) for φ and Eq. (8) for ψ.
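The fragmentation of step 1 can be sketched in a few lines of Python. This is only an illustrative sketch, assuming that every contiguous window of each fragment size is kept; the function name `ngrams` is hypothetical and is not taken from the A3N implementation.

```python
def ngrams(sequence, sizes=(5, 7, 9, 11)):
    """Return every contiguous fragment of the given sizes, grouped by size."""
    fragments = []
    for n in sizes:
        # A sequence of length L contains L - n + 1 windows of length n.
        for start in range(len(sequence) - n + 1):
            fragments.append(sequence[start:start + n])
    return fragments

# 18-residue sequence of 1ALE as given in Section 3.1.
target = "ALDKLKEFGNTLEDKARE"
fragments = ngrams(target)
print(len(fragments))  # 44
```

Under this assumption an 18-residue sequence yields 14 + 12 + 10 + 8 = 44 fragments, which agrees with the number of n-grams reported for 1ALE in Section 3.1.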

2.2. Model and target proteins

The amino acid sequences of four mini proteins were obtained from the PDB (Berman et al., 2000) and used as case studies in our experiments: 1ZDD (Starovasnik, Braisted, & Wells, 1997) (Fig. 2(A)/cyan), 1ALE (Rozek, Buchko, & Cushley, 1995) (Fig. 2(B)/cyan), 1ARE (Hoffman, Horvath, & Klevit, 1997) (Fig. 2(C)/cyan) and 1A11 (Opella et al., 1999) (Fig. 2(D)/cyan). Fig. 1 presents the secondary structure organization of each of the tested proteins. Secondary structure analyses were performed with PROMOTIF (Hutchinson & Thornton, 1996). These case studies were selected in order to test our method with different classes of polypeptides with different folding patterns. The same case studies were presented in Dorn and Norberto de Souza (2010).

The polypeptide 1ZDD is a disulfide-stabilized mini protein composed of 34 amino acid residues (Fig. 1(A)), known to be arranged as two α-helices connected by a turn, a structural motif known as an α-helical hairpin. 1ZDD is classified by SCOP² (Murzin, Brenner, Hubbard, & Chothia, 1995) as a designed protein. 1ALE is a peptide (SCOP) composed of 18 amino acid residues (Fig. 1(B)) presenting only a regular α-helix structure. 1ARE is a small protein (SCOP) composed of 29 amino acid residues (Fig. 1(C)), characterized by the arrangement of one α-helix and two β-strands. 1A11 is a peptide (SCOP) composed of 25 amino acid residues (Fig. 1(D)).

2.3. Molecular dynamics simulations

We developed an MD protocol to refine the 3-D protein structures determined by the knowledge-based method described in Section 2.1. We start the refinement procedure using as input the approximated polypeptide chain generated by the knowledge-based method. We solvated the polypeptide in a water box using the GENBOX program from GROMACS, with a cubic box type and 0.9 Å as the minimum distance between the solute and the box. Energy minimization and MD simulation were performed with the MDRUN program of the GROMACS package (Hess, Kutzner, van der Spoel, & Lindahl, 2008; van der Spoel et al., 2005) using the GROMOS96FF (G43a1) united atom force field.³ Firstly, we submitted the approximated structure to an energy minimization step using PME (particle-mesh Ewald) electrostatics (Darden, York, & Pedersen, 2009; Toukmaji & Board, 1996) as the Coulomb type parameter and a van der Waals cut-off radius equal to 1.4 Å. The polypeptide was energy-minimized for 500 steps in order to relax any possible strains generated by the knowledge-based method (A3N). The energy minimization procedure ensured that we had a reasonable starting structure in terms of geometry and solvent orientation. In a second stage (here called the NVT stage), the minimized structure was submitted to an equilibration stage in order to equilibrate the solvent and ions around the protein. This step stabilized the temperature of the system. After the NVT stage, the pressure and the density of the system must be stabilized. The equilibration of pressure is conducted under an NPT ensemble, wherein the number of particles, pressure, and temperature are all constant.

Fig. 2. Ribbon representation of the experimental structure (cyan), the approximated 3-D structure predicted by A3N (magenta) and the final structure achieved after the refinement procedure by MD simulation (yellow). The Cα atoms of the experimental, approximated and refined structures are fitted. (A) PDB ID = 1ZDD. (B) PDB ID = 1ALE. (C) PDB ID = 1ARE. (D) PDB ID = 1A11. Amino acid side chains are not shown for clarity. Graphic representation was prepared with PYMOL. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

After the minimization and equilibration phases, the minimized structures were submitted to a 32 ns MD simulation at a temperature of 300.0 K with a 1.4 Å cut-off radius for the evaluation of the long-range van der Waals and electrostatic interactions. During the MD simulation all bonds were fixed as constraints. The GROMOS96FF united atom force field was used. The simulation was performed on a PC running Linux and GROMACS version 4.5.4. Simulations took about 600 h of processing time.
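To give a sense of the length of the production run, the short Python sketch below converts a simulation length into a number of integration steps. The integration time step is not stated in the text; the value of 2 fs used here is purely an assumption for illustration, as is the helper name `md_steps`.

```python
FS_PER_NS = 1_000_000  # femtoseconds per nanosecond

def md_steps(total_ns, timestep_fs):
    """Number of integration steps needed to cover total_ns at a given time step."""
    total_fs = total_ns * FS_PER_NS
    if total_fs % timestep_fs != 0:
        raise ValueError("simulation length is not a whole number of time steps")
    return total_fs // timestep_fs

TIMESTEP_FS = 2  # assumed value; the time step is not given in the text

steps = md_steps(32, TIMESTEP_FS)              # 32 ns production run
steps_per_snapshot = md_steps(1, TIMESTEP_FS)  # one snapshot per 1 ns
n_snapshots = steps // steps_per_snapshot
```

With the assumed 2 fs time step, the 32 ns run corresponds to 16,000,000 integration steps; collecting one snapshot per nanosecond gives 32 snapshots regardless of the time step chosen.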

2.4. Structural analysis

The quality of the predicted structures was evaluated by similarity comparisons with the structures of the experimental proteins obtained from the PDB (Eq. 10). Quality was measured in terms of the root mean square deviation (RMSD) between the positions of the Cα atoms of the predicted and the experimental structures. The RMSD measure was calculated using the PROFIT software (McLachlan, 1992). Stereo-chemical and secondary structure analyses were performed with PROCHECK (Laskowski, MacArthur, Moss, & Thornton, 1993) and PROMOTIF (Hutchinson & Thornton, 1996). All ribbon illustrations were generated with PYMOL.⁴

$$\mathrm{RMSD}(a, b) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \|r_{ai} - r_{bi}\|^2}, \tag{10}$$

where $r_{ai}$ and $r_{bi}$ are vectors representing the positions of the same atom $i$ in each of two structures, $a$ and $b$ respectively, and where the structures $a$ and $b$ are optimally superimposed.
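Eq. (10) can be illustrated with a minimal Python sketch. It assumes that the two coordinate sets are already optimally superimposed, as stated above, and performs no fitting; the function name `rmsd` and the toy coordinates are hypothetical, and this is not the PROFIT implementation.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) positions (Eq. 10).

    The structures are assumed to be superimposed beforehand.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the two positions of atom i.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: structure b is structure a shifted by 1 along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(a, b)
```

In the toy example every atom is displaced by exactly one unit, so the RMSD is 1.0.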

³ Biomolecular Simulation: The GROMOS96 Manual and User Guide. www.gromos.net.

⁴ pymol.org.

The convergence of the MD simulation protocol was monitored based on: (a) the total energy fluctuation during the simulation; (b) the RMSD (root mean square deviation) of each approximated polypeptide structure during the MD trajectory with respect to its experimental structure; and (c) the secondary structure and structural fluctuation of the target structure during the simulation.

3. Results and discussion

3.1. Building approximate 3-D structures for the target sequences

The target sequences of the case studies described in Section 2.2 were submitted to the A3N method and their approximate 3-D structures were predicted. A3N considers only templates from the PDB which have no evolutionary relationship with the target sequence K. Thus, all PDB templates identical or closely related (50% identity) to the target sequence, over their full length, are removed. All parameters of the A3N algorithm are kept constant for each case study. The clustering algorithm parameters are fixed as: seed = 100,000, maxIterations = 100, minStdDev = 10⁻⁶. The artificial neural network is composed of an input layer, four hidden layers with 10 neurons each, and an output layer with three neurons. The training parameters are the same for all case studies: learning rate = 0.02, maxepochs = 50,000, EBR (epochs between reports) = 100. Weights are selected randomly in the interval [0.01, 0.02].
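The network layout described above can be sketched in plain Python as follows. Only the layer sizes (four hidden layers of 10 neurons, three output neurons) and the weight interval [0.01, 0.02] come from the text; the number of inputs, the logistic activation, the absence of bias terms and the random seed are assumptions made for this illustration, and the function names are hypothetical.

```python
import math
import random

def init_weights(layer_sizes, low=0.01, high=0.02, seed=0):
    """One weight matrix per pair of consecutive layers, drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [[[rng.uniform(low, high) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(weights, inputs):
    """Propagate an input vector through all layers (logistic activation assumed)."""
    activations = list(inputs)
    for matrix in weights:
        activations = [
            1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, activations))))
            for row in matrix
        ]
    return activations

N_INPUTS = 5  # hypothetical; the input encoding is not specified in this section
layer_sizes = [N_INPUTS, 10, 10, 10, 10, 3]

weights = init_weights(layer_sizes)
outputs = forward(weights, [1.0] * N_INPUTS)
```

The sketch covers only initialization and a forward pass; training with the reported learning rate and number of epochs is not shown.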

The target sequence K = FNMQCQRRFYEALHPNLNEEQRNAKIKSIRDDC (Fig. 1(A)) of the protein with PDB ID = 1ZDD (Starovasnik et al., 1997) was fragmented into 84 short contiguous n-grams $s_i$. For each fragment $s_i$, PDB templates were searched using BLASTp. All PDB entries whose sequences were similar or identical to 1ZDD, namely 1ZDC, 1ZDD, 1L6X, 1OQO, 1OQX, 1ZDA, 1ZDB, 2SPZ, 1LP1, 1Q2N, 1FC2, 1BDC, 1BDD, 1SS1, 1DEE, 1EDK, 1EDJ, 1EDI and 1EDL, were removed. This eliminates any bias due to sequences of known structures very closely related to 1ZDD. Using the PDB templates, the φ and ψ pairs, and the secondary structure information, the approximated 3-D structure of 1ZDD was predicted (Fig. 2(A)/magenta). The protein with PDB ID = 1ALE (Rozek et al., 1995) is a peptide composed of 18 amino acid residues (Fig. 1(B)). Its target sequence K = ALDKLKEFGNTLEDKARE was fragmented into 44 n-grams $s_i$. For each fragment $s_i$ the PDB was searched for templates. Template PDB entries whose sequences were similar or identical to 1ALE were removed: 1ALE. The approximated 3-D structure of the peptide 1ALE was predicted (Fig. 2(B)/magenta).
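The removal of closely related templates can be illustrated with the following Python sketch. It assumes that sequences are already aligned and of equal length and that a template is discarded when its identity to the target reaches the 50% threshold; both assumptions, the function names and the template entries `tmplA`, `tmplB` and `tmplC` are hypothetical and do not reproduce the BLASTp-based procedure used by A3N.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)

def keep_unrelated(target, templates, threshold=0.5):
    """Keep only the templates whose identity to the target is below the threshold."""
    return {name: seq for name, seq in templates.items()
            if identity(target, seq) < threshold}

target = "ALDKLKEFGNTLEDKARE"  # 1ALE sequence
templates = {                   # hypothetical template sequences
    "tmplA": "ALDKLKEFGNTLEDKARE",  # identical to the target
    "tmplB": "ALDKLKEFGNAAAAAAAA",  # 11 of 18 positions identical
    "tmplC": "GGGGGGGGGGGGGGGGGG",  # 1 of 18 positions identical
}
kept = keep_unrelated(target, templates)
```

In this example only `tmplC` is retained, since the other two entries reach or exceed 50% identity with the target.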

The protein with PDB ID = 1ARE (Hoffman et al., 1997) is composed of 29 amino acid residues: K = RSFVCEVCTRAFARQEALKRHYRSHTNEK (Fig. 1(C)). Its sequence was fragmented into 88 n-grams $s_i$. The PDB was searched in order to identify templates. PDB entries whose sequences were identical to 1ARE were removed: 1ARE. The approximated 3-D structure (Fig. 2(C)/magenta) was predicted. The transmembrane helical fragment with PDB ID = 1A11, composed of 25 amino acid residues, K = GSEKMSTAISVLLAQAVFLLLTSQR (Fig. 1(D)), was fragmented into 72 short n-grams. The PDB was searched for templates for each fragment $s_i$. All PDB entries whose sequences were identical to 1A11 were removed: 1CEK, 1A11, 10ED, 1EQ8, 2BG9. The approximated 3-D structure was predicted (Fig. 2(D)/magenta).

Fig. 3. Ramachandran plots of the experimental, predicted and refined 3-D structures: (A.1) experimental 3-D structure of the protein with PDB ID 1ZDD; (A.2) approximated 3-D structure of 1ZDD predicted by A3N; (A.3) refined 3-D structure of 1ZDD; (B.1) experimental 3-D structure of 1ALE; (B.2) approximated 3-D structure of 1ALE predicted using A3N; (B.3) MD refined 3-D structure of 1ALE; (C.1) experimental 3-D structure of the protein with PDB ID 1ARE; (C.2) approximated 3-D structure of 1ARE; (C.3) MD refined 3-D structure of 1ARE; (D.1) experimental 3-D structure of 1A11; (D.2) approximated 3-D structure of 1A11 predicted by A3N; (D.3) MD refined structure of 1A11. Graphical representations were prepared with PROCHECK.

Table 1. Numerical Ramachandran plot values for the experimental structures (-E), the 3-D structures predicted by A3N (-A) and the refined 3-D structures (-R). Numerical values were generated by PROCHECK and are expressed in %. A: most favorable region; B: additional allowed region; C: generously allowed region; D: disallowed region.

Protein PDB ID      A       B      C      D
1ZDD-E            87.1    12.9    0.0    0.0
1ZDD-A            93.5     6.5    0.0    0.0
1ZDD-R            87.1     9.7    3.2    0.0
1ALE-E            93.3     6.7    0.0    0.0
1ALE-A           100.0     0.0    0.0    0.0
1ALE-R            93.3     6.7    0.0    0.0
1ARE-E            66.7    25.9    3.7    3.7
1ARE-A            81.5    18.5    0.0    0.0
1ARE-R            55.6    44.4    0.0    0.0
1A11-E            91.3     4.3    4.3    0.0
1A11-A            95.7     0.0    4.3    0.0
1A11-R            91.3     4.3    0.0    4.3

Table 2. Cα root mean square deviation (RMSD) of the approximated structures generated by A3N and of the refined structures with respect to their experimental structures. Refined structure refers to the structure obtained after 32 ns of MD simulation. RMSD values were calculated by PROFIT and are expressed in Angstroms (Å).

PDB ID   Region      Residues   A3N    Refined
1ZDD     Helix I     3–14       0.43   0.43
1ZDD     Coil II     15–18      0.35   0.71
1ZDD     Helix II    19–32      0.29   0.37
1ZDD     Full        3–32       1.67   1.43
1ALE     Helix I     3–16       0.50   0.55
1ALE     Full        3–16       0.50   0.55
1ARE     Strand I    3–5        0.04   0.44
1ARE     Coil II     6–9        0.70   0.61
1ARE     Strand II   10–12      0.43   0.43
1ARE     Coil III    13         0.0    0.0
1ARE     Helix I     14–23      1.65   1.91
1ARE     Coil IV     24–27      2.35   0.56
1ARE     Full        3–27       5.68   6.29
1A11     Full        3–23       1.22   1.27
1A11     Full        3–23       1.22   1.27

We ran PROCHECK in order to analyse the patterns of hydrogen bonds that define the secondary structure of the predicted structures. This analysis reveals that the secondary structures of the approximated structures predicted by A3N are comparable to those of their experimental structures. The predicted structure of 1ZDD presents 73.5% (against 73.5% in the experimental 3-D structure) of its amino acid residues in an α-helix state and 26.5% (against 26.5% in the experimental 3-D structure) in other, irregular structures. The predicted structure of 1ALE presents 83.3% (against 77.8% in the experimental 3-D structure) of its amino acid residues in an α-helix state and 16.7% (against 22.2% in the experimental structure) in other, irregular structures. The approximated 3-D structure of 1ARE presents 24.1% (against 27.6% in the experimental structure) of its amino acid residues in an α-helix conformational state, 10.3% (against 6.9% in the experimental structure) in a 3₁₀-helix state and the remaining 65.5% (against 65.5% in the experimental structure) in other, irregular structures. The predicted structure of 1A11 presents 80.0% (against 92.0% in the experimental 3-D structure) of its amino acid residues in an α-helix state and 20.0% (against 8.0% in the experimental 3-D structure) in irregular structures.
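Secondary structure percentages of the kind quoted in the text can be derived from a per-residue assignment string, as in the short Python sketch below. The assignment string is hypothetical (it is not the PROMOTIF output for any of the four proteins) and uses `H` for α-helix and `c` for all other states; the function name `composition` is likewise an assumption.

```python
from collections import Counter

def composition(ss_string):
    """Percentage of residues per secondary structure code, rounded to one decimal."""
    counts = Counter(ss_string)
    total = len(ss_string)
    return {code: round(100.0 * n / total, 1) for code, n in counts.items()}

# Hypothetical 34-residue assignment with 25 helical residues.
ss = "cc" + "H" * 12 + "cccc" + "H" * 13 + "ccc"
percentages = composition(ss)
print(percentages)  # {'c': 26.5, 'H': 73.5}
```

For a 34-residue chain, 25 helical residues correspond to 73.5% and the remaining 9 residues to 26.5%.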

The distribution of the amino acid residues in the Ramachandran plot and the stereo-chemical quality of the 3-D structures predicted by A3N were also analysed: 1ZDD (Fig. 3(A.2)), 1ALE (Fig. 3(B.2)), 1ARE (Fig. 3(C.2)) and 1A11 (Fig. 3(D.2)). We observe that in all of the predicted 3-D structures, the amino acid residues are located in the most favorable regions of the map (favorable or additional allowed region) (Table 1). When we compare the 3-D structures predicted by A3N against the experimental structures (Fig. 3(A.1) (1ZDD), Fig. 3(B.1) (1ALE), Fig. 3(C.1) (1ARE), and Fig. 3(D.1) (1A11)) we observe that these structures are similar in terms of stereo-chemical quality, with the presence of some bad contacts. Helical structures in the predicted structures are well formed and are similar to those in the experimental structures. Table 2, Column 4, presents the Cα root mean square deviation (RMSD) of the approximate 3-D structures predicted by A3N with respect to their experimental structures. We observe that for case studies 1ZDD, 1ALE, 1ARE and 1A11 we obtain accurate results (1.67 Å, 0.5 Å, 5.68 Å and 1.22 Å, respectively). The case study 1ARE presents a higher RMSD (5.68 Å). This result is somewhat expected, given that 1ARE has a more complex folding pattern than the other test cases.

3.2. Analysis of the MD simulation trajectories

The approximated 3-D structures predicted by A3N were used as starting structures in an MD simulation protocol (see Section 2.3). Polypeptides were solvated in a water box using the GENBOX program from GROMACS with a cubic box type and 0.9 Å as the minimum distance between the solute and the box. Each approximated 3-D polypeptide structure was energy-minimized for 500 steps in order to relax any possible strains generated by the A3N method. Energy minimization, equilibration and production phases of the MD simulations were performed. During the MD simulation all bonds were fixed as constraints. After the minimization and equilibration steps (temperature and pressure), each structure was submitted to a 32 ns MD simulation at a temperature of 300.0 K with a 1.4 Å cut-off radius for the evaluation of the long-range van der Waals and electrostatic interactions. The GROMOS96FF united atom force field was used. Snapshots were collected for analysis every 1 ns.

Fig. 4. Secondary structure representation and structural fluctuation during the refinement step. (A) The 34 amino acid residues of the protein PDB ID 1ZDD and the structural fluctuation along the 32 ns MD refinement step; (B) the 18 amino acid residues of the protein PDB ID 1ALE and the structural fluctuation along the 32 ns MD refinement step; (C) the 29 amino acid residues of the protein PDB ID 1ARE and the structural fluctuation along the 32 ns of the MD refinement step; and (D) the 25 amino acid residues of the protein PDB ID 1A11 and the structural fluctuation along the 32 ns of the MD refinement step. The analysis script was developed using the secondary structure information calculated by PROMOTIF (Hutchinson & Thornton, 1996) and the graphical representations were generated using GNUPLOT.

Fig. 5. RMSD fluctuation. (A) Cα root mean square deviation of the protein PDB ID 1ZDD through the 32 ns of the MD refinement step; (B) Cα RMSD of the protein PDB ID 1ALE along the MD simulation; (C) Cα root mean square deviation of the protein PDB ID 1ARE along the 32 ns of the MD refinement step; and (D) Cα root mean square deviation of the protein PDB ID 1A11 along the 32 ns of the MD refinement step. The RMSD analysis was performed by PROFIT. Graphical representations were prepared with GNUPLOT.

At the end of the 32 ns MD simulation, 1ZDD adopts a topology (Fig. 2(A)/Yellow) similar to its experimental 3-D structure (Fig. 2(A)/Cyan). Along the MD simulation trajectory the secondary structure patterns of each snapshot were analyzed using PROMOTIF. Fig. 4(A) presents the secondary structure fluctuation along the refinement of the approximated 3-D structure of 1ZDD. Along the refinement of 1ZDD the secondary structure elements remain partially constant. The secondary structure presents some changes from 16 ns until 25 ns in the amino acid residues that participate in coil I (S and t) of 1ZDD (Fig. 1(A)). After 26 ns the conformation remains stable. In terms of energy, the approximated 3-D structure of 1ZDD presents a total energy of -1.42061e+05 kJ/mol after the minimization procedure; after the equilibration and the 32 ns of the production phase, the final energy of the refined structure is -1.24377e+05 kJ/mol. Fig. 6(A) presents the total energy fluctuation of the refinement of the 3-D structure of 1ZDD along the MD simulation. As can be observed, the total energy of the approximated 3-D structure is minimized along the simulation. Errors in the atomic positions of the approximated structure are corrected along the simulation and bad contacts are eliminated (Table 1). In the same way, Fig. 5(A) illustrates the Cα RMSD fluctuation during the refinement step. Analyzing the total energy fluctuation graph and the RMSD fluctuation, it is possible to observe that the refinement procedure improves the quality of the approximated structure predicted by A3N. At the end of the refinement step 1ZDD presents a Cα RMSD of 1.43 Å with respect to the experimental structure, against 1.67 Å for the approximated structure predicted by A3N. The analysis of the amino acid residue distribution in the Ramachandran plot (Table 1 and Fig. 3(A)) reveals that the final, refined 3-D structure has a small number of bad contacts and is comparable to the experimental structure. PROCHECK analysis indicates that 81.7% of the amino acid residues are located in the most favorable regions of the map (favorable or additional allowed region). When we compare the refined predicted structure against the experimental structure, we observe that the structure was improved in terms of RMSD (Table 2) and reduction of bad contacts (Table 1). PROMOTIF reveals that the final refined 1ZDD structure presents 73.5% (against 73.5% for the experimental 3-D structure) of its amino acid residues in an α-helix conformation and 26.5% (against 26.5% for the experimental 3-D structure) of its residues in other structures such as coils and turns, and is identical in this respect to the experimental 1ZDD.
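The helix and non-helix percentages reported above follow directly from counting per-residue secondary structure codes. The sketch below illustrates that arithmetic for a 34-residue chain. The assignment string is hypothetical (it is not the PROMOTIF output for 1ZDD) and was chosen only so that 25 of the 34 residues are helical; the function name is also an assumption made for illustration.

```python
from collections import Counter
from typing import Dict


def secondary_structure_percentages(ss: str) -> Dict[str, float]:
    """Return the percentage of residues that are helical ('H') or anything else."""
    if not ss:
        raise ValueError("empty secondary-structure string")
    counts = Counter("helix" if code == "H" else "other" for code in ss)
    return {
        name: round(100.0 * counts[name] / len(ss), 1)
        for name in ("helix", "other")
    }


# HYPOTHETICAL 34-residue assignment (not the real 1ZDD data):
# two helices of 12 and 13 residues separated by coil/turn residues.
hypothetical_ss = "--" + "H" * 12 + "-TT-" + "H" * 13 + "---"

percentages = secondary_structure_percentages(hypothetical_ss)
print(percentages)
```

For a 34-residue chain, 25 helical residues give 73.5% and the remaining 9 give 26.5%, which matches the proportions quoted in the text.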

Fig. 6. Energy fluctuation along the 32-ns of the MD simulation. (A) Total energy fluctuation of the protein with PDB ID 1ZDD; (B) energy fluctuation of 1ALE; (C) 32-ns energyfluctuation of 1ARE and (D) energy fluctuation of the protein with PDB ID 1A11. Graphic representations were prepared with GNUPLOT.


In the second case study, 1ALE adopted a topology (Fig. 2(B)/Yellow) similar to its experimental 3-D structure (Fig. 2(B)/Cyan). Fig. 4(B) presents the secondary structure fluctuation along the refinement of the approximated 3-D structure. The main changes in the secondary structure along the MD simulation occur in residues of the N-terminal and C-terminal regions of the main chain. As expected, residues in these regions change their conformational states between "H" and the irregular structures "t" or "T". After the minimization and production phases, 1ALE presents a total energy of -1.41772e+05 kJ/mol and -1.23151e+05 kJ/mol, respectively. Fig. 6(B) presents the total energy fluctuation of the refinement of the 3-D structure of 1ALE along the MD simulation. Throughout the simulation, the total energy of the polypeptide remains stable. Fig. 5(B) illustrates the Cα RMSD fluctuation during the refinement step. At the end of the refinement step 1ALE presents a Cα RMSD of 0.55 Å with respect to the experimental structure, against 0.5 Å for the approximated structure predicted by A3N. The analysis of the amino acid residue distribution in the Ramachandran plot (Table 1 and Fig. 3(B)) reveals that the final refined 3-D structure has a small number of bad contacts. PROCHECK analysis indicates that 93.3% of the amino acid residues are located in the most favorable regions of the map (favorable or additional allowed region). This value is similar to that of the experimental structure.

The case study 1ARE presents a more complex folding pattern when compared with the 3-D structures of 1ZDD, 1ALE and 1A11. After the simulation procedure 1ARE adopts a topology (Fig. 2(C)/Yellow) fairly similar to its experimental 3-D structure (Fig. 2(C)/Cyan). Fig. 4(C) presents the secondary structure fluctuation along the refinement of the approximated 3-D structure of 1ARE. The main changes in the secondary structure occur up to 18 ns of the MD simulation. From 19 ns onward, changes in the polypeptide occur mainly in the irregular structures. In this case study, exceptionally, the RMSD increases throughout the refinement procedure. Fig. 5(C) illustrates the Cα RMSD fluctuation during the refinement step. At the end of the refinement step, 1ARE presents a Cα RMSD of 6.29 Å with respect to the experimental structure, against 5.68 Å for the approximated structure predicted by A3N (Table 2). The analysis of the amino acid residue distribution in the Ramachandran plot (Table 1 and Fig. 3(C)) reveals that the final refined 3-D structure has a small number of bad contacts. PROCHECK analysis indicates that 55.6% of the amino acid residues are located in the most favorable regions of the map (favorable or additional allowed region). After the minimization and production phases, 1ARE presents a total energy of -2.25045e+05 kJ/mol and -1.98329e+05 kJ/mol, respectively. Fig. 6(C) presents the total energy fluctuation of the refinement of the 3-D structure of 1ARE along the MD simulation.

After the MD simulation, 1A11 adopts a topology (Fig. 2(D)/Yellow) similar to its experimental 3-D structure (Fig. 2(D)/Cyan). Fig. 4(D) presents the secondary structure fluctuation along the refinement of the approximated 3-D structure. Fig. 5(D) illustrates the Cα RMSD fluctuation during the refinement step. At the end of the refinement step, 1A11 presents a Cα RMSD of 1.27 Å with respect to the experimental structure, against 1.22 Å for the approximated structure predicted by A3N (Table 2). The analysis of the amino acid residue distribution in the Ramachandran plot (Table 1 and Fig. 3(D)) reveals that the final refined 3-D structure has a small number of bad contacts. PROCHECK analysis indicates that 95.1% of the amino acid residues are located in the most favorable regions of the map (favorable or additional allowed region). After the minimization and production phases, 1A11 presents a total energy of -2.28649e+05 kJ/mol and -1.96428e+05 kJ/mol, respectively. Fig. 6(D) presents the total energy fluctuation of the refinement of the 3-D structure of 1A11 throughout the MD simulation.
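The Cα RMSD values quoted for the four case studies in this section can be placed side by side to see how each structure moved relative to its experimental reference during refinement. The sketch below uses only the numbers stated in the text (A3N prediction, then the structure after 32 ns of MD); the variable and function names are hypothetical. A negative change means the refined structure ended closer to the experimental one.

```python
# Cα RMSD (Å) with respect to the experimental structure, as quoted in Section 3.2:
# (approximated structure predicted by A3N, structure after 32 ns of MD refinement)
rmsd_by_case = {
    "1ZDD": (1.67, 1.43),
    "1ALE": (0.50, 0.55),
    "1ARE": (5.68, 6.29),
    "1A11": (1.22, 1.27),
}


def rmsd_change(before: float, after: float) -> float:
    """Change in RMSD after refinement, rounded to the precision of the inputs."""
    return round(after - before, 2)


changes = {pdb_id: rmsd_change(b, a) for pdb_id, (b, a) in rmsd_by_case.items()}
closer_after_refinement = [pdb_id for pdb_id, delta in changes.items() if delta < 0]

for pdb_id, delta in changes.items():
    print(f"{pdb_id}: {delta:+.2f} Å")
print("Closer to experiment after refinement:", closer_after_refinement)
```

Run on these quoted values, the change is -0.24 Å for 1ZDD, +0.05 Å for 1ALE and 1A11, and +0.61 Å for 1ARE. End-point RMSD is only one of the quality measures discussed here, alongside bad contacts and Ramachandran statistics (Table 1).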

4. Conclusion

In this paper, we introduced a molecular dynamics and knowledge-based computational strategy to predict the 3-D structure of polypeptides. We investigated the use of classical molecular dynamics simulations performed in explicit water to refine approximated structures of proteins generated by the A3N method. The obtained results reveal that in the refinement step deviations in the polypeptide structure are corrected, with improvements in terms of RMSD and a reduction of bad contacts. Errors in variable loops, in the relative orientations of secondary structure elements and in the atomic packing were corrected.

The proposed approach reduces the total time of ab initio methods, which usually start from a fully extended conformation (Breda et al., 2007). The overall contribution of this work is threefold: (I) the development of an MD protocol to refine approximate 3-D structures predicted by a knowledge-based method; (II) the successful integration of molecular dynamics simulations into the A3N method; and (III) the ab initio prediction of four mini proteins. This opens several interesting research avenues, with a range of applications in computational biology and bioinformatics. First, one could apply the developed method to other classes of proteins; second, one could test other clustering algorithms in order to improve the A3N predictions; third, one could test different protocols to refine approximate 3-D structures.

Acknowledgement

The authors thank MCT/CNPq and CAPES (Brazil) for the financial support.

References

Anfinsen, C., Haber, E., Sela, M., & White, F. H. Jr. (1961). The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proceedings of the National Academy of Sciences USA, 47, 1309–1314.

Benson, D., Karsch-Mizrachi, I., Lipman, D., Ostell, J., & Wheeler, D. (2009). GenBank. Nucleic Acids Research, 36, 25–30.

Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Research, 28(1), 235–242.

Bonneau, R., & Baker, D. (2001). Ab initio protein structure prediction: Progress and prospects. Annual Review of Biophysics and Biomolecular Structure, 30, 173–189.

Branden, C., & Tooze, J. (1998). Introduction to protein structure (2nd ed.). New York, USA: Garland Publishing Inc.

Breda, A., Santos, D., Basso, L., & Norberto de Souza, O. (2007). Ab initio 3-D structure prediction of an artificially designed three-α-helix bundle via all-atom molecular dynamics simulations. Genetics and Molecular Research, 6(2), 901–910.

Bryant, S. H., & Altschul, S. (1995). Statistics of sequence-structure threading. Current Opinion in Structural Biology, 5(2), 236–244.

Crescenzi, P., Goldman, D., Papadimitriou, C., Piccolboni, A., & Yannakakis, M. (1998). On the complexity of protein folding. Journal of Computational Biology, 5(3), 423–466.

Darden, T., York, D., & Pedersen, L. (2009). Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. The Journal of Chemical Physics, 98(12), 10089–10091.

Dorn, M., & Norberto de Souza, O. (2008). Cref: A central-residue-fragment-based method for predicting approximate 3-D polypeptides structures. In Proceedings of the 2008 ACM symposium on applied computing, Vila Gale in Fortaleza, Ceara, Brazil (pp. 1261–1267).

Dorn, M., Breda, A., & Norberto de Souza, O. (2008). A hybrid method for the protein structure prediction problem. Lecture Notes on Bioinformatics, 5167, 47–56.

Dorn, M., & Norberto de Souza, O. (2010). Mining the protein data bank with cref to predict approximate 3-D structures of polypeptides. International Journal of Data Mining and Bioinformatics, 4(3), 281–299.

Fan, H., & Mark, A. (2004). Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Science, 13(1), 211–220.

Floudas, C., Fung, H., McAllister, S., Moennigmann, M., & Rajgaria, R. (2006). Advances in protein structure prediction and de novo protein design: A review. Chemical Engineering Science, 61(3), 966–988.

Hansson, T., Oostenbrink, C., & van Gunsteren, W. (2002). Molecular dynamics simulations. Current Opinion in Structural Biology, 12, 190–196.

Hardin, C., Pogorelov, T., & Luthey-Schulten, Z. (2002). Ab initio protein structure prediction. Current Opinion in Structural Biology, 12, 176–181.

Hart, W., & Istrail, S. (1997). Robust proofs of NP-hardness for protein folding: General lattices and energy potentials. Journal of Computational Biology, 4(1), 1–22.

Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). New York, USA: Prentice Hall Inc.

Hess, B., Kutzner, C., van der Spoel, D., & Lindahl, E. (2008). Gromacs 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. Journal of Chemical Theory and Computation, 4(3), 435–447.

Hoffman, R., Horvath, S., & Klevit, R. (1997). Structures of DNA-binding mutant zinc finger domains: Implications for DNA binding. Protein Science, 2, 951–965.

Hutchinson, E., & Thornton, J. (1996). Promotif: A program to identify and analyze structural motifs in proteins. Protein Science, 5(2), 212–220.

Jones, D., Taylor, W., & Thornton, J. (1992). A new approach to protein fold recognition. Nature, 358(6381), 86–89.

Karplus, M., & McCammon, J. (2002). Molecular dynamics simulations of biomolecules. Nature Structural Biology, 9, 646–652.

Laskowski, R., MacArthur, M., Moss, D., & Thornton, J. (1993). Procheck: A program to check the stereochemical quality of protein structures. Journal of Applied Crystallography, 26(2), 283–291.

Lesk, A. M. (2002). Introduction to bioinformatics (1st ed.). New York, USA: Oxford University Press Inc.

Levinthal, C. (1968). Are there pathways for protein folding? Journal De Chimie Physique Et De Physico-Chimie Biologique, 65(1), 44–45.

Martì-Renom, M. A., Stuart, A., Fiser, A., Sanchez, A., Melo, F., & Sali, A. (2000). Comparative protein structure modeling of genes and genomes. Annual Review of Biophysics and Biomolecular Structure, 29, 291–325.

McLachlan, A. (1992). Rapid comparison of protein structures. Acta Crystallographica, A38, 871–873.

Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). Scop: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4), 536–540.

Ngo, J., Marks, J., & Karplus, M. (1997). The protein folding problem and tertiary structure prediction. In K. Merz, Jr. & S. Grand (Eds.), Computational complexity, protein structure prediction and the Levinthal Paradox (pp. 435–508). Boston, USA: Birkhauser.

Opella, S., Marassi, F., Gesell, J., Valente, A., Kim, Y., Oblatt-Montal, M., et al. (1999). Structures of the M2 channel-lining segments from nicotinic acetylcholine and NMDA receptors by NMR spectroscopy. Nature Structural Biology, 6(4), 279–374.

Osguthorpe, D. (2000). Ab initio protein folding. Current Opinion in Structural Biology, 10(2), 146–152.

Ramachandran, G., & Sasisekharan, V. (1968). Conformation of polypeptides and proteins. Advances in Protein Chemistry, 23, 238–438.

Robustelli, P., Cavalli, A., & Vendruscolo, M. (2008). Determination of protein structures in the solid state from NMR chemical shifts. Structure, 16, 1764–1769.

Rohl, C., Strauss, C., Misura, K., & Baker, D. (2004). Protein structure prediction using Rosetta. Methods in Enzymology, 383(2), 66–93.

Rozek, A., Buchko, G., & Cushley, R. (1995). Conformation of two peptides corresponding to human apolipoprotein C-I residues 7-24 and 35-53 in the presence of sodium dodecyl sulfate by CD and NMR spectroscopy. Biochemistry, 34, 7401–7408.

Sánchez, R., & Sali, A. (1997). Advances in comparative protein-structure modeling. Current Opinion in Structural Biology, 7(2), 206–214.

Srinivasan, R., & Rose, G. (1995). Linus – a hierarchic procedure to predict the fold of a protein. Proteins, 22(2), 81–99.

Starovasnik, M., Braisted, A., & Wells, J. (1997). Structural mimicry of a native protein by a minimized binding domain. Proceedings of the National Academy of Sciences USA, 94, 10080–10085.

Sternberg, M., Bates, P., Kelley, L., & MacCallum, R. (1999). Progress in protein structure prediction: Assessment of CASP3. Current Opinion in Structural Biology, 9, 368–373.

Toukmaji, A., & Board, J. J. (1996). Ewald summation techniques in perspective: A survey. Computer Physics Communications, 95(2), 73–92.

Turcotte, M., Muggleton, S., & Sternberg, M. (1998). Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure. In Proceedings of the international workshop on inductive logic programming (pp. 53–64).

van der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A., & Berendsen, H. (2005). Gromacs: Fast, flexible, and free. Journal of Computational Chemistry, 26(16), 1701–1718.

van Gunsteren, W., & Berendsen, H. (1990). Computer simulation of molecular dynamics: Methodology, applications, and perspectives in chemistry. Angewandte Chemie-International Edition in English, 29(9), 992–1023.
