

Genome Informatics 15(2): 112–120 (2004)

RNA 3D Structure Prediction: (1) Assessing RNA 3D Structure Similarity from 2D Structure Similarity

Jaime E. Barreda DC (1), Yoshimitsu Shigenobu (2), Eiichiro Ichiishi (3), Carlos A. Del Carpio M. (1, 3)

1 Department of Biochemistry and Pharmacology, Program of Biotechnological Engineering, The Catholic University of Santa Maria, Umacollo s/n, Arequipa, Peru

2 Department of Ecological Engineering, Toyohashi University of Technology

3 New Industry Creation Hatchery Center (NICHe), Tohoku University, Aoba, Sendai, Miyagi 980-8579, Japan

Abstract

Computational techniques for 3D structure prediction of proteins, the holy grail of bioinformatics, have undergone major developments in recent years, driven by international cooperation and by competition in contests such as CASP (Critical Assessment of Structure Prediction Techniques) that aim to improve and refine them. Straightforward extrapolation of these methodologies to the prediction of the 3D structures of other, similarly relevant biomacromolecules may not be compelling, mostly because of the intrinsic differences in constitution, nature, and function between them. Nevertheless, the conceptual framework underlying most of those techniques, applied to the development of similar computational techniques in structural biology, can lead to efficient systems for prediction of the 3D structure of other biomacromolecules. One such application is the development of rational methodologies to model RNA 3D structures from the sequence of nucleotides composing them. In this paper we establish the fundamentals of a methodology to thread a sequence of nucleotides into a set of 3D fragments extracted from a data base expressly developed for this purpose. The technique is based on a newly implemented algorithm for extraction of 3D fragments by comparison of secondary structures of RNA. The result is a highly efficient system to produce a set of fragments from which the entire RNA structure for the given nucleotide sequence can be built.

Keywords: RNA 3D structure, RNA 2D structure, structure prediction

1 Introduction

In recent years the authors have been involved in the development of a bioinformatic approach to the prediction of the secondary and tertiary structures of RNA molecules, using as sole information the sequence of nucleotides constituting them [10, 11]. In our system, the secondary structure elements are mapped into a hypothetical initial 3D model for a particular sequence using conventional techniques such as distance geometry and/or multidimensional minimization algorithms based on evolutionary programming. A further refinement process based on optimization of the internal energy of the molecules leads to 3D structural models in fair agreement with the native structures whenever they are available for comparison. Nevertheless, the system has been limited by the length of the nucleotide sequence, which seldom exceeds a dozen nucleotides.

These molecules are particularly relevant not only in several well-characterized stages of gene expression (messenger, transfer and ribosomal RNA) but also for their recently reported enzymatic and aptameric [1, 4] characteristics. This makes the RNA folding process most intriguing and gives the development of methodologies to elucidate their tertiary structures high priority in many biochemical and biomedical fields: adequate knowledge of RNA 3D structures paves the way to a rational analysis of the range of functions they express and attracts studies oriented to determining applications for these molecules in diverse fields of molecular biology and medicine.

Several methodologies have been reported to date for the determination of the secondary structure of RNA. These are mainly based on minimization of energy functions that sum terms expressing the stability of each kind of substructure in the molecule, such as stems, bulges, hairpins, internal loops, and multiple loops. Consequently, although energy functions may differ, the RNA 2D structure prediction problem is basically a sequence alignment problem. RNA 2D structure prediction methods can therefore be divided into two types of techniques. The first consists in searching for the optimal alignment of the sequence with itself using dynamic programming (DP). The second consists in generating all putative sets of stems in the structure and then selecting the optimal one. While the former methodology searches exhaustively for the optimal solution or a local minimum in its neighborhood, the latter searches for several different local-minimum solutions.

The first type of methodology is exemplified by the widely used MFOLD system (Zuker [13, 14]), which implements an energy function based on experimental thermodynamic parameters obtained by Tinoco and Salser [8]. Although recent improvements of the technique have allowed prediction of structures involving more complex substructures such as knots and pseudo-knots [2, 12], intrinsic difficulties remain, including those stemming from the basic assumption that the global minimum of the particular energy function translates into conformational stability for the structure. Furthermore, to our knowledge, no relationship has been drawn between the predicted secondary structure and the hypothetical tertiary structure it reflects. This remains an unsolved problem that we have attempted to approach here, given that, in spite of the difficulty in the experimental determination of RNA 3D structures due to their high flexibility and weakness, some dozens of tRNA structures have been elucidated experimentally and are available for studies of this nature.

The goal we pursue here is based on the assumption that similar secondary structures (bonding information of the secondary structures) reflect structures that also share similarities in tertiary structure. If this underlying assumption proves correct, then the problem of tertiary structure prediction for a sequence of nucleotides forming an RNA structure becomes readily approachable: it would suffice to search a data base of secondary structures with corresponding tertiary structures to predict the most native-like 3D structure for the query, or to build it from sets of fragments extracted from the data base in that way. Moreover, even if no exact match of secondary structures exists between the query structure and those in the data base, the most similar one can be selected to build the backbone of the tertiary structure, which could enhance the process of 3D prediction of a completely unknown structure.

Here, we have collected information on available RNA 3D structures and derived their respective 2D structures, constructed a data base, and developed a methodology to measure 2D structural information similarity. Subsequently we assess the ability to infer 3D structure similarity from the indices expressing secondary structure similarity. We report on these methodologies and their performance in computational experiments for RNA 3D structural prediction.

2 Method

2.1 Construction of a Relational RNA 2D-3D Structure Data Base (DB)

To establish similarity relationships among RNA secondary and tertiary structures, the analysis of large amounts of information about these structures becomes necessary. The data bases developed here are essentially of two types. The first is the data base of tertiary structures (3D-DB) for RNA extracted from the PDB (Brookhaven Protein Data Bank, 2002 issue), and the second is the data base of secondary structures (2D-DB) derived from the tertiary structures in the first data base.


Although RNAs can be categorized according to their functions as mentioned earlier, the data base constructed for the purposes of this study does not take into account any type of classification, and all the RNAs found were processed without distinction. In the 2002 issue of the PDB, we found more than 400 3D structures for different RNA molecules. The processed RNA structures can be categorized by the number of nucleotides in their sequences as shown in Fig. 1.

Figure 1: Distribution of RNA sequences in the 3D-DB.

The largest sequence corresponds to the RNA with PDB code izdk, having 2904 bases. Besides PDB-specific information for each molecule, each entry in the new data base also contains elements of secondary structure such as hairpins of 3, 4, 5, ..., 15 nucleotides, as well as bulges and internal loops, together with information on their position in the structure (Fig. 2).

Figure 2: 3D-DB information for a single entry.

The RNA 2D-DB was constructed in one-to-one correspondence with the 3D-DB. The rationale behind this correspondence was the need to compare RNA 3D structures and quantify their similarity.

Since the 3D structure of an RNA molecule is determined by hydrogen bonds among the four bases (A-U and G-C pairs), and considering that hydrogen atom positions are not always reported in the PDB, the translation from 3D bonding information to 2D bonding information was performed by computing the distances among nitrogen atoms in base pairs. The distance range for a hydrogen bond to exist was set to the interval from 2.1 Å to 3.7 Å. The procedure to determine the existence of a hydrogen bond consisted mainly of 3 steps: (i) extraction of RNA structures whose bonding information was known; (ii) computation of around a dozen distances among points at possible positions of base-pair nitrogen atoms; (iii) adoption of the maximum and minimum distances from step (ii) as the upper and lower limits for the bond length.

The information flow from 3D to 2D is illustrated in Fig. 3, where the original RNA 3D structure is depicted together with the derived secondary structure and the final connectivity matrix.

Figure 3: 2D structure representation and bonding information flow in 2D-DB construction.
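
The 3D-to-2D translation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `connectivity_matrix`, the input layout (one array of base nitrogen coordinates per nucleotide) and the rule "any nitrogen-nitrogen distance inside the range counts as a bond" are assumptions made for the example. Only the 2.1 Å to 3.7 Å window and the A-U / G-C pairing come from the text.

```python
import numpy as np

H_BOND_MIN = 2.1  # lower distance limit in angstroms, from the text
H_BOND_MAX = 3.7  # upper distance limit in angstroms, from the text

# Base pairs named in the text (A-U and G-C), in both orders.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def connectivity_matrix(sequence, nitrogen_coords):
    """Build a symmetric 0/1 connectivity matrix (hypothetical helper).

    sequence        -- string of bases, e.g. "GAC"
    nitrogen_coords -- one (k, 3) array per nucleotide with the coordinates
                       of its base nitrogen atoms (assumed input layout)
    """
    n = len(sequence)
    matrix = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if (sequence[i], sequence[j]) not in PAIRS:
                continue
            a = np.asarray(nitrogen_coords[i], dtype=float)
            b = np.asarray(nitrogen_coords[j], dtype=float)
            # All nitrogen-nitrogen distances between the two bases.
            dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
            # Assumption: one distance inside the window marks a bond.
            if np.any((dists >= H_BOND_MIN) & (dists <= H_BOND_MAX)):
                matrix[i, j] = matrix[j, i] = 1
    return matrix


# Toy input: G and C are 3.0 angstroms apart, A is far from both.
toy_coords = [np.array([[0.0, 0.0, 0.0]]),
              np.array([[20.0, 0.0, 0.0]]),
              np.array([[3.0, 0.0, 0.0]])]
toy_matrix = connectivity_matrix("GAC", toy_coords)
```

In a real pipeline the coordinates would be read from the ATOM records of a PDB entry; that parsing step is left out here.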

2.2 2D Structure Comparison: Optimization Algorithm

2.2.1 Matrix Superimposition

Superimposition of the connectivity matrices representing two RNA secondary structures would be a straightforward method to compare them. However, problems arise, as shown in Fig. 4, even when two almost identical structures, as is the case here, are superimposed: the connectivity information slides (rightmost matrix of Fig. 4), leading to wrong computations of structural similarity. To overcome this problem, we propose a methodology consisting of operations of information compression, smoothing, superimposition, frequency computation and scoring, as depicted in the flow diagram of Fig. 5.

000010000   000100000   000110000
000100000   001000000   001100000
000000001   010000010   010000011
010000010   100000100   110000110
100000100   000001000   100001100
000000000   000010000   000010000
000010000   000100000   000110000
000100000   001000000   001100000
001000000   000000000   001000000

Figure 4: Matrix superimposition.


Figure 5: 2D structure comparison process (Input 2D structure → Shrinking → Smoothing → Superimposing → Output 2D structures with high similarity).

Compressing (or shrinking) the information of the connectivity matrix representing the secondary structure of an RNA molecule has the objective of untangling and simplifying this information. If the original 2D connectivity matrix for an RNA molecule is represented by oriMTR(i, j), the contracted matrix, traMTR(i, j), is derived by the following calculation:

$$\mathrm{traMTR}(i, j) = \sum_{i=1}^{\mathrm{DEV}} \sum_{j=1}^{\mathrm{DEV}} \mathrm{oriMTR}(i, j) \qquad (1)$$

where DEV is a parameter that controls the compression scale and varies here from 2 to 4. The effect of the compression on the 2D connectivity information is exemplified in Fig. 6.
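
A minimal sketch of the compression step, under one reading of Eq. 1: each DEV × DEV block of the original matrix is summed into a single cell of the contracted matrix. The function name `compress`, the zero padding for sizes that are not a multiple of DEV, and the final `np.triu` step are assumptions of this example. The 9 × 9 input is the matrix printed in Fig. 6, and keeping only the upper triangle reproduces the digits printed there for DEV = 3 (021, 002, 000).

```python
import numpy as np


def compress(ori_mtr, dev):
    """Sum each dev x dev block of a square 0/1 matrix (one reading of Eq. 1)."""
    ori = np.asarray(ori_mtr, dtype=int)
    n = ori.shape[0]
    pad = (-n) % dev  # assumption: zero-pad up to a multiple of dev
    ori = np.pad(ori, ((0, pad), (0, pad)))
    m = ori.shape[0] // dev
    # Axis layout after reshape: [block row, row in block, block col, col in block]
    return ori.reshape(m, dev, m, dev).sum(axis=(1, 3))


# Connectivity matrix printed in Fig. 6, one string per row.
FIG6_ROWS = [
    "000010000", "000100000", "000000001",
    "010000010", "100000100", "000000000",
    "000010000", "000100000", "001000000",
]
ori = np.array([[int(ch) for ch in row] for row in FIG6_ROWS])

blocks = compress(ori, 3)  # full, symmetric block sums
upper = np.triu(blocks)    # upper triangle only, as printed in Fig. 6
```

Because the connectivity matrix is symmetric, the lower triangle of `blocks` carries no extra information, which may be why the figure shows only the upper part.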

The smoothing operation consists in gradation or shading of the information, in a way similar to that used in image processing. The objective is to shade off some information when comparing two connectivity matrices, which may lead to their optimal superimposition. Here we use the concept of a filtering matrix (W) which, applied to the original matrix (oriMTR(i, j)), yields the smoothed matrix traMTR(i, j) (Eqs. 2, 3).

000010000
000100000
000000001
010000010        021
100000100   →    002
000000000        000
000010000
000100000
001000000

For example: DEV = 3

Figure 6: Effect of information compression on the connectivity matrix of an RNA 2D structure.

$$W = C \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \qquad (2)$$

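$$\begin{aligned} \mathrm{traMTR}(i, j) = C \{ & a_{11}\,\mathrm{oriMTR}(i-1, j-1) + a_{12}\,\mathrm{oriMTR}(i-1, j) + a_{13}\,\mathrm{oriMTR}(i-1, j+1) \\ {} + {} & a_{21}\,\mathrm{oriMTR}(i, j-1) + a_{22}\,\mathrm{oriMTR}(i, j) + a_{23}\,\mathrm{oriMTR}(i, j+1) \\ {} + {} & a_{31}\,\mathrm{oriMTR}(i+1, j-1) + a_{32}\,\mathrm{oriMTR}(i+1, j) + a_{33}\,\mathrm{oriMTR}(i+1, j+1) \} \end{aligned} \qquad (3)$$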

The effect of the smoothing process is illustrated in Fig. 7.

Figure 7: Smoothing process (regions labelled in the figure: 0 zone, connectivity zone).
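
The filtering of Eqs. 2 and 3 can be sketched as a 3 × 3 weighted neighbourhood sum. The paper does not give values for the coefficients a11 ... a33 or for C, so the uniform weights and C = 1/9 used below are purely illustrative, and treating cells outside the matrix as zero is an assumption of this example. The function name `smooth` is hypothetical.

```python
import numpy as np


def smooth(ori_mtr, weights, c):
    """Apply the 3 x 3 filter W = c * weights term by term, as in Eq. 3."""
    ori = np.asarray(ori_mtr, dtype=float)
    a = np.asarray(weights, dtype=float)
    n_rows, n_cols = ori.shape
    padded = np.pad(ori, 1)  # assumption: zeros outside the matrix
    tra = np.zeros_like(ori)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # a[di + 1, dj + 1] multiplies oriMTR(i + di, j + dj)
            shifted = padded[1 + di:1 + di + n_rows, 1 + dj:1 + dj + n_cols]
            tra += a[di + 1, dj + 1] * shifted
    return c * tra


# A single connectivity bit is spread over its 3 x 3 neighbourhood.
single = np.zeros((3, 3))
single[1, 1] = 1.0
spread = smooth(single, np.ones((3, 3)), 1.0 / 9.0)
```

With these illustrative weights every cell of `spread` equals 1/9, which is the shading effect the text describes: a sharp bit becomes a soft patch that tolerates small shifts between two matrices.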

2.2.2 RNA 2D Structure Similarity Computation by Matrix Comparison

The operations described in the previous section lead to a rapid and optimal superimposition of two 2D connectivity matrices, and consequently to an optimal computation of structural similarities. The similarity score for two superimposed 2D connectivity matrices is computed as

$$\mathrm{Score} = \frac{\sum \mathrm{Matchbit} - \sum \mathrm{notbit}}{\sum \mathrm{Allbit}} \times 100, \qquad \mathrm{Allbit} > 0 \qquad (4)$$

where Allbit is the number of all 1's in the information matrix of the query, Matchbit is the number of bits that match when superimposing the matrices of the query and a DB structure, and notbit is the number of bits (1's) that do not match on superimposition. Since DB matrices and query matrices are usually of different sizes, two procedures for comparing the matrices have been implemented. When the query matrix is smaller than the matrices in the DB, all possible superpositions are computed by sliding the query matrix along the diagonal of the DB matrix, the best superposition being that with the best Score computed by Eq. 4. Conversely, when the query matrix is larger than the matrices in the DB, the best score is calculated by sliding the DB matrices over the query matrix.
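
The scoring and diagonal sliding can be sketched as below for the case in which the query is no larger than the DB matrix. Several points are assumptions of this example rather than statements from the paper: the matrices are treated as plain 0/1 matrices (no smoothing), notbit is taken to be the 1's of the query that find no partner in the DB window, and the names `score` and `best_diagonal_score` are hypothetical. The returned offset plays the role of the match location along the diagonal.

```python
import numpy as np


def score(query, window):
    """Eq. 4 for two equally sized 0/1 matrices."""
    query = np.asarray(query, dtype=bool)
    window = np.asarray(window, dtype=bool)
    all_bit = int(query.sum())
    if all_bit == 0:
        raise ValueError("Eq. 4 requires Allbit > 0")
    match_bit = int((query & window).sum())
    not_bit = all_bit - match_bit  # assumption: unmatched 1's of the query
    return (match_bit - not_bit) / all_bit * 100.0


def best_diagonal_score(query, db):
    """Slide the query along the diagonal of a DB matrix; keep the best Score."""
    query = np.asarray(query)
    db = np.asarray(db)
    n = query.shape[0]
    if n > db.shape[0]:
        raise ValueError("this sketch only covers query size <= DB size")
    best_score, best_offset = float("-inf"), -1
    for offset in range(db.shape[0] - n + 1):
        window = db[offset:offset + n, offset:offset + n]
        current = score(query, window)
        if current > best_score:
            best_score, best_offset = current, offset
    return best_score, best_offset


# Toy case: a 3-base hairpin-like pattern embedded at offset 2 of a 5 x 5 matrix.
query = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
db = np.zeros((5, 5), dtype=int)
db[2, 4] = db[4, 2] = 1
result = best_diagonal_score(query, db)  # -> (100.0, 2)
```

Under these assumptions the score ranges from -100 (no query bit matched) to 100 (all query bits matched). The opposite case, a query larger than the DB matrix, would slide the DB matrix over the query in the same way.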

2.2.3 RNA Structure Similarity Evaluation

To test the effectiveness of the “Score” value in identifying similar structures, RNA 3D structures corresponding to 2D structures deemed similar by the methodology were compared by superimposing them. To estimate 3D similarity, the RMS value was computed after superimposing the 3D structures. Although high resolution in the comparison can be achieved by computing RMS values over all the atoms in the RNA structure, here we are most interested in the folding of the RNA molecule; thus, the P backbones of the molecules were compared considering only the atoms constituting those backbones. To fit two 3D RNA structures optimally, the 3-point technique was applied to all possible pairs of atoms in the structures. This technique consists in selecting 2 points (atoms in the structures) whose direction vectors are positioned on the axes of a 3D coordinate system. A third point is then selected and the structures are rotated so as to make the positions of the third point in both structures coincide in space. Repeating this process for pairs of backbone atoms in the structures yields the best superimposition of the backbones, indicated by a low RMS value.
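
For readers who want to reproduce an RMS value after superimposition, the sketch below uses the Kabsch algorithm (an SVD-based least-squares fit) in place of the 3-point technique described above. It is a substitute, not the authors' procedure. It assumes that both structures are given as equally long, already paired lists of backbone P-atom coordinates; the function name `superimposed_rmsd` is hypothetical.

```python
import numpy as np


def superimposed_rmsd(p, q):
    """RMS deviation of two paired (n, 3) coordinate sets after optimal fit."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("both structures need the same number of atoms")
    # Remove the translation by centring both sets on their centroids.
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(p_c.T @ q_c)
    d = np.sign(np.linalg.det(u @ vt))  # guard against improper rotations
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = p_c @ rot - q_c
    return float(np.sqrt((diff ** 2).sum() / len(p)))


# Toy check: a rotated and shifted copy superimposes with RMS close to zero.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = coords @ rot_z.T + np.array([5.0, -2.0, 1.0])
rmsd_same = superimposed_rmsd(coords, moved)
```

When the two structures differ in length, as for a short query matched inside a large DB entry, the atoms to compare would first have to be selected from the matched region; that pairing step is outside this sketch.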

3 Results and Conclusions

The methodology described here was evaluated on sets of structures selected at random. One such set, presented here for explanatory purposes, consists of the structures shown in Table 1, where the structures are ordered by sequence length, successive entries differing by approximately 10 nucleotides.

The evaluation procedure consisted of (i) input of the 2D information for the structures; (ii) calculation of the “Score” similarity value of the input structure against the structures in the 2D DB, and ranking of the latter according to this value; (iii) calculation of the RMS value for the 10 best structures extracted by 2D structural similarity.

Since the evaluation set is extracted from the DB constructed as described above, the structure with the highest similarity (100%) is always retrieved first among the 10 top-ranked structures whose 3D configurations are compared.

Table 1: Set of structures for evaluation (see text).

No.  Name  Number of bases  Structural feature
1    1kis  15               HAIRPIN
2    1f6x  26               M1 RNA
3    1fmn  35               APTAMER
4    1ili  45               E. coli 4.5S RNA
5    1mms  57               L11 RNA
6    1h4q  65               PROLYL-TRNA
7    1yfg  76               INITIATOR TRNA
8    5tra  84               SER TRNA

Therefore, the results shown here skip the 3D structure superposition of the input structure with the first-ranked structure, although it is listed in each result table. Fig. 8 shows the table of the top 10 structures extracted by 2D similarity for the target RNA with PDB code 1fmm. The entries of the table are the name, the score value, the number of bases of the structure in the DB, the location of the match (locate), and the match in terms of bases (base match) and in terms of the RMS (rms-match). The lower half of Fig. 8 depicts the superimposition of the 3 best structures (dark) with the query (light).

Input file name: 1fmm, Number of bases: 35

No.   Name  Score  Base Num  locate  RMS    Base match  Ms match
  1   1fmn  100    35        0       0.00   100.00      100.00
* 2   1gid  76     158       117     6.81   28.57       55.88
* 3   1hr2  73     156       114     7.36   20.00       23.53
  4   1ibk  73     1505      1407    6.81   22.86       57.14
  5   1a9l  72     37        0       12.27  31.43       20.59
  6   1f7y  72     56        15      7.40   57.14       48.57
  7   1fjf  72     1506      393     7.03   40.00       37.14
  8   1ibm  72     1506      393     7.04   40.00       34.29
  9   2a9l  72     37        0       11.07  31.43       8.82
  10  1c2w  71     2898      528     11.48  40.00       28.57

(Structure panels: No.2, No.3, No.4)

Figure 8: Top 10 structures similar to 1fmn (top). Superimposition of the query with the three most similar structures (bottom).


4 Discussion

A further evaluation was performed of the relationship between the 2-dimensional similarity index, or “Score”, and the RMS value computed for the superimposed 3D structures of the sequences deemed similar by the algorithm. The results for the evaluation set are shown in Table 2. Limiting the discussion to the structures in this random set, a plot of these values is shown in Fig. 9.

Table 2: Best results for the evaluation set (see text).

Target RNA  PDB code  Score  Base Num  Locate  RMS    Base match  Ms match
1fmn        1gid      76     158       117     6.81   28.57       55.88
1kis        1atv      92     16        0       2.91   26.67       92.86
1f6x        1f6z      100    26        0       2.26   96.15       100.00
1fmn        1gid      76     158       117     6.81   28.57       55.88
1ili        1gid      77     158       111     10.39  20.00       15.56
1mms        1c2w      91     2898      1050    0.62   85.96       100.00
1h4q        1h4s      96     66        0       0.18   98.46       100.00
1yfg        1gax      89     75        0       3.73   56.00       88.00
5tra        1efw      15     68        0       12.64  33.82       16.42

Points in the plot represent pairs of Score-RMS values for the 10 structures most similar to each structure in the random set.

Figure 9: Relationship between the 2D similarity index “Score” and RMS values.

This evaluation leads to the conclusion that prediction values tend to be smaller as values for “Score” increase and RMS values decrease, for low “Score” values.

However, predictive values (PV) are fairly good for “Score” values over 60 and RMS values under 10 Å. This is evident in Fig. 9, where there is a high concentration of points in the lower right part of the plot. The significance of this fact is that similarity values of 60 or more at the level of 2D structural information lead to fairly similar structures in 3D. In other words, when the secondary structures are highly similar (scores over 60), 3D structures sharing similar conformational characteristics can be extracted from the DB. This is especially evident for small structures like 1kis, 1f6x and 1fmn (15, 26 and 35 bases), for which the RMS values are remarkably small. A similar structure for 5tra (number of bases 85) could not be found, however, and the RMS value is consequently large. One of the causes of this result is that there were still relatively few structures with more than 85 bases in the DB at the time of the experiment.
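
The thresholds quoted above can be checked against the best hit per target listed in Table 2. The tally below is only an illustration over those eight rows (the repeated 1fmn row is entered once); it is not the full set of top-10 points plotted in Fig. 9, and the variable names are hypothetical.

```python
# (target, best hit, Score, RMS in angstroms), copied from Table 2.
TABLE_2 = [
    ("1kis", "1atv", 92, 2.91),
    ("1f6x", "1f6z", 100, 2.26),
    ("1fmn", "1gid", 76, 6.81),
    ("1ili", "1gid", 77, 10.39),
    ("1mms", "1c2w", 91, 0.62),
    ("1h4q", "1h4s", 96, 0.18),
    ("1yfg", "1gax", 89, 3.73),
    ("5tra", "1efw", 15, 12.64),
]

SCORE_CUTOFF = 60   # "Score" threshold quoted in the discussion
RMS_CUTOFF = 10.0   # RMS threshold quoted in the discussion

# Hits whose 2D similarity exceeds the cutoff, and those that are also close in 3D.
high = [row for row in TABLE_2 if row[2] > SCORE_CUTOFF]
close = [row for row in high if row[3] < RMS_CUTOFF]

print(f"{len(close)} of {len(high)} hits with Score > {SCORE_CUTOFF} "
      f"have RMS < {RMS_CUTOFF} angstroms")
```

For these rows, 6 of the 7 hits with a Score above 60 fall under 10 Å; the exception is 1ili (RMS 10.39), and the only low-Score hit, for 5tra, also has the largest RMS.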

Another characteristic of the methodology can be exemplified by the structure with PDB code 1mms (57 bases long). The similar structures ranked 1 to 3 share no composition or any other kind of common characteristic with the target, yet comparison of the three-dimensional structures shows that they are very similar. This supports our initial assumption that structures with different composition and characteristics but similar secondary structure can be similar in three-dimensional structure.

This is very important when trying to predict the structures of huge RNA molecules for which similar structures in data bases are difficult to find, since the structure could be built from structure fragments extracted from data bases using the methodology proposed here.

Several methods are being developed for prediction of the structures of biomolecules, and of RNA in particular [5, 6, 7, 9]. Assessing the similarity of structures at the secondary and tertiary levels is a problem common to all these techniques; here we present a method that can be used in a versatile way to overcome it.

References

[1] Cech, T.R., Zaug, A.J., and Grabowski, P.J., In vitro splicing of the ribosomal RNA precursor of Tetrahymena: Involvement of a guanosine nucleotide in the excision of the intervening sequence, Cell, 27:487–496, 1981.

[2] Dam, E., Pleij, K., and Draper, D., Structural and functional aspects of RNA pseudoknots, Biochemistry, 31:11665–11676, 1992.

[3] Hubbard, J.M. and Hearst, J.E., Predicting the three-dimensional folding of transfer RNA with a computer modeling protocol, Biochemistry, 30:5458–5465, 1991.

[4] Kruger, K., Grabowski, P.J., Zaug, A.J., Sands, J., Gottschling, D.E., and Cech, T.R., Self-splicing RNA: Autoexcision and autocyclisation of the ribosomal RNA intervening sequence of Tetrahymena, Cell, 31:147–157, 1982.

[5] Leclerc, F., Srinivasan, J., and Cedergren, R., Predicting RNA structures: The model of the RNA element binding Rev meets the NMR structure, Fold. Des., 2:141–147, 1997.

[6] Major, F., Turcotte, M., Gautheret, D., Lapalme, G., Fillion, E., and Cedergren, R., The combination of symbolic and numerical computation for three-dimensional modeling of RNA, Science, 253:1225–1260, 1991.

[7] Ogata, H., Akiyama, Y., and Kanehisa, M., A genetic algorithm based molecular modeling technique for RNA stem-loop structures, Nucleic Acids Res., 23(3):419–426, 1995.

[8] Puglisi, J.D., Wyatt, J.R., and Tinoco, I., A pseudoknotted RNA oligonucleotide, Nature, 331:283–286, 1988.

[9] Shapiro, B.A. and Kasprzak, W., STRUCTURELAB: A heterogenous bioinformatics system for RNA structure analysis, J. Mol. Graph., 14:194–205, 1996.

[10] Shigenobu, Y. and Del Carpio, C.A., A bioinformatic approach for RNA 3D structure prediction: Development of a knowledge-base for 2D-to-3D structural elements compatibility analysis, Genome Informatics, 12:360–361, 2001.

[11] Shigenobu, Y. and Del Carpio, C.A., Development of a bioinformatic system for determination of the 3D structure of RNA from secondary structure constrains, Genome Informatics, 11:305–306, 2000.

[12] Yamaguchi, K. and Del Carpio, C.A., A genetic programming based system for the prediction of secondary and tertiary structures of RNA, Genome Informatics, 9:382–383, 1998.

[13] Zuker, M., On finding all suboptimal foldings of an RNA molecule, Science, 48–52, 1989.

[14] Zuker, M., Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information, Nucleic Acids Res., 9:133–148, 1981.