

CASSERT: A Two-Phase Alignment Algorithm

for Matching 3D Structures of Proteins

Dariusz Mrozek and Bożena Małysiak-Mrozek

Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland

{dariusz.mrozek,bozena.malysiak-mrozek}@polsl.pl

http://zti.polsl.pl/dmrozek

Abstract. Protein structure alignment allows the assessment of protein similarities and leads to knowledge of the nature of proteins themselves. In this paper, we present a new version of the two-phase alignment algorithm for matching protein structures, called CASSERT. The algorithm can be used to scan databases of protein structures when searching for protein similarities. The effectiveness of CASSERT was studied by comparing its results to those returned by the DALI algorithm. The tests confirm that the CASSERT algorithm is highly effective in protein structure similarity searching and can be a useful tool in the identification of proteins and their functions.

Key words: structural bioinformatics, alignment, protein structure, similarity, structure matching

1 Introduction

Protein structure similarity searching is one of the key, but most difficult, tasks of modern structural bioinformatics [3]. While searching for similarities based on the amino acid sequence (primary structure [2]) is often carried out by using operations on strings, protein structure comparison is more problematic due to the complex construction of proteins at the molecular level. If we assume that an average-sized protein is made up of several hundred amino acids, and each amino acid is made up of several atoms, then comparing even a single pair of protein structures is a challenge. If we also want to compare the structure of a protein with an entire database of proteins, for example, to compare mutant structures to each other, then the task becomes even more complicated, given the increasing number of protein structures in databases such as the Protein Data Bank (PDB) [1].

However, protein structure similarity searching is a very important task for modern structural bioinformatics. Based on information about similar protein structures, we can draw conclusions about the common ancestry of organisms and thus study the evolution of organisms over millions of years. The analysis of protein structures by their comparison allows us to search for substitutes for biological molecules critical for certain cellular processes, whose lack or inadequate design can cause dysfunction of the body or serious diseases.

This is an author's version of the paper.

Original version in: A. Kwiecień, P. Gaj, and P. Stera (Eds.): CN 2013, CCIS 370, pp. 334-343, 2013

(C) Springer-Verlag Berlin Heidelberg 2013

The final publication is available at link.springer.com: DOI: 10.1007/978-3-642-38865-1_34

In this paper we present a new version of the two-phase alignment algorithm [12] for matching protein 3D structures, called CASSERT. The name of the algorithm is an acronym formed from the words defining the representative features of protein structures taken into account in the comparison process: Cα atom (C), angle defined by vectors between successive Cα atoms (A), Secondary Structure Element (SSE), and Residue Type (RT). The presented algorithm can be used in protein structure similarity searching. We also present the tests that we performed in order to examine the effectiveness and efficiency of the algorithm.

2 Related Works

Several algorithms for protein structure similarity searching have been developed in the last two decades, including VAST [7], DALI [8], [9], LOCK2 [20], FATCAT [24], ClusCo [10], CTSS [4], CE [21], DEDAL [6], RAPIDO [17], FAST [26], MICAN [15] and others [19], [25]. Given the complexity of protein structures, existing algorithms use various representations of these structures in the similarity searching process.

For example, the CTSS [4] algorithm includes local geometric features and selected biological characteristics. For each residue in a protein structure, the algorithm calculates shape signatures based on Cα atom positions, torsional angles, and the type of the secondary structure. The DALI algorithm [8], [9] makes use of distance matrices in the comparison process. These matrices are built for each of the compared proteins. Each cell of the distance matrix stores the distance between the Cα atoms of amino acids i and j in the protein. Distance matrices are then decomposed into so-called contact patterns, which are fragments of 6×6 elements of the matrix, and compared to find the best match. On the other hand, the VAST algorithm [7] uses secondary structure elements (SSEs) forming the cores of the compared proteins (α-helices and β-sheets). SSEs are then mapped to representative vectors, which simplifies the analysis process. During the comparison, the algorithm attempts to match a set of vectors for pairs of protein structures. The SSE representation of protein structures is also used in the comparison method applied in LOCK2 [20]. The CE [21] algorithm uses the combinatorial extension of an alignment path formed by aligned fragment pairs (AFPs). AFPs are fragments of both structures that indicate a clear structural similarity and are described by local geometrical features, including the positions of Cα atoms. The idea of AFPs is also used in FATCAT [24].
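To make the distance-matrix representation concrete, the following minimal sketch builds the matrix of pairwise Cα distances and cuts out one 6×6 fragment. It is only an illustration of the data structure described above, not the DALI implementation; the function names `distance_matrix` and `fragment` and the input format (a list of (x, y, z) tuples) are assumptions made for this example.

```python
import math

def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise distances between C-alpha atoms.

    ca_coords is assumed to be a list of (x, y, z) tuples, one per residue.
    """
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

def fragment(matrix, row_start, col_start, size=6):
    """Cut a size x size block out of the matrix (hypothetical helper).

    Blocks of 6x6 cells correspond to the contact patterns mentioned in the text.
    """
    return [row[col_start:col_start + size]
            for row in matrix[row_start:row_start + size]]

# Three toy C-alpha positions chosen so that the distances are easy to check.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dm = distance_matrix(coords)
```

For the toy coordinates above, the matrix is symmetric with a zero diagonal, and the three off-diagonal distances are 5, 12 and 13.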

3 Two-phase Alignment Algorithm for Matching Protein 3D Structures

Protein similarity searching is typically performed by comparing the query protein (Q) specified by the user with successive proteins (D) from the database of protein structures. In this section we present our newly developed algorithm for fast and accurate comparison of two protein structures, which can be used to scan databases of proteins in order to find similar biological molecules.

3.1 First Phase – Comparison of Secondary Structures

In the first phase of the algorithm, protein structures Q and D are compared by aligning their reduced chains of secondary structures formed by secondary structure elements $SE_i$:

$$Q = (SE^Q_1, SE^Q_2, \ldots, SE^Q_n) \quad \text{and} \quad D = (SE^D_1, SE^D_2, \ldots, SE^D_m), \tag{1}$$

where $n$ is the number of secondary structures in the chain of the query protein Q, and $m$ is the number of secondary structures in the chain of the database protein D. Elements $SE^Q_i$ and $SE^D_j$, hereinafter referred to as SE regions or SE fragments, are built from groups of adjacent amino acids forming the same type of secondary structure (e.g. α-helix or β-strand, Fig. 1).

Fig. 1. Secondary structure elements: (left) four α-helices in sample structure [PDB ID: 1CE9], (right) two β-strands joined by a loop in sample structure [PDB ID: 1E0Q]; visualized by MViewer [23]

We call this phase of the algorithm the low resolution alignment. Each element $SE_i$, which is a part of the chain isolated on the basis of its secondary structure, is characterized by two values:

$$SE_i = [SSE_i, L_i], \tag{2}$$

where $SSE_i$ describes the type of secondary structure, and $L_i$ is the length of the $i$th element $SE_i$ (measured in residues). In the presented method, we distinguish three basic types of secondary structures:

– α-helix,

– β-sheet or β-strand,

– undetermined structure, which represents loops, turns or coils.
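The reduction of a chain to SE regions (eqs. 1 and 2) can be sketched as a run-length encoding of a per-residue secondary structure string. This is a minimal sketch under stated assumptions: the one-letter codes 'H' (α-helix), 'E' (β-sheet or β-strand) and 'L' (undetermined structure) and the function name `to_se_regions` are chosen here for illustration and do not come from the paper.

```python
from itertools import groupby

def to_se_regions(per_residue_sse):
    """Collapse a per-residue secondary structure string into SE regions.

    Each region is a pair (SSE type, length in residues), i.e. SE_i = [SSE_i, L_i].
    Assumed codes: 'H' = alpha-helix, 'E' = beta-sheet/strand, 'L' = undetermined.
    """
    return [(sse, len(list(run))) for sse, run in groupby(per_residue_sse)]

# A toy chain of 13 residues: loop, helix, loop, strand, loop.
regions = to_se_regions("LLHHHHHLLEEEL")
# regions == [('L', 2), ('H', 5), ('L', 2), ('E', 3), ('L', 1)]
```

The reduced chain is much shorter than the residue chain (5 elements instead of 13 in this toy case), which is what makes the first phase cheap.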


In order to match the structures Q and D, we use a modified version of the Smith-Waterman alignment algorithm [22]. In the course of the algorithm, we build the similarity matrix SSE of the size n × m, where n and m are the numbers of secondary structures in the compared chains of proteins Q and D, i.e. the numbers of fragments of the Q and D chains with a recognized secondary structure. Successive cells of the SSE matrix are filled according to the following rules:

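for $0 \le i \le n$ and $0 \le j \le m$:

$$SSE_{i,0} = SSE_{0,j} = 0, \tag{3}$$

$$SSE^{(1)}_{i,j} = SSE_{i-1,j-1} + \delta_{ij}, \tag{4}$$

$$SSE^{(2)}_{i,j} = \max_{1 \le k \le n} \{SSE_{i-k,j} - \omega_k\}, \tag{5}$$

$$SSE^{(3)}_{i,j} = \max_{1 \le l \le m} \{SSE_{i,j-l} - \omega_l\}, \tag{6}$$

$$SSE^{(4)}_{i,j} = 0, \tag{7}$$

$$SSE_{i,j} = \max_{v=1..4} \{SSE^{(v)}_{i,j}\}, \tag{8}$$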

where $\delta_{ij}$ is a similarity reward that determines the degree of similarity between the two components $SE^Q_i$ and $SE^D_j$ of proteins Q and D, and $\omega_k$, $\omega_l$ are the horizontal and vertical penalties for inserting a gap of length $k$ and $l$, respectively. The similarity reward $\delta_{ij}$ takes values from the interval $\langle 0; 1 \rangle$, where 0 means no similarity and 1 means the highest possible similarity. The degree of similarity is calculated using the formula:

$$\delta_{ij} = \sigma_{ij} - \left( \frac{\sigma_{ij} * |L^D_j - L^Q_i|}{L^D_j + L^Q_i} \right), \tag{9}$$

where $L^Q_i$, $L^D_j$ are the lengths of the compared regions $SE^Q_i$ and $SE^D_j$, while $\sigma_{ij}$ describes the degree of similarity of the secondary structures that build the $i$th and $j$th SE regions of the compared proteins Q and D. This parameter can take three possible values according to the following rules:

1. $\sigma_{ij} = 1$, when both SE regions have the same secondary structure of α-helix or β-strand;

2. $\sigma_{ij} = 0.5$, when at least one of the regions has undefined secondary structure;

3. $\sigma_{ij} = 0$, when one of the regions has the construction of α-helix and the second the construction of β-strand.
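The low resolution alignment can be sketched in a few lines. As a quick check of eq. 9: two helices of lengths 10 and 6 give $\delta = 1 - 1 \cdot 4 / 16 = 0.75$. The code below is a minimal sketch, not the authors' implementation. The paper does not specify the form of the gap penalty $\omega$, so the affine `gap_penalty` with its `gap_open` and `gap_extend` values is an assumption, as are the one-letter SSE codes ('H', 'E', 'L') and all function names. Gap lengths are limited to the current row or column index so that matrix indices stay non-negative.

```python
def sigma(sse_q, sse_d):
    """Rules 1-3: similarity of two secondary structure types."""
    if sse_q == "L" or sse_d == "L":
        return 0.5                      # rule 2: at least one region undetermined
    return 1.0 if sse_q == sse_d else 0.0   # rule 1 (same type) or rule 3 (helix vs strand)

def delta(se_q, se_d):
    """Eq. 9: sigma reduced in proportion to the relative length difference."""
    (sse_q, len_q), (sse_d, len_d) = se_q, se_d
    s = sigma(sse_q, sse_d)
    return s - s * abs(len_d - len_q) / (len_d + len_q)

def gap_penalty(length, gap_open=1.0, gap_extend=0.5):
    """Assumed affine penalty omega; the paper leaves its form unspecified."""
    return gap_open + gap_extend * (length - 1)

def low_resolution_matrix(query, database):
    """Fill the similarity matrix of eqs. 3-8 for two chains of SE regions."""
    n, m = len(query), len(database)
    sse = [[0.0] * (m + 1) for _ in range(n + 1)]            # eq. 3: zero borders
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = sse[i - 1][j - 1] + delta(query[i - 1], database[j - 1])  # eq. 4
            up = max(sse[i - k][j] - gap_penalty(k) for k in range(1, i + 1))    # eq. 5
            left = max(sse[i][j - l] - gap_penalty(l) for l in range(1, j + 1))  # eq. 6
            sse[i][j] = max(diag, up, left, 0.0)                                 # eqs. 7-8
    return sse

# Toy chains: helix, loop, strand; only the helix length differs.
q_chain = [("H", 10), ("L", 3), ("E", 5)]
d_chain = [("H", 6), ("L", 3), ("E", 5)]
matrix = low_resolution_matrix(q_chain, d_chain)
best = max(max(row) for row in matrix)
```

In the toy case the three regions align along the diagonal with rewards 0.75, 0.5 and 1.0, so the highest cell of the matrix equals 2.25.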

3.2 Second Phase – Alignment of Structural Signatures

Molecules that pass the first phase of the alignment (based on a user-defined cut-off value) are further aligned in the second phase. A pair of aligned molecules Q and D is now represented by structural signatures:

$$Q = (s^Q_1, s^Q_2, \ldots, s^Q_q) \quad \text{and} \quad D = (s^D_1, s^D_2, \ldots, s^D_d), \tag{10}$$

where $q$ is the length of the query protein Q (i.e. the number of its amino acids), $d$ is the length of the database protein D, and each $s_i$ corresponds to the $i$th amino acid in the chain of the protein Q or D and is defined by the following vector of features:

$$s_i = \langle |C_i|, \gamma_i, SSE_i, r_i \rangle, \tag{11}$$

where $|C_i|$ is the length of the vector between the Cα atoms of the $i$th and $(i+1)$th amino acids in a protein chain, $\gamma_i$ is the angle between the successive vectors $C_i$ and $C_{i+1}$, $SSE_i$ is the type of the secondary structure formed by the $i$th residue, and $r_i$ is the type of amino acid (Fig. 2).
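The geometric part of a structural signature (eq. 11) follows directly from the Cα coordinates. The sketch below is an illustration under stated assumptions: the angle is returned in radians, signatures are produced only for residues that have both a vector $C_i$ and a successor $C_{i+1}$ (the paper does not say how chain termini are handled), and the function names and input format are hypothetical.

```python
import math

def _sub(a, b):
    """Vector from point b to point a."""
    return tuple(x - y for x, y in zip(a, b))

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _angle(u, v):
    """Angle between two vectors in radians (assumed unit)."""
    cos = sum(a * b for a, b in zip(u, v)) / (_norm(u) * _norm(v))
    return math.acos(max(-1.0, min(1.0, cos)))   # clamp against rounding errors

def signatures(ca_coords, sse_types, residues):
    """Build s_i = (|C_i|, gamma_i, SSE_i, r_i) for each residue where both
    C_i and C_{i+1} exist; the last two residues are skipped (assumption)."""
    vectors = [_sub(ca_coords[i + 1], ca_coords[i])
               for i in range(len(ca_coords) - 1)]
    return [(_norm(vectors[i]), _angle(vectors[i], vectors[i + 1]),
             sse_types[i], residues[i])
            for i in range(len(vectors) - 1)]

# Four toy C-alpha atoms forming right angles, 3.8 units apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 0.0)]
sigs = signatures(coords, "EEEE", "MQIF")
```

For the toy chain, both signatures have a vector length of 3.8 and an angle of π/2.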

Fig. 2. Structural features included in structural signatures: (top) atomic representation with four residues visible (Met, Gln, Ile, Phe), (bottom left) vectors between Cα atoms, and γ angle, (bottom right) secondary structure element for particular residues (β-strand in the presented case)

The alignment of structural signatures is performed similarly to the first phase. However, it takes into account the result of the alignment of secondary structure elements $SE_i$ (SE regions) matched in the first phase. The results of the low resolution alignment are projected onto the new similarity matrix. For each aligned pair of regions $SE^Q_i$ and $SE^D_j$ from the previous phase, we calculate a new similarity matrix of the size $L^Q_i \times L^D_j$ and align the structural signatures that are inside the regions $SE^Q_i$ and $SE^D_j$. This matching process is called the high resolution alignment. It refines the results of the low resolution alignment, since it processes more structural features at a higher resolution.


The course of the high resolution alignment itself is analogous to the low resolution alignment. The main difference is the way in which CASSERT calculates the similarity reward for two compared elements, which in this case are two structural signatures $s_i$ and $s_j$. While calculating the similarity of two structural signatures, the algorithm takes into account the primary, secondary and tertiary structure of each single protein element (corresponding to one amino acid). The similarity reward is calculated according to the following formula:

$$ss_{ij} = w_C * \sigma^C_{ij} + w_\gamma * \sigma^\gamma_{ij} + w_{SSE} * \sigma^{SSE}_{ij} + w_r * \sigma^r_{ij}, \tag{12}$$

where $\sigma^C_{ij}$ is the degree of similarity of the vectors $C^Q_i$ and $C^D_j$ describing the location of the Cα carbon atoms of residues $i$ and $j$ in proteins Q and D, $\sigma^\gamma_{ij}$ is the similarity of the angles $\gamma^Q_i$ and $\gamma^D_j$ in proteins Q and D, $\sigma^{SSE}_{ij}$ is the degree of similarity of the secondary structures of residues $i$ and $j$ (calculated according to rules 1-3, as in the first phase), $\sigma^r_{ij}$ is the degree of similarity of residues defined by means of the BLOSUM62 substitution matrix normalized to the range $\langle 0; 1 \rangle$, and $w_C$, $w_\gamma$, $w_{SSE}$, $w_r$ are weights for the respective components (with a default value of 1).

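The similarity of the vectors $C^Q_i$ and $C^D_j$ is defined according to the formula:

$$\sigma^C_{ij} = \exp\left(-(|C^Q_i| - |C^D_j|)^2\right), \tag{13}$$

and the similarity of the angles $\gamma^Q_i$ and $\gamma^D_j$ is defined as follows:

$$\sigma^\gamma_{ij} = \exp\left(-(|\gamma^Q_i| - |\gamma^D_j|)^2\right). \tag{14}$$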

The value of the similarity degree of structural signatures $ss_{ij}$ (eq. 12) substitutes the similarity reward $\delta_{ij}$ (eq. 4) in the high resolution alignment.
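Equations 12-14 can be sketched as a single reward function. This is a minimal sketch under stated assumptions: a signature is a tuple (vector length, angle, SSE code, residue letter), the SSE codes 'H', 'E', 'L' are chosen for illustration, and `identity_similarity` is only a stand-in for the normalized BLOSUM62 lookup, which is not reproduced here.

```python
import math

def gaussian_similarity(x_q, x_d):
    """Eqs. 13 and 14: exp(-(|x_q| - |x_d|)^2), equal to 1 for identical values."""
    return math.exp(-(abs(x_q) - abs(x_d)) ** 2)

def sse_similarity(sse_q, sse_d):
    """Rules 1-3 from the first phase, applied to single residues."""
    if sse_q == "L" or sse_d == "L":
        return 0.5
    return 1.0 if sse_q == sse_d else 0.0

def identity_similarity(res_q, res_d):
    """Stand-in for BLOSUM62 normalized to <0; 1> (assumption for this example)."""
    return 1.0 if res_q == res_d else 0.0

def signature_similarity(sig_q, sig_d, residue_similarity=identity_similarity,
                         w_c=1.0, w_gamma=1.0, w_sse=1.0, w_r=1.0):
    """Eq. 12: weighted sum of the four component similarities."""
    len_q, gamma_q, sse_q, res_q = sig_q
    len_d, gamma_d, sse_d, res_d = sig_d
    return (w_c * gaussian_similarity(len_q, len_d)
            + w_gamma * gaussian_similarity(gamma_q, gamma_d)
            + w_sse * sse_similarity(sse_q, sse_d)
            + w_r * residue_similarity(res_q, res_d))

s_q = (3.8, 1.5, "H", "M")
s_d = (3.8, 1.5, "H", "Q")
full = signature_similarity(s_q, s_d)                 # residue types differ
structure_only = signature_similarity(s_q, s_q, w_r=0.0)
```

With default weights, two signatures that agree in geometry and secondary structure but differ in residue type score 3.0 under the stand-in residue similarity. Setting `w_r=0.0` removes the primary structure component, so identical signatures also score 3.0 instead of 4.0.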

3.3 Assessment of Protein Structure Similarity

In order to assess the similarity between two chains of structural signatures, we use the Score measure. This measure is obtained for the optimal alignment path in the similarity matrix (labeled in the second phase of the algorithm as matrix S). The value accumulates all the rewards for matches, mismatch penalties, and penalties for inserting gaps in the alignment (in accordance with equations 3-8) and is equal to the highest value in the similarity matrix S:

$$Score = \max \{S_{ij}\}, \tag{15}$$

where $i = 1, \ldots, q$, $j = 1, \ldots, d$, $q$ is the length of the query protein Q, and $d$ is the length of the database protein D.

The participation of each component in the similarity searching (eq. 12) can be controlled by means of participation weights, which are set by the user. For example, researchers looking only for surprising structural similarities that show no sequence similarity can disable the primary structure component by setting its weight to 0.


4 Results and Discussion

The effectiveness of the CASSERT algorithm was examined in a series of tests. These tests were performed with the use of the DALI database [8] storing 47 697 molecular structures and 106 858 chains. The database was installed locally on the MS SQL Server 2008 R2 database management system running under the Windows XP operating system. The size of the database was 12 GB. In the first part, the results of the CASSERT algorithm were compared to the results returned by the DALI algorithm. In the second part, we compared the alignments generated by the CASSERT and DALI algorithms. In the third part, we compared alignment times for the CASSERT, DALI, CE, and FATCAT algorithms.

In the first series of tests, we compared the lists of the one hundred most similar protein structures identified by both algorithms, CASSERT and DALI. For this purpose, we arbitrarily chose a set of query proteins (Q). This set contained query molecules of different lengths representing different structural classes according to the SCOP classification [18]: all α, all β, α&β, α + β. In this way, we could verify the efficacy of the low resolution alignment phase. The query protein structures differed in size (length). We identified three groups of protein structures: short-chain proteins (up to 100 amino acids), medium-sized proteins (up to 500 amino acids), and long chains (over 500 amino acids). Each group of protein structures was tested using each of the two algorithms. The convergence of results was observed at the level of 99.8% for short-chain proteins, 94.2% for medium-sized proteins, and 90.6% for long-chain proteins.

In the second series of experiments, we verified the structural alignments generated by both algorithms. In this case we also observed a large convergence of results. However, when analyzing the results, we also found some cases where the alignments were slightly different for the CASSERT and DALI algorithms.

One of these cases is presented in Fig. 3. It shows the structural alignment for a pair of sample structures [PDB ID: 1KDD, chain A] [11] and [PDB ID: 1CE9, chain B] [13] from the DALI database. The secondary structure of both molecules contains several α-helices. The alignments generated by the two algorithms are slightly different. In the alignment performed by the DALI algorithm we can see the structural similarity of only two amino acid residues (marked by a vertical line |), while in the structural alignment performed by the CASSERT algorithm the structural convergence of residues is much higher (marked by the ':' symbol). Since the structural convergence is accompanied by the convergence of amino acids (Fig. 3, right), we believe that the structural alignment of these positions is correct. The similarity or even identity of residues in the compared chains, especially when observed over several successive elements, very often involves a similar formation of the spatial structures of protein molecules. On this basis we conclude that including the sequence similarity in the structural alignment has a positive effect on the final result of the alignment. This was also observed in [5].

Fig. 3. Structural alignment generated by DALI algorithm (left) and CASSERT algorithm (right) for sample structures [PDB ID: 1KDD, chain A] and [PDB ID: 1CE9, chain B]

In the third series of experiments, we tested the performance of selected alignment methods. We examined CASSERT and three popular algorithms, DALI, CE and FATCAT, measuring the time of the alignment performed for pairs of molecules from the three groups of protein structures. Since protein structures are very complex and the search space is huge, alignment algorithms are usually time consuming. All algorithms complete the alignment process within several to tens of seconds. The alignment time depends strongly on the sizes (lengths) of the compared structures. In Fig. 4 we show the alignment time of the examined algorithms for a pair of sample molecules from the group of medium-sized proteins. Both compared structures had a length of 170 amino acids and represented different conformations of the same protein: human RAB5A and human RAB5A with a single mutation at Ala30. Tests were performed on a PC workstation with an Intel Xeon CPU and 2 GB RAM. In Fig. 4 we can observe that among all compared algorithms CASSERT has the lowest alignment time. This tendency was observed for all tested cases.

Knowing the processing time for just a single alignment, we can easily imagine how long it takes to find similar protein structures in a database storing 106 858 chains. This could take several days without any additional acceleration or filtering. Our research presented in [16], [14] confirmed that for the FATCAT algorithm this takes 25 hours when the process is run on 20 alignment agents working in parallel. CASSERT reduces this time at least by a factor of two, which we consider a good result.
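As a back-of-envelope check of these figures, the FATCAT scan time can be converted into an average time per alignment. The calculation assumes that all 106 858 chains are aligned against the query and that the 20 agents scale ideally; neither assumption is stated in the text, so the result is only indicative.

$$25\ \text{h} \times 20\ \text{agents} = 500\ \text{agent-hours} = 1.8 \times 10^{6}\ \text{s}, \qquad \frac{1.8 \times 10^{6}\ \text{s}}{106\,858\ \text{chains}} \approx 16.8\ \text{s per alignment}.$$

Under these assumptions the average falls within the range of several to tens of seconds reported for a single alignment, and halving the time corresponds to a full scan in at most about 12.5 hours on the same 20 agents.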

Fig. 4. Processing time while aligning a pair of molecular structures [PDB ID: 1N6H] vs. [PDB ID: 1N6N] for four tested algorithms: FATCAT, CE, DALI, and CASSERT

5 Concluding Remarks

The CASSERT algorithm is highly effective in protein structure similarity searching. It is also characterized by good precision, which was achieved by including a set of various features of protein construction in the comparison process. Our tests have shown that the result sets returned by the CASSERT algorithm, with its two-phase alignment of protein structures, are comparable to those returned by the DALI algorithm. The alignments generated by both algorithms are similar, although in the course of the research we found cases in which the CASSERT algorithm gave better alignment paths than the popular DALI. Moreover, the computational complexity of the CASSERT algorithm is lower than that of the competing DALI, CE and FATCAT algorithms, and through the use of the low resolution alignment phase, our algorithm requires fewer iterative alignments for successive structures from the database. This makes CASSERT a very useful tool in the identification of proteins on the basis of their structures and in the identification of potential functions of these proteins.

Acknowledgments. This work was supported by the European Union from the European Social Fund (grant agreement number: UDA-POKL.04.01.01-00-106/09).

References

1. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)

2. Branden, C., Tooze, J.: Introduction to Protein Structure, 2nd edn. Garland Science (1999)

3. Burkowski, F.: Structural Bioinformatics: An Algorithmic Approach, 1st edn. Chapman and Hall/CRC (2008)

4. Can, T., Wang, Y.: CTSS: A robust and efficient method for protein structure alignment based on local geometrical and biological features. In: Proceedings of the 2003 IEEE Bioinformatics Conference (CSB 2003), pp. 169–179 (2003)

5. Daniels, N.M., Nadimpalli, S., Cowen, L.J.: Formatt: Correcting Protein Multiple Structural Alignments by Incorporating Sequence Alignment. BMC Bioinformatics 13:259 (2012)

6. Daniluk, P., Lesyng, B.: A novel method to compare protein structures using local descriptors. BMC Bioinformatics 12:344 (2011)

7. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr Opin Struct Biol 6(3), 377–385 (1996)

8. Holm, L., Kaariainen, S., Rosenstrom, P., Schenkel, A.: Searching protein structure databases with DaliLite v.3. Bioinformatics 24, 2780–2781 (2008)

9. Holm, L., Sander, C.: Protein structure comparison by alignment of distance matrices. J Mol Biol 233(1), 123–38 (1993)

10. Jamroz, M., Kolinski, A.: ClusCo: clustering and comparison of protein models. BMC Bioinformatics 14:62 (2013)

11. Keating, A., Malashkevich, V., Tidor, B., Kim, P.: Side-chain repacking calculations for predicting structures and stabilities of heterodimeric coiled coils. Proc Natl Acad Sci USA 98(26), 14825–30 (2001)

12. Krygowski, A., Małysiak-Mrozek, B., Mrozek, D.: Two-phase alignment algorithm for protein structure similarity searching. Studia Informatica vol. 33 No. 2A(105), 525–541 (2012)

13. Lu, M., Shu, W., Ji, H., Spek, E., Wang, L., Kallenbach, N.: Helix capping in the GCN4 leucine zipper. J Mol Biol 288(4), 743–52 (1999)

14. Małysiak-Mrozek, B., Momot, A., Mrozek, D., Hera, Ł., Kozielski, S., Momot, M.: Scalable System for Protein Structure Similarity Searching. Lecture Notes in Computer Science 6923, pp. 271–280 (2011)

15. Minami, S., Sawada, K., Chikenji, G.: MICAN: a protein structure alignment algorithm that can handle Multiple-chains, Inverse alignments, Cα only models, Alternative alignments, and Non-sequential alignments. BMC Bioinformatics 14:24 (2013)

16. Momot, A., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D., Hera, Ł., Gorczynska-Kosiorz, S., Momot, M.: Improving Performance of Protein Structure Similarity Searching by Distributing Computations in Hierarchical Multi-Agent System. LNAI 6421, pp. 320–329 (2010)

17. Mosca, R., Brannetti, B., Schneider, T.R.: Alignment of protein structures in the presence of domain motions. BMC Bioinformatics 9:352 (2008)

18. Murzin, A., Brenner, S., Hubbard, T., Chothia, C.: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536–540 (1995)

19. Sam, V., Tai, C.H., Garnier, J., Gibrat, J.F., Lee, B., Munson, P.J.: Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinformatics 9:74 (2008)

20. Shapiro, J., Brutlag, D.: FoldMiner and LOCK2: protein structure comparison and motif discovery on the web. Nucleic Acids Res 32, 536–41 (2004)

21. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9), 739–747 (1998)

22. Smith, T., Waterman, M.: Identification of common molecular subsequences. J Mol Biol 147, 195–197 (1981)

23. Stanek, D., Mrozek, D., Małysiak-Mrozek, B.: MViewer: Visualization of protein molecular structures stored in the PDB, mmCIF and PDBML data formats (2013)

24. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(2), 246–255 (2003)

25. Yuan, C., Chen, H., Kihara, D.: Effective inter-residue contact definitions for accurate protein fold recognition. BMC Bioinformatics 13:292 (2012)

26. Zhu, J., Weng, Z.: FAST: A novel protein structure algorithm. Proteins 58, 618–627 (2005)