presentation 2007 journal club azhar ali shah

“Rapid Methods for Comparing Protein Structures and Scanning Structure Databases” [Oliviero Carugo, Current Bioinformatics;1(1), 2006] Azhar Ali Shah Computational Foundations of Nanoscience Journal Club (CFNJC) CFNJC, October 19, 2007


Page 1: Presentation 2007 Journal Club Azhar Ali Shah

“Rapid Methods for Comparing Protein Structures and Scanning Structure Databases”

[Oliviero Carugo, Current Bioinformatics; 1(1), 2006]

Azhar Ali Shah
Computational Foundations of Nanoscience Journal Club (CFNJC)

CFNJC, October 19, 2007

Page 2: Presentation 2007 Journal Club Azhar Ali Shah

Azhar A Shah

Rapid Methods for Comparing Protein Structures

and Scanning Structure Databases

2/31

Overview
  - Introduction
      - About the author
      - Problem
      - Requirements
      - Motivations
      - Background
  - Classification of methods
  - Summary
  - Observations

Page 3: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: about the author 1/2
  - Name: Oliviero Carugo
  - Nationality: Italian and French
  - Education:
      - PhD (Chemistry), Univ. of Pavia, Italy (1985-1986)
      - Post Doc (Structural Biology Program), EMBL, Heidelberg, Germany (1995-2000)
  - Current position:
      - AP, Dept. of General Chemistry, Univ. of Pavia, Italy (2000 --)
      - Visiting Professor, Dept. of Biomolecular Structural Chemistry, University of Vienna, Austria (2005 --)

Page 4: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: about the author 2/2
  - Research interests:
      - Structural bioinformatics: estimation of protein structure similarity, prediction of inter-molecular interactions, prediction of crystallizability of gene products
  - DBLP: Carugo
      - CX, DPX and PRIDE: WWW servers for the analysis and comparison of protein 3D structures. Nucleic Acids Research 33(Web-Server-Issue): 252-254 (2005)
      - DPX: for the analysis of the protein core. Bioinformatics 19(2): 313-314 (2003)
      - Prediction of protein polypeptide fragments exposed to the solvent. In Silico Biology 3: 35 (2003)
      - CX, an algorithm that identifies protruding atoms in proteins. Bioinformatics 18(7): 980-984 (2002)

Page 5: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: problem 1/2
  - The complexity of structural biological information is increasing more rapidly than computer performance
  - Consider:
      - Number of PDB entries as a measure of structural biological information (PDB Graph)
      - Number of transistors per IC as a measure of computer performance (Moore’s Law)
  - Evaluation for 3 decades (1971 to 2003) gives:

Page 6: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: problem 2/2
  [Figure: number of PDB structures compared with number of transistors per IC (x 100,000)]
  Confusing description!
  - Total structures in 2003: 20,000
  - Yearly growth in 2003: 5,000

Page 7: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: requirement
  - Fast algorithms and protocols to measure similarity between protein 3D structures available in large-scale databases

Page 8: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: motivations
  - The estimation of similarity between protein 3D structures helps in:
      - Molecular evolution
      - Molecular modelling
      - Function prediction
      - Database scanning

Page 9: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: background 1/3
  - So many algorithms:
      - Each biological problem requires its own comparison method
      - Different problems need different logical approaches

Page 10: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: background 2/3
  - Slow methods
      - Careful examination of proximity among two or more proteins using structural alignment
      - Too slow for large databases
      - Often use a two-step strategy
          - Coarse structure representation (e.g. SSE)
          - Fine structure representation (e.g. positions of Cα atoms)
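The two-step strategy can be sketched as a cheap filter followed by an expensive re-scoring of the survivors. This is only an illustration under stated assumptions: the function names, the toy SSE strings and both scoring functions are hypothetical stand-ins, not the scores used by any of the methods discussed in the talk.

```python
from typing import Callable, Dict, List, Tuple


def two_step_scan(
    query: str,
    database: Dict[str, str],
    coarse_score: Callable[[str, str], float],
    fine_score: Callable[[str, str], float],
    keep: int = 2,
) -> List[Tuple[str, float]]:
    """Rank database entries with a cheap filter followed by an expensive score."""
    # Step 1: score every entry with the cheap, coarse measure.
    ranked = sorted(
        database, key=lambda name: coarse_score(query, database[name]), reverse=True
    )
    shortlist = ranked[:keep]
    # Step 2: re-score only the shortlist with the slow, detailed measure.
    fine = [(name, fine_score(query, database[name])) for name in shortlist]
    return sorted(fine, key=lambda item: item[1], reverse=True)


def sse_composition_score(a: str, b: str) -> float:
    """Toy coarse score (assumption): negative difference in helix/strand counts."""
    return -float(abs(a.count("H") - b.count("H")) + abs(a.count("E") - b.count("E")))


def position_identity_score(a: str, b: str) -> float:
    """Toy fine score (assumption): fraction of identical positions."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


# Hypothetical database of per-SSE strings (H = helix, E = strand).
db = {"p1": "HHEEHH", "p2": "EEEEEE", "p3": "HHEHHE", "p4": "HHHHHH"}
hits = two_step_scan("HHEEHH", db, sse_composition_score, position_identity_score, keep=2)
```

Only the `keep` best coarse hits reach the expensive step, which is where the time saving of a two-step method would come from.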

Page 11: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: background 3/3
  - Fast methods
      - Used for large-scale databases
      - Work on a coarse representation of protein structures
      - Results are less accurate and detailed (e.g. no structural alignment)

Page 12: Presentation 2007 Journal Club Azhar Ali Shah


Introduction: focus of the paper
  - Fast comparison methods that can handle large-scale structural databases
  “Rapid Methods for Comparing Protein Structures and Scanning Structure Databases”

Page 13: Presentation 2007 Journal Club Azhar Ali Shah


Classification of methods
  - Based on the representation of the protein’s 3D structure:
      - String
      - Array
      - Secondary structure elements (SSEs)
      - Backbone

Page 14: Presentation 2007 Journal Club Azhar Ali Shah


String representation 1/4
  - Uncommon but appealing
      - Allows sequence alignment methods to be used to compare 3D structures
  - A 3D structure of n residues/SSEs (or other structural units) is represented by n characters
  - Characters are chosen from an alphabet
  - Each character has associated structural features

Page 15: Presentation 2007 Journal Club Azhar Ali Shah


String representation 2/4
  - Problem:
      - Difficult to design an appropriate alphabet that describes the 3D structural features well
  - Comparison methods based on strings:
      - TOPSCAN (Martin ACR, Protein Eng, 2000), UCL
          - Uses the STRIDE program to identify SSEs
          - Builds the vectors between the endpoints of SSEs
          - Each SSE is associated with one of 12 characters on the basis of the largest component of its vector
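The encoding step can be illustrated with a short sketch. The slide only states that there are 12 characters and that the choice depends on the dominant vector component; the particular alphabet below (2 SSE types times 6 axis directions, letters A to L) is an assumption made for illustration and is not taken from TOPSCAN itself.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

# Hypothetical 12-letter alphabet: 2 SSE types x 6 directions (+x, -x, +y, -y, +z, -z).
ALPHABET = {"H": "ABCDEF", "E": "GHIJKL"}


def sse_letter(kind: str, start: Point, end: Point) -> str:
    """Map one SSE to a letter using the largest component of its end-to-end vector."""
    vector = [e - s for s, e in zip(start, end)]
    axis = max(range(3), key=lambda i: abs(vector[i]))
    index = 2 * axis + (0 if vector[axis] >= 0 else 1)
    return ALPHABET[kind][index]


def encode_structure(sses: Sequence[Tuple[str, Point, Point]]) -> str:
    """Concatenate the letters of all SSEs, in chain order, into one string."""
    return "".join(sse_letter(kind, start, end) for kind, start, end in sses)


# Toy input: (SSE type, start point, end point); coordinates are made up.
example = [
    ("H", (0.0, 0.0, 0.0), (10.0, 2.0, 1.0)),    # helix pointing mostly along +x
    ("E", (10.0, 2.0, 1.0), (9.0, -6.0, 0.0)),   # strand pointing mostly along -y
    ("H", (9.0, -6.0, 0.0), (8.0, -5.0, 12.0)),  # helix pointing mostly along +z
]
topology = encode_structure(example)
```

Once every structure is reduced to such a string, ordinary sequence alignment tools can be applied to it.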

Page 16: Presentation 2007 Journal Club Azhar Ali Shah


String representation 3/4

Page 17: Presentation 2007 Journal Club Azhar Ali Shah


String representation 4/4
  - Uses the Needleman and Wunsch algorithm on the string representations of two 3D structures and calculates the percentage similarity score using the following scheme
  Should be 10?
  How fast is TOPSCAN?
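A minimal sketch of global alignment scoring in the Needleman and Wunsch style is given here for orientation. The scoring scheme from the slide is not preserved in this transcript, so the match, mismatch and gap values and the normalisation to a percentage are assumptions, not TOPSCAN's actual parameters.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best global alignment of a and b (dynamic programming)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def percent_similarity(a: str, b: str, match: int = 1,
                       mismatch: int = -1, gap: int = -1) -> float:
    """Alignment score as a percentage of a perfect match (assumed normalisation)."""
    best = match * max(len(a), len(b))
    if best == 0:
        return 100.0
    raw = needleman_wunsch_score(a, b, match, mismatch, gap)
    return 100.0 * max(0, raw) / best


identical = percent_similarity("AJE", "AJE")
one_gap = needleman_wunsch_score("AJE", "AE")
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the cost grows with the product of the two string lengths; for strings of SSE letters these lengths are small.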

Page 18: Presentation 2007 Journal Club Azhar Ali Shah


Array representation 1/4
  - 3D structure represented as a fixed-length array of real numbers
  - Benefits:
      - For the comparison of equal-length arrays there are well-assessed mathematical tools based on proximity detection
      - E.g. Euclidean distance between two points in an orthogonal space
  - Problems:
      - Definition of the array
      - No obvious way to describe an object by means of a predefined set of variables
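As a small illustration of the proximity tools mentioned above, the Euclidean distance between two fixed-length descriptor arrays can be computed as follows. The descriptor values are made up; how to define such an array for a protein is exactly the open problem noted on the slide.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two arrays of equal length."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical three-element structure descriptors.
distance = euclidean_distance([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

The length check reflects the key restriction of this representation: the comparison is only defined when both structures are described by the same number of variables.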

Page 19: Presentation 2007 Journal Club Azhar Ali Shah


Array representation 2/4
  - Comparison methods based on arrays:
      - PRIDE (Carugo and Pongor, J Mol Biol 2002)
          - Uses distances between Cα atoms to represent the 3D structure
          - 28 histograms are computed for each structure, e.g. of the distances between Cα(i) and Cα(i+n), 3 ≤ n ≤ 30
          - Fold similarity of two structures is estimated as the average of the probability of identity scores obtained from the pairwise comparison of the 28 histograms
          - Two histograms are compared through a contingency table and χ2 test to obtain the probability of identity score
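The histogram construction and the contingency-table comparison can be sketched as below. This is a rough illustration, not the PRIDE implementation: the bin width, the number of bins and the toy coordinates are assumptions, and the final conversion of the χ2 statistic into a probability of identity (which needs the χ2 distribution) is left out.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_histogram(ca: Sequence[Point], n: int,
                       bin_width: float = 1.0, n_bins: int = 100) -> List[int]:
    """Histogram of Calpha(i)-Calpha(i+n) distances; bin settings are assumptions."""
    counts = [0] * n_bins
    for i in range(len(ca) - n):
        d = math.dist(ca[i], ca[i + n])
        counts[min(int(d / bin_width), n_bins - 1)] += 1
    return counts


def all_histograms(ca: Sequence[Point]) -> Dict[int, List[int]]:
    """One histogram per sequence separation n, 3 <= n <= 30 (28 in total)."""
    return {n: distance_histogram(ca, n) for n in range(3, 31)}


def chi_square_statistic(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Chi-square statistic of the 2 x k contingency table built from two histograms."""
    total1, total2 = sum(h1), sum(h2)
    if total1 == 0 or total2 == 0:
        raise ValueError("both histograms must contain at least one distance")
    grand = total1 + total2
    statistic = 0.0
    for c1, c2 in zip(h1, h2):
        column = c1 + c2
        if column == 0:
            continue  # bins that are empty in both histograms carry no information
        expected1 = total1 * column / grand
        expected2 = total2 * column / grand
        statistic += (c1 - expected1) ** 2 / expected1 + (c2 - expected2) ** 2 / expected2
    return statistic


# Toy chain: 40 points spaced 3.8 units apart on a straight line (not a real protein).
chain = [(3.8 * i, 0.0, 0.0) for i in range(40)]
hists = all_histograms(chain)
self_statistic = chi_square_statistic(hists[3], hists[3])
```

Because every structure yields the same 28 histograms regardless of its length, two structures can be compared without computing an alignment.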

Page 20: Presentation 2007 Journal Club Azhar Ali Shah


Page 21: Presentation 2007 Journal Club Azhar Ali Shah


Array representation 4/4
  - PRIDE results agree with CATH
  - Fast comparison
      - 1000 comparisons per second
      - SGI R10000 system at 200 MHz

Page 22: Presentation 2007 Journal Club Azhar Ali Shah


Secondary structural elements (SSEs) 1/6
  - Simplified description of 3D structure
      - i.e. a few tens of SSEs as compared to several tens or hundreds of residues
  - A smaller number of variables makes comparison easier

Page 23: Presentation 2007 Journal Club Azhar Ali Shah


Secondary structural elements (SSEs) 2/6
  - Different ways to represent protein 3D structure by means of SSEs
      - Secondary structural assignments
      - SSE approximation

Page 24: Presentation 2007 Journal Club Azhar Ali Shah


Secondary structural elements (SSEs) 3/6
  - Secondary structural assignments
      - Different assignments with different programs
          - Due to variable torsion angles along the backbone
      - Common methods:
          - DSSP (Kabsch and Sander, Biopolymers 1983)
              - Dictionary of protein secondary structures
              - Looks for hydrogen bonds between main-chain atoms and assigns each residue one of eight types of secondary structure conformation
          - STRIDE (Frishman and Argos, Proteins 1995)
              - Uses both hydrogen bonds and torsion angles to assign secondary structures

Page 25: Presentation 2007 Journal Club Azhar Ali Shah


Secondary structural elements (SSEs) 4/6
  - Other methods for SSE assignments
      - P-Curve
      - DEFINE
      - SSA
      - VADAR
      - Voronoi Tessellations
  - Contradiction in results
      - DSSP and STRIDE agree in 96% (for 707 proteins)
      - DSSP, STRIDE, DEFINE agree in 71% (for 126 proteins)
      - DSSP, DEFINE, P-Curve agree in 63% (for 154 proteins)
  Secondary structure assignments are quite ambiguous and inconsistent! (consensus based on majority vote needed)
  Serious limitation of the methods that compare 3D structures based on SSE arrangements
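A consensus by majority vote could look like the following sketch. The three input strings are invented examples, not real DSSP, STRIDE or DEFINE output, and the rule of marking residues without a strict majority as undecided is an assumption.

```python
from collections import Counter
from typing import Sequence


def consensus_assignment(assignments: Sequence[str], undecided: str = "?") -> str:
    """Per-residue majority vote over equal-length secondary structure strings."""
    if len({len(a) for a in assignments}) != 1:
        raise ValueError("need at least one string, all of the same length")
    consensus = []
    for column in zip(*assignments):
        label, votes = Counter(column).most_common(1)[0]
        # Require a strict majority; otherwise mark the residue as undecided.
        consensus.append(label if 2 * votes > len(column) else undecided)
    return "".join(consensus)


def percent_agreement(a: str, b: str) -> float:
    """Percentage of residues that receive the same label in two assignments."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty strings of the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical assignments for a nine-residue fragment (H helix, E strand, C coil).
program_a = "HHHHCEEEC"
program_b = "HHHCCEEEE"
program_c = "HHHHCCEEC"
consensus = consensus_assignment([program_a, program_b, program_c])
```

With three voters a strict majority always exists when only two labels compete, so the undecided marker appears only when all three programs disagree.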

Page 26: Presentation 2007 Journal Club Azhar Ali Shah


Secondary structural elements (SSEs) 5/6
  - SSE approximations
      - As a vector from N to C terminus
      - Differ from arrays in terms of variable length
          - Well-assessed mathematical tools cannot be used
      - Different ways

Page 27: Presentation 2007 Journal Club Azhar Ali Shah


Secondary structural elements (SSEs) 6/6
  - Two-step methods based on SSEs
      - SSM (Krissinel and Henrick, EMBL 2003)
          - Secondary Structure Matching
          - http://www.ebi.ac.uk/msd-srv/ssm/
          - Protein 3D structures are represented as graphs
              - Nodes are SSEs
          - Graph comparison results in identification of equivalent residues
          - Subsequent minimization of RMSD between equivalent residues
      - DEJAVU (http://xray.bmc.uu.se/usf/)
      - Matras (http://biunit.naist.jp/matras/)
      - VAST (http://www.ncbi.nlm.nih.gov/Structure/VAST)
  Statistical performance of SSM or other methods?
  Two-step methods are slow?

Page 28: Presentation 2007 Journal Club Azhar Ali Shah


Backbone representations
  - Use vector-based profiles to describe trajectories from the N to the C terminus of the backbone
  - The trajectory can be described as a simple curve
      - Each residue is associated with the curvature and torsion of the curve
      - Differences of these parameters are used to compare two 3D structures
  - Useful when one compares the same protein in two different states (e.g. with or without a substrate, inhibitors, cofactors, etc.)
  - Hard to handle gaps and insertions
  - Hardly used in the general case for similarity evaluation, and hence no public web servers are available
  However?
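A discrete version of the curvature and torsion profile can be sketched as follows. The slide does not say how these quantities are computed, so the choices here are assumptions: curvature is taken as the inverse radius of the circle through three consecutive points, torsion as the dihedral angle of four consecutive points, and two profiles are compared by their mean absolute difference.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def norm(a: Vec) -> float:
    return math.sqrt(dot(a, a))


def curvature(p0: Vec, p1: Vec, p2: Vec) -> float:
    """Inverse radius of the circle through three points (0.0 if collinear)."""
    u, v, w = sub(p1, p0), sub(p2, p1), sub(p2, p0)
    denominator = norm(u) * norm(v) * norm(w)
    return 0.0 if denominator == 0 else 2.0 * norm(cross(u, v)) / denominator


def torsion_angle(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Dihedral angle in radians defined by four consecutive points."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    length = norm(b1)
    if length == 0:
        return 0.0
    n1, n2 = cross(b0, b1), cross(b1, b2)
    m1 = cross(n1, (b1[0] / length, b1[1] / length, b1[2] / length))
    return math.atan2(dot(m1, n2), dot(n1, n2))


def curvature_profile(points: Sequence[Vec]) -> List[float]:
    """Curvature at every interior point of the trace."""
    return [curvature(*points[i:i + 3]) for i in range(len(points) - 2)]


def mean_profile_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference of two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length (no gaps or insertions)")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a) if a else 0.0


arc = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]   # points on a unit circle
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # straight segment
difference = mean_profile_difference(curvature_profile(arc), curvature_profile(line))
```

The length check in `mean_profile_difference` shows why gaps and insertions are awkward for this representation: the profiles are compared position by position.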

Page 29: Presentation 2007 Journal Club Azhar Ali Shah


Comparison between various methods
  - For 86 queries, DALI gives the best quality of results as compared to CE, Matras, PRIDE, SGM, Structal and VAST (Sierk and Pearson, Protein Sci 2004)
  - For 70 queries, CE, Dali, VAST and Matras provide better quality of results with high speed as compared to DEJAVU, Lock, PRIDE, SSM, TOP, TOPS and TOPSCAN (Novotny et al., Proteins 2004)
  Strange!
  Speed also depends on the power of the computing environment the algorithm runs on.

Page 30: Presentation 2007 Journal Club Azhar Ali Shah


Summary
  - Rapid methods may use a coarse representation of 3D structures in the following forms:
      - Strings
          - E.g. TOPSCAN
      - Arrays
          - E.g. PRIDE
      - SSEs
          - Two-step methods: SSM, DEJAVU, Matras, VAST
      - Backbone
          - Algorithmic-level studies: no public web servers
  - Comparison on the same collection of data in the same computing environment is useful:
      - To benchmark the state of the art of fast procedures

Page 31: Presentation 2007 Journal Club Azhar Ali Shah


Observations:
  - Actual benchmarking of rapid methods on large-scale databases
  - Proper evaluation of methods based on different representations of the protein’s 3D structure
  - Full classification of methods based on structure representation

Page 32: Presentation 2007 Journal Club Azhar Ali Shah


Source: www.intel.com/research/silicon/mooreslaw.htm

Page 33: Presentation 2007 Journal Club Azhar Ali Shah


[Figure: PDB growth, total and yearly number of structures]
Source: www.ncsb.org