Protein Sequence- and Structure-based Similarity Networks
Department of Informatics and Communications, University of Athens, Greece
Athens, May 7th, 2008
Ioannis Valavanis (1), George Spyrou (2), Konstantina Nikita (1)
(1) Biomedical Simulations and Imaging Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Greece
(2) Biomedical Research Foundation of the Academy of Athens, Greece
Outline
Introduction
Construction of Similarity Networks
Analysis of Similarity Networks
Conclusions
Future Work
Networks and Real World Systems
Network: a set of vertices (elements of a system) and edges (interrelations between elements)
Any real-world system can be described by a network:
  Biological and chemical systems
  Socially interacting species
  Computer networks and the Internet
Quantifying a network (N vertices, K edges) reveals information about the system:
  Network degree k = 2K/N
  Sparsity S = 2K/(N(N-1))
  Centrality measurements of each vertex (degree of a vertex, betweenness of a vertex)
  Randomness or regularity, based on the average minimum path length (L) and the clustering coefficient (C)
Most real-world systems behave like Small World Networks (SWNs) (Strogatz, 2001)
In SWNs, a few vertices (hubs) dominate the network activity (Barabasi, 2003)
Some known SWNs:
  Film actors
  Peer-to-peer network
  Metabolic network
  Yeast protein interactions
  Functional cortical connectivity network
a) Regular: large L and C
b) Small World Network: small or intermediate L and large C
c) Random: small L and C
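The two global quantities defined on this slide, k = 2K/N and S = 2K/(N(N-1)), can be sketched in a few lines. The toy edge list and the function names below are illustrative assumptions, not part of the study.

```python
def network_degree(n_vertices: int, n_edges: int) -> float:
    """Average degree k = 2K/N of an undirected network."""
    return 2 * n_edges / n_vertices


def sparsity(n_vertices: int, n_edges: int) -> float:
    """Fraction S = 2K/(N(N-1)) of all possible edges that are present."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


# Hypothetical toy network: a square with one diagonal (4 vertices, 5 edges).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
N, K = 4, len(edges)
k = network_degree(N, K)
S = sparsity(N, K)
print(f"k = {k}, S = {100 * S:.1f}%")
```

Note that the two definitions imply k = S(N-1), so either quantity can be recovered from the other once N is known.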
Network-based description of Proteins
Folded proteins have been transformed into networks (Vendruscolo et al., 2002; Greene and Higman, 2003)
  vertices: residues
  edges: given by a distance criterion
  Protein network: SWN
Residue closeness in residue interaction graphs identified functional residues (Amitai et al., 2004)
Protein similarity networks were constructed, and the affinity of a protein within the network was used to classify it at the superfamily, family and fold level (Camoglu et al., 2006)
The structural similarity network of representative proteins of folds was shown to be a SWN (Sun et al., 2006)
Current Study
Quantify the structure of both protein sequence and structural similarity networks under several similarity criteria
Compare the protein sequence and structural similarity networks at the same sparsity
Find hubs in the protein sequence and structural similarity networks and examine their biological significance
Extend the results to “fold” sequence and structural similarity networks
Assess how informative sequence-derived features are, compared with structures, for fold and class assignment
Protein Sequence Similarity Network
Well-defined, widely used dataset of 311 sequences organized at the fold and class level (Ding and Dubchak, 2001)
  27 popular folds
  4 classes (α, β, α/β, α+β)
Each sequence is represented by 125 sequence-derived features (Dubchak et al., 1999)
  Amino acid composition (20 features)
  Predicted secondary structure (21 features)
  Hydrophobicity (21 features)
  Normalized van der Waals volume (21 features)
  Polarity (21 features)
  Polarizability (21 features)
Each sequence is a vertex in the network
An edge connects two vertices when the Euclidean distance between their feature vectors does not exceed a threshold (dist ≤ 2.45, dist ≤ 2.1, dist ≤ 1.9 or dist ≤ 1.6)
No.  Fold                                  Nseq.
α class
 1   Globin-like                           13
 2   Cytochrome c                           7
 3   DNA-binding 3-helical bundle          12
 4   4-helical up-and-down bundle           7
 5   4-helical cytokines                    9
 6   Alpha; EF-hand                         6
β class
 7   Immunoglobulin-like β-sandwich        30
 8   Cupredoxins                            9
 9   Viral coat and capsid protein         16
10   ConA-like lectins/glucanases           7
11   SH3-like barrel                        8
12   OB-fold                               13
13   Trefoil                                8
14   Trypsin-like proteases                 9
15   Lipocalins                             9
α/β class
16   (TIM)-barrel                          29
17   FAD (also NAD)-binding motif          11
18   Flavodoxin-like                       11
19   NAD(P)-binding Rossmann-fold          13
20   P-loop containing nucleotide          10
21   Thioredoxin-like                       9
22   Ribonuclease H-like motif             10
23   Hydrolases                            11
24   Periplasmic binding protein-like      11
α+β class
25   β-grasp                                7
26   Ferredoxin-like                       13
27   Small inhibitors, toxins, lectins     13
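The edge rule of the sequence similarity network (connect two sequences when the Euclidean distance between their feature vectors is within a threshold) could be sketched as follows. The 3-dimensional vectors are made-up stand-ins for the 125 sequence-derived features, and how the real features were scaled is not stated on the slide, so this is only an illustration of the thresholding step.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def threshold_network(features, max_dist):
    """Return edges (i, j), i < j, whose feature vectors lie within max_dist."""
    return {
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(features), 2)
        if euclidean(u, v) <= max_dist
    }


# Hypothetical feature vectors (the study uses 125 features per sequence).
features = [(0, 0, 0), (1, 0, 0), (0, 3, 0), (5, 5, 5)]
print(threshold_network(features, 2.45))
```

Lowering `max_dist` can only remove edges, which is why stricter thresholds produce sparser networks with more isolated vertices.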
Protein Structural Similarity Network
All proteins of the sequence data set were searched in the Protein Data Bank (296 fully submitted structures were found)
Each protein is represented by its 3-dimensional structure (3D coordinates of all atoms)
Structural similarity is given by the Z-score of the structural alignment computed with DALI (Holm and Park, 2000)
  Z > 0: some similarity is found
  Z ≥ 2: significant similarity is found (Sun et al., 2006; Getz et al., 2004)
Each protein structure is a vertex in the network
An edge connects two vertices when the Z-score of their structural similarity satisfies the criterion (Z > 0, Z ≥ 1 or Z ≥ 2)
No.  Fold                                  Nstruc.
α class
 1   Globin-like                           13
 2   Cytochrome c                           6
 3   DNA-binding 3-helical bundle          12
 4   4-helical up-and-down bundle           7
 5   4-helical cytokines                    7
 6   Alpha; EF-hand                         6
β class
 7   Immunoglobulin-like β-sandwich        27
 8   Cupredoxins                            9
 9   Viral coat and capsid protein         16
10   ConA-like lectins/glucanases           6
11   SH3-like barrel                        8
12   OB-fold                               13
13   Trefoil                                7
14   Trypsin-like proteases                 8
15   Lipocalins                             9
α/β class
16   (TIM)-barrel                          28
17   FAD (also NAD)-binding motif          11
18   Flavodoxin-like                       11
19   NAD(P)-binding Rossmann-fold          13
20   P-loop containing nucleotide           9
21   Thioredoxin-like                       9
22   Ribonuclease H-like motif              9
23   Hydrolases                            11
24   Periplasmic binding protein-like      11
α+β class
25   β-grasp                                7
26   Ferredoxin-like                       11
27   Small inhibitors, toxins, lectins     12
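For the structural network, the same idea applies to pairwise Z-scores instead of distances. The sketch below assumes the Z-scores are already available as a mapping from protein pairs to values; the identifiers and scores are invented, and obtaining real scores from DALI is outside its scope. The `strict` flag distinguishes the Z > 0 criterion from the Z ≥ 1 and Z ≥ 2 criteria.

```python
def zscore_network(z_scores, threshold, strict=False):
    """Edges for pairs whose Z-score passes the threshold.

    z_scores maps an unordered pair (a, b) to its Z-score.
    strict=True applies Z > threshold (as in Z > 0); otherwise Z >= threshold.
    """
    passes = (lambda z: z > threshold) if strict else (lambda z: z >= threshold)
    return {tuple(sorted(pair)) for pair, z in z_scores.items() if passes(z)}


# Hypothetical pairwise Z-scores for three structures.
z = {("p1", "p2"): 12.4, ("p1", "p3"): 1.3, ("p2", "p3"): 0.0}
print(zscore_network(z, 0, strict=True))
print(zscore_network(z, 2))
```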
Quantification of Similarity Networks
For each network (N vertices, K edges) we calculated:
  network degree k = 2K/N
  sparsity S = 2K/(N(N-1))
  fraction of finite paths
  number of isolated vertices
For each network, we obtained its fully connected version (a single connected component) by removing the isolated vertices
The network parameters were then recalculated
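A minimal sketch of these measurements follows. It assumes that "isolated vertices" means every vertex outside the largest connected component (orphans as well as small isolated groups), and that the fraction of finite paths is the share of ordered vertex pairs lying in the same component. Both readings are assumptions, although they agree with the vertex counts reported in this talk.

```python
from collections import deque


def components(n_vertices, edges):
    """Connected components of an undirected graph, as lists of vertices."""
    adj = {v: set() for v in range(n_vertices)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in range(n_vertices):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:  # breadth-first search from `start`
            v = queue.popleft()
            comp.append(v)
            for w in adj[v] - seen:
                seen.add(w)
                queue.append(w)
        comps.append(comp)
    return comps


def finite_path_fraction(component_sizes):
    """Fraction of ordered vertex pairs joined by a finite path."""
    n = sum(component_sizes)
    return sum(c * (c - 1) for c in component_sizes) / (n * (n - 1))


def quantify(n_vertices, edges):
    """Degree k, sparsity S, finite-path fraction and isolated-vertex count."""
    sizes = [len(c) for c in components(n_vertices, edges)]
    n_edges = len(edges)
    return {
        "k": 2 * n_edges / n_vertices,
        "S": 2 * n_edges / (n_vertices * (n_vertices - 1)),
        "finite_paths": finite_path_fraction(sizes),
        "isolated": n_vertices - max(sizes),
    }


# Hypothetical toy graph: a path of three vertices plus a separate pair.
toy = quantify(5, [(0, 1), (1, 2), (3, 4)])

# Component sizes reported for dist≤1.6: 214 connected, 2 pairs, 93 orphans.
reported = finite_path_fraction([214, 2, 2] + [1] * 93)
print(toy, round(100 * reported, 1))
```

With the component sizes reported for the dist≤1.6 network, this reading gives about 47.3% finite paths, in line with the value listed for that network.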
Quantification of Similarity Networks
(* values recalculated after removing the isolated vertices)

Similarity network  N    Isolated vertices                                                S (%)  k      Finite paths (%)  N*   Isolated vertices*  S (%)*  k*     Finite paths (%)*
dist≤2.45           311  19 (all orphans)                                                 54.6   169.3  88.1              292  0                   62      180.4  100
dist≤2.1            311  34 (all orphans)                                                 34     105.4  79.3              277  0                   42.9    118.3  100
dist≤1.9            311  47 (all orphans)                                                 22.5   69.8   72                264  0                   31.3    82.3   100
dist≤1.6            311  97 (93 orphans and 2 isolated pairs)                             9.3    28.9   47.3              214  0                   19.7    41.9   100
Z>0                 296  2 (all orphans)                                                  36.3   107    98.7              294  0                   36.8    107.8  100
Z≥1                 296  9 (4 orphans and one group of 5)                                 16.3   48.1   94.7              287  0                   17.3    49.6   100
Z≥2                 296  38 (13 orphans and 25 in isolated groups with up to 13 members)  8.4    24.9   76.2              258  0                   10.9    28.0   100
Results
Interconnectivity keeps the fraction of finite paths high even in networks with low S and k
Consecutive similarity transitions connect even distant proteins in the networks
Interconnectivity is stronger at the structural level
The stricter the similarity criterion, the sparser the network and the more isolated vertices are found
Fewer isolated vertices have to be removed at the structural level to obtain fully connected networks
Protein Similarity Networks Adjacency Matrices
Adjacency matrices of two networks of almost the same sparsity, S ≈ 8-9%: (a) dist≤1.6, (b) Z≥2
Dense rectangles along the diagonal correspond to clustered proteins (classes and folds)
Clustering is more obvious at the structural level
High clustering within the α/β class is evident in both networks
Single dots far from the diagonal correspond to random edges between proteins of different classes or folds
Are Protein Similarity Networks SWNs?
Are Protein Similarity Networks SWNs?
We calculated L and C values for the protein similarity networks
  λij is the minimum path length between vertices i and j
  Ni is the number of neighbors of vertex i
  Ki is the number of edges among the neighbors of vertex i
The L and C values were compared with those of random and regular networks of the same S and N (Vendruscolo et al., 2002)
  Lrandom = ln(N)/ln(k)
  Crandom = k/N
  Lregular = N(N+k-2)/[2k(N-1)]
  Cregular = 3(k-2)/[4(k-1)]
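The exact expressions for L and C did not survive in this transcript. The sketch below assumes the usual forms that match the symbol definitions above: L as the mean of λij over all vertex pairs, and C as the mean over vertices of 2Ki/(Ni(Ni-1)). The reference formulas for random and regular networks are taken directly from the slide; the toy graph is hypothetical.

```python
import math
from collections import deque


def build_adj(n, edges):
    """Adjacency sets of an undirected graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def average_path_length(adj):
    """Mean minimum path length over all reachable ordered pairs."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search gives unweighted shortest paths
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def clustering_coefficient(adj):
    """Mean of 2*K_i/(N_i*(N_i-1)); vertices with fewer than 2 neighbors count as 0."""
    values = []
    for nbrs in adj.values():
        n_i = len(nbrs)
        if n_i < 2:
            values.append(0.0)
            continue
        k_i = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        values.append(2 * k_i / (n_i * (n_i - 1)))
    return sum(values) / len(values)


def reference_values(n, k):
    """L and C of random and regular networks with the same N and k."""
    return {
        "L_random": math.log(n) / math.log(k),
        "C_random": k / n,
        "L_regular": n * (n + k - 2) / (2 * k * (n - 1)),
        "C_regular": 3 * (k - 2) / (4 * (k - 1)),
    }


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3.
adj = build_adj(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
L = average_path_length(adj)
C = clustering_coefficient(adj)
ref = reference_values(214, 41.9)  # N* and k* of the dist≤1.6 network
print(L, C, ref)
```

Using N* = 214 and k* = 41.9 for the dist≤1.6 network, the reference formulas give Lregular ≈ 3.04, Cregular ≈ 73.2% and Crandom ≈ 19.6%.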
Results
Similarity networks have an intermediate L and a high C, and are therefore SWNs

Similarity network  L      C (%)  Lrandom  Crandom (%)  Lregular  Cregular (%)
dist≤1.6            2.414  72.6   1.446    19.6         3.045     73.2
Z≥2                 3.339  72.6   1.685    10.9         5.089     72.2
Interrelations between and within Folds and Classes
The level of interconnectivity between and within folds and classes is studied
Sequence and structural similarity networks of the same sparsity are compared:
  Sequence similarity network (dist≤1.6)
  Structural similarity network (Z≥2)
The information carried by sequence and structure is assessed in terms of fold and class discrimination
Index used: FAPE (%), the Fraction of All Possible Edges that occur
  FAPEfold: Ni is the number of proteins in fold i (1≤i≤27); Ei,j is the number of edges between folds i and j (1≤i,j≤27)
  FAPEclass: defined analogously over the four classes (1≤i,j≤4)
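The FAPE formula itself is missing from this transcript. A plausible reading of "fraction of all possible edges that occur" is Ei,i divided by Ni(Ni-1)/2 within a group and Ei,j divided by Ni·Nj between two groups; the sketch below implements that assumed form on invented protein identifiers and labels. The same function serves for folds or classes, depending on the labels passed in.

```python
from collections import Counter


def fape(edges, group_of):
    """FAPE (%) for every pair of groups (folds or classes).

    edges: iterable of (u, v) protein pairs; group_of: protein -> group label.
    Assumes N_i*(N_i-1)/2 possible edges within group i and N_i*N_j
    possible edges between groups i and j.
    """
    sizes = Counter(group_of.values())
    counts = Counter()
    for u, v in edges:
        counts[tuple(sorted((group_of[u], group_of[v])))] += 1
    labels = sorted(sizes)
    result = {}
    for idx, a in enumerate(labels):
        for b in labels[idx:]:
            if a == b:
                possible = sizes[a] * (sizes[a] - 1) // 2
            else:
                possible = sizes[a] * sizes[b]
            if possible:  # a group with a single member has no internal pairs
                result[(a, b)] = 100 * counts[(a, b)] / possible
    return result


# Hypothetical example: three proteins in fold 1, two in fold 2.
group_of = {"p1": 1, "p2": 1, "p3": 1, "p4": 2, "p5": 2}
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
values = fape(edges, group_of)
print(values)
```

Diagonal entries (i = j) measure how strongly a fold or class is connected to itself, while off-diagonal entries measure cross-talk between groups.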
Interrelations between and within Folds and Classes FAPE values
FAPEfold values (a) and FAPEclass values (b) for the sequence similarity network (dist≤1.6); FAPEfold values (c) and FAPEclass values (d) for the structural similarity network (Z≥2)
Structures autocorrelate very well at the fold and class level
Little structural similarity exists between different classes
Sequences autocorrelate less than structures at the fold level
There is more similarity between sequences of different folds than between structures of different folds
(confirmed by sign test for pairwise comparison, p<0.01)
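The slide does not say which values were paired for the sign test; one natural guess is the FAPE values of matching fold pairs in the two networks. Under that assumption, an exact two-sided sign test can be sketched as below, with made-up paired values.

```python
from math import comb


def sign_test(pairs):
    """Two-sided exact sign test on paired observations; ties are dropped."""
    plus = sum(1 for a, b in pairs if a > b)
    minus = sum(1 for a, b in pairs if a < b)
    n = plus + minus
    if n == 0:
        return 1.0
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(min(plus, minus) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired values: (sequence network, structural network).
paired = [(5.0, 1.0)] * 10 + [(2.0, 2.0)]
p_value = sign_test(paired)
print(p_value)
```

The test only uses the direction of each paired difference, so it makes no assumption about how the FAPE values are distributed.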
Interrelations between and within Folds and Classes FAPE values
FAPEfold values (a) and FAPEclass values (b) for the sequence similarity network (dist≤1.6); FAPEfold values (c) and FAPEclass values (d) for the structural similarity network (Z≥2)
Sequences include noise when used for fold and class assignment
The α/β class causes much of the interconnectivity within both sequences and structures
Some folds offer more interconnectivity than others and may dominate the network activity
Fold 27 (small inhibitors, toxins, lectins) is the most dissimilar, both internally and with respect to other folds
Graphical Representation of Protein Similarity Networks
Networks drawn with Graphviz (Gansner and North, 2000)
(a) protein sequence similarity network (dist≤1.6), (b) protein structural similarity network (Z≥2)
Light grey, dark grey, black and white spheres correspond to the α, β, α/β and α+β classes
The clustering of structures is very obvious
Even isolated structures are found in groups
Folds are connected directly or indirectly at the structural level
Similarity transition and some clustering appear within sequences as well
The sequence similarity network is harder to interpret
The SWN hosts the similarity transition during evolution
Isolated vertices are the result of serious alteration during evolution
Are there any hubs in the Protein Similarity Networks?
Betweenness (%) was measured for each vertex i, where σst(i) is 1 if the shortest path between vertices s and t passes through vertex i, and 0 otherwise
High betweenness characterizes the vertices that dominate the network activity
Betweenness values are plotted in descending order for all vertices in the protein sequence and structure similarity networks (solid line: structures; dashed line: sequences)
Only a few vertices dominate the network activity and function as hubs
Which are they, and what is their biological meaning? Let us see which folds they belong to
But first, let us construct the “fold” similarity networks
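The betweenness formula is not preserved in this transcript. The sketch below uses Brandes' algorithm, which is a swapped-in standard technique: when several shortest paths tie between s and t, each counts fractionally, whereas the slide's σst(i) is a 0/1 indicator. The normalization by all vertex pairs not involving i is also an assumption.

```python
from collections import deque


def betweenness_percent(adj):
    """Betweenness of every vertex, as a percentage of the ordered vertex
    pairs that exclude it (Brandes' algorithm, unweighted undirected graph)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # vertices in order of discovery
        preds = {v: [] for v in adj}    # shortest-path predecessors
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    pairs = (n - 1) * (n - 2)
    return {v: (100 * b / pairs if pairs else 0.0) for v, b in score.items()}


# Hypothetical toy graphs: a star (one obvious hub) and a 4-cycle (no hub).
star = betweenness_percent({0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}})
cycle = betweenness_percent({0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}})
print(star, cycle)
```

In the star, the center carries every shortest path between leaves and scores 100%, which is the hub-like behavior the slide describes; in the cycle, all vertices share the load equally.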
Fold Similarity Networks
Two folds are connected at the sequence or structural level when at least two vertices, one from each fold, satisfy the similarity criterion (dist≤1.6 or Z≥2, respectively)
(a) fold sequence similarity network (dist≤1.6), (b) fold structural similarity network (Z≥2)
Light grey, dark grey, black and white spheres correspond to the α, β, α/β and α+β classes
Clustering is more obvious at the structural level
The α/β class contributes more to the interconnectivity
Few folds are found totally isolated
Betweenness values of folds were computed and combined with the betweenness values of proteins
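Collapsing a protein network into a fold network can be sketched as follows, reading the rule as "two folds are linked when at least one protein edge joins a member of one fold to a member of the other". That reading, the protein identifiers and the fold labels are all assumptions made for illustration.

```python
def fold_network(protein_edges, fold_of):
    """Connect two folds when at least one protein edge links their members."""
    fold_edges = set()
    for u, v in protein_edges:
        fu, fv = fold_of[u], fold_of[v]
        if fu != fv:  # edges inside a fold do not create fold-level edges
            fold_edges.add((min(fu, fv), max(fu, fv)))
    return fold_edges


# Hypothetical proteins assigned to folds 1, 16 and 27.
fold_of = {"a": 1, "b": 1, "c": 16, "d": 16, "e": 27}
protein_edges = [("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")]
fold_edges = fold_network(protein_edges, fold_of)
print(fold_edges)
```

Fold 27 has no edge to any other fold in this toy input, so it ends up isolated in the fold network, like the few isolated folds mentioned on the slide.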
Betweenness Measurements in Similarity Networks
Sequence-based analysis: (A) fold betweenness (%), (B) appearances of the fold in the first 20% of protein vertices
Structure-based analysis: (C) fold betweenness (%), (D) appearances of the fold in the first 20% of protein vertices

No.  Fold                                  (A)    (B)  (C)    (D)
α class
 1   Globin-like                           7.38   5    13.85  5
 2   Cytochrome c                          0      0    0      1
 3   DNA-binding 3-helical bundle          0      0    0      4
 4   4-helical up-and-down bundle          5.85   1    0.92   2
 5   4-helical cytokines                   8.62   2    0      2
 6   Alpha; EF-hand                        0.62   0    0      2
β class
 7   Immunoglobulin-like β-sandwich        7.38   7    10.46  5
 8   Cupredoxins                           3.08   1    0      0
 9   Viral coat and capsid protein         10.46  9    0      1
10   ConA-like lectins/glucanases          0      4    0      1
11   SH3-like barrel                       0      0    0      0
12   OB-fold                               0      2    1.23   4
13   Trefoil                               0      0    0      0
14   Trypsin-like proteases                0      0    0      0
15   Lipocalins                            0      2    11.08  4
α/β class
16   (TIM)-barrel                          0.62   16   15.39  7
17   FAD (also NAD)-binding motif          0      0    11.39  6
18   Flavodoxin-like                       0      0    0      3
19   NAD(P)-binding Rossmann-fold          0      3    0      1
20   P-loop containing nucleotide          0      1    0.62   0
21   Thioredoxin-like                      0      1    0      0
22   Ribonuclease H-like motif             0      2    4.92   2
23   Hydrolases                            0      1    1.54   3
24   Periplasmic binding protein-like      0      2    0      1
α+β class
25   β-grasp                               0      0    0      1
26   Ferredoxin-like                       0      1    1.23   5
27   Small inhibitors, toxins, lectins     0      0    0      0
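The "appearance of fold in first 20% of protein vertices" columns can be read as counts of how many proteins of each fold fall among the top 20% of vertices ranked by betweenness. The sketch below follows that reading; the rounding of the cutoff (ceiling) and all sample values are assumptions.

```python
import math
from collections import Counter


def fold_counts_in_top(betweenness, fold_of, fraction=0.2):
    """Count how often each fold appears among the top-ranked vertices."""
    ranked = sorted(betweenness, key=betweenness.get, reverse=True)
    n_top = math.ceil(fraction * len(ranked))
    return Counter(fold_of[p] for p in ranked[:n_top])


# Hypothetical betweenness values (%) for ten proteins in three folds.
betweenness = {f"p{i}": float(10 - i) for i in range(10)}
fold_of = {f"p{i}": (16 if i < 3 else 1 if i < 6 else 27) for i in range(10)}
counts = fold_counts_in_top(betweenness, fold_of)
print(counts)
```

Folds absent from the returned counter have no member in the top 20%, which corresponds to a 0 entry in the table.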
Betweenness Measurements in Similarity Networks - Hubs
The α/β class contains more structural hubs than the other classes
• α/β proteins have many neighbors in the networks, which may have evolved from them. This agrees with the α/β class being the most ancestral one (Caetano-Anollés and Caetano-Anollés, 2003)
• α/β proteins mediate the similarity transition, e.g. within the β α/β α evolution pathway (Grishin, 2001)
Folds of high betweenness at the structural level (Globin-like, OB-fold, FAD (also NAD)-binding motif, Ferredoxin-like) have been reported as more ancestral (Caetano-Anollés and Caetano-Anollés, 2003)
Similar results on betweenness were obtained at the sequence level (α/β class, Globin-like fold)
Conclusions
High interconnectivity was found in both sequence and structural similarity networks
Similarity transition and interconnectivity are more obvious at the structural level and arise from evolution
Both networks were classified as SWNs, like other real-world systems
Hubs were found and related to the ancestrality of proteins/folds
Comparison of the protein sequence and structure similarity networks at the same sparsity showed commonalities:
  Interconnectivity due to the α/β class and certain folds
  Clustering in folds and classes
  Isolated folds at both levels
and differences:
  More clustering at the structural level
  More interrelation of different folds and classes at the sequence level
The source of noise in sequences when used for fold and class assignment is apparent
The information quality of sequence-derived features is still to be studied and optimized
Future Work
Assess the quality of each subset of the sequence-derived features used here, e.g. using graph similarity metrics between the two networks
Extend the pool of sequence-derived features and optimize the choice of features on a global network basis
Build protein similarity networks at the sequence and structure level that are as similar as possible
Proceed to fold and class assignment using the sequence information in the optimized sequence similarity network, e.g. using graph-based classifiers
Literature
Vendruscolo, M., Dokholyan, N.V., Paci, E. and Karplus, M. (2002) Small-world view of the amino acids that play a key role in protein folding. Phys. Rev. E, 65, 061910.
Barabasi, A.L. (2003) Linked: How everything is connected to everything else and what it means. Plume Books, New York.
Greene, L.H. and Higman, V.A. (2003) Uncovering network systems within protein structures. J. Mol. Biol., 334, 781-791.
Amitai, G. et al. (2004) Network analysis of protein structures identifies functional residues. J. Mol. Biol., 344, 1135-1144.
Camoglu, O., Can, T. and Singh, A.K. (2006) Integrating multi-attribute similarity networks for robust representation of the protein space. Bioinformatics, 22, 1585-1592.
Sun, Z.B., Zou, X.W., Guan, W. and Jin, Z.Z. (2006) The architectonic fold similarity network in protein fold space. Eur. Phys. J. B, 49, 127-134.
Ding, C.H.Q. and Dubchak, I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349-358.
Dubchak, I. et al. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401-407.
Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123-138.
Strogatz, S.H. (2001) Exploring complex networks. Nature, 410, 268-276.
Gansner, E. and North, S. (2000) An open graph visualization system and its applications to software engineering. Softw. Pract. Exper., 30, 1203-1233.
Getz, G., Starovolsky, A. and Domany, E. (2004) F2CS: FSSP to CATH and SCOP prediction server. Bioinformatics, 20, 2150-2152.
Caetano-Anollés, G. and Caetano-Anollés, D. (2003) An evolutionarily structured universe of protein architecture. Genome Res., 13, 1563-1571.
Grishin, N.V. (2001) Fold change in evolution of protein structures. J. Struct. Biol., 134, 167-185.
THANK YOU!!!!
QUESTIONS?
COMMENTS?
Visit www.bibe2008.org for details on the 8th IEEE International Conference on BioInformatics and BioEngineering (Athens, 8-10 October 2008)
(Paper submission deadline: 15/6/2008)