Protein Sequence- and Structure-based Similarity Networks

Department of Informatics and Communications, University of Athens, Athens, Greece, May 7th, 2008

Ioannis Valavanis1, George Spyrou2, Konstantina Nikita1

1Biomedical Simulations and Imaging Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Greece

2Biomedical Research Foundation of the Academy of Athens, Greece

Outline

Introduction

Construction of Similarity Networks

Analysis of Similarity Networks

Conclusions

Future Work

Networks and Real World Systems

Network: a set of vertices (elements of a system) and edges (interrelations between elements)

Each real-world system can be described by a network:

• Biological and chemical systems
• Social interacting species
• Computer networks and the Internet

Quantification of a network (N vertices, K edges) reveals information on the system:

• Network degree: k = 2K/N
• Sparsity: S = 2K/(N(N-1))
• Centrality measurements of each vertex (degree of a vertex, betweenness of a vertex)
• Randomness or regularity, based on the average minimum path length (L) and the clustering coefficient (C)

Most real-world systems behave like Small World Networks (SWNs) (Strogatz, 2001)

Few vertices (hubs) dominate the network activity in SWNs (Barabasi, 2003)

Some known SWNs:

• Film actors
• Peer-to-peer network
• Metabolic network
• Yeast protein interactions
• Functional cortical connectivity network

Network types:

a) Regular: great L and C
b) Small World Network: small or intermediate L and great C
c) Random: small L and C

Network-based description of Proteins

Folded proteins were transformed to networks (Vendruscolo et al., 2002; Greene and Higman, 2003)

• vertices: residues
• edges: given a distance criterion
• Protein network: SWN

Residue closeness in residue interaction graphs identified functional residues (Amitai et al., 2004)

Protein similarity networks were constructed, and the affinity of a protein within the network was used to classify it at the superfamily, family and fold level (Camoglu et al., 2006)

The structural similarity network of representative proteins of folds was shown to be a SWN (Sun et al., 2006)

Current Study

Quantify the structure of both protein sequence and structural similarity networks for several criteria

Compare protein sequence and structural similarity networks at the same sparsity

Find hubs in the protein sequence and structural similarity networks and assess their biological significance

Extend results to “fold” sequence and structural similarity networks

Assess the information level of sequence-derived features in comparison with structures in terms of fold and class assignment

Protein Sequence Similarity Network

A well-defined and widely used dataset of 311 sequences, organized at fold and class level (Ding and Dubchak, 2001)

• 27 popular folds
• 4 classes (α, β, α/β, α+β)

Each sequence is represented by 125 sequence-derived features (Dubchak et al., 1999):

• Amino acid composition (20 features)
• Predicted secondary structure (21 features)
• Hydrophobicity (21 features)
• Normalized van der Waals volumes (21 features)
• Polarity (21 features)
• Polarizability (21 features)

Each sequence is a vertex in the network

An edge occurs between 2 vertices when the Euclidean distance between their feature vectors does not exceed a threshold (dist ≤ 2.45, dist ≤ 2.1, dist ≤ 1.9, dist ≤ 1.6)
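The edge rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy 3-dimensional vectors are hypothetical, and whether the 125 features were scaled before computing distances is not stated in the slides.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_threshold_network(features, threshold):
    """Return the set of edges (i, j), i < j, whose distance is <= threshold."""
    edges = set()
    for i, j in combinations(range(len(features)), 2):
        if euclidean(features[i], features[j]) <= threshold:
            edges.add((i, j))
    return edges


# Toy vectors standing in for the 125-dimensional feature vectors (assumption).
toy_features = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 3.0, 0.0),
]
toy_edges = build_threshold_network(toy_features, threshold=1.6)
print(toy_edges)  # {(0, 1)}
```

Lowering the threshold can only remove edges, which is why the stricter criteria listed above produce sparser networks.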

No. | Fold | Nseq.
α class
1 | Globin-like | 13
2 | Cytochrome c | 7
3 | DNA-binding 3-helical bundle | 12
4 | 4-helical up-and-down bundle | 7
5 | 4-helical cytokines | 9
6 | Alpha; EF-hand | 6
β class
7 | Immunoglobulin-like β-sandwich | 30
8 | Cupredoxins | 9
9 | Viral coat and capsid protein | 16
10 | ConA-like lectins/glucanases | 7
11 | SH3-like barrel | 8
12 | OB-fold | 13
13 | Trefoil | 8
14 | Trypsin-like proteases | 9
15 | Lipocalins | 9
α/β class
16 | (TIM)-barrel | 29
17 | FAD (also NAD)-binding motif | 11
18 | Flavodoxin-like | 11
19 | NAD(P)-binding Rossmann-fold | 13
20 | P-loop containing nucleotide | 10
21 | Thioredoxin-like | 9
22 | Ribonuclease H-like motif | 10
23 | Hydrolases | 11
24 | Periplasmic binding protein-like | 11
α+β class
25 | β-grasp | 7
26 | Ferredoxin-like | 13
27 | Small inhibitors, toxins, lectins | 13

Protein Structural Similarity Network

Search of all proteins of the sequence data set in the Protein Data Bank (296 fully submitted structures found)

Each protein is represented by its 3-dimensional structure (3D coordinates of all atoms)

Structural similarity is given by the Z-score of the structural alignment computed with DALI (Holm and Park, 2000):

• Z > 0: some similarity is found
• Z ≥ 2: significant similarity is found (Sun et al., 2006; Getz et al., 2004)

Each protein structure is a vertex in the network

An edge occurs between 2 vertices when the Z-score of their structural similarity meets the criterion (Z > 0, Z ≥ 1, Z ≥ 2)
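As a rough sketch of this step, the network can be built from a table of pairwise Z-scores. The input format (one Z-score per unordered pair), the protein identifiers and the function name are assumptions made for illustration; the slides do not say how the DALI output was stored.

```python
def build_zscore_network(z_scores, threshold, strict=False):
    """Return edges (a, b), a < b, for pairs whose Z-score passes the criterion.

    z_scores maps a pair of protein identifiers to one Z-score (assumed input).
    strict=True applies Z > threshold, otherwise Z >= threshold.
    """
    edges = set()
    for (a, b), z in z_scores.items():
        if a == b:
            continue  # ignore self-comparisons
        passes = z > threshold if strict else z >= threshold
        if passes:
            edges.add((min(a, b), max(a, b)))
    return edges


# Hypothetical identifiers and scores, for illustration only.
toy_z = {("p1", "p2"): 5.3, ("p1", "p3"): 1.2, ("p2", "p3"): 0.0}

net_z_gt_0 = build_zscore_network(toy_z, 0, strict=True)  # criterion Z > 0
net_z_ge_2 = build_zscore_network(toy_z, 2)                # criterion Z >= 2
print(sorted(net_z_gt_0))  # [('p1', 'p2'), ('p1', 'p3')]
print(sorted(net_z_ge_2))  # [('p1', 'p2')]
```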

No. | Fold | Nstruc.
α class
1 | Globin-like | 13
2 | Cytochrome c | 6
3 | DNA-binding 3-helical bundle | 12
4 | 4-helical up-and-down bundle | 7
5 | 4-helical cytokines | 7
6 | Alpha; EF-hand | 6
β class
7 | Immunoglobulin-like β-sandwich | 27
8 | Cupredoxins | 9
9 | Viral coat and capsid protein | 16
10 | ConA-like lectins/glucanases | 6
11 | SH3-like barrel | 8
12 | OB-fold | 13
13 | Trefoil | 7
14 | Trypsin-like proteases | 8
15 | Lipocalins | 9
α/β class
16 | (TIM)-barrel | 28
17 | FAD (also NAD)-binding motif | 11
18 | Flavodoxin-like | 11
19 | NAD(P)-binding Rossmann-fold | 13
20 | P-loop containing nucleotide | 9
21 | Thioredoxin-like | 9
22 | Ribonuclease H-like motif | 9
23 | Hydrolases | 11
24 | Periplasmic binding protein-like | 11
α+β class
25 | β-grasp | 7
26 | Ferredoxin-like | 11
27 | Small inhibitors, toxins, lectins | 12

Quantification of Similarity Networks

For each network (N vertices, K edges) we calculated:

• Network degree: k = 2K/N
• Sparsity: S = 2K/(N(N-1))
• Fraction of all paths that are finite
• Number of isolated vertices

For each network, we obtained its fully connected version by removing the isolated vertices

The network parameters were then calculated once again
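The quantities listed above can be computed as in the following sketch. It assumes that "isolated vertices" means every vertex outside the largest connected component (orphans and small groups alike) and that the fraction of finite paths is the share of vertex pairs joined by some path; both are interpretations, and the function names are hypothetical.

```python
from collections import deque


def connected_components(n_vertices, edges):
    """Group vertices 0..n_vertices-1 into connected components using BFS."""
    adj = {v: set() for v in range(n_vertices)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in range(n_vertices):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components


def quantify(n_vertices, edges):
    """edges: set of unique undirected pairs (a, b)."""
    n_edges = len(edges)
    comps = connected_components(n_vertices, edges)
    # A pair of vertices has a finite path only if both lie in one component.
    finite_pairs = sum(len(c) * (len(c) - 1) // 2 for c in comps)
    all_pairs = n_vertices * (n_vertices - 1) // 2
    largest = max(len(c) for c in comps)
    return {
        "k": 2 * n_edges / n_vertices,
        "S": 2 * n_edges / (n_vertices * (n_vertices - 1)),
        "finite_paths": finite_pairs / all_pairs,
        "isolated": n_vertices - largest,  # assumption: outside largest component
    }


# Toy network: a triangle, an isolated pair and one orphan.
stats = quantify(6, {(0, 1), (1, 2), (0, 2), (3, 4)})
print(stats["isolated"])  # 3
```

Removing the isolated vertices and calling `quantify` again on the largest component (with its vertices renumbered from 0) gives the values for the fully connected version.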

Quantification of Similarity Networks

Similarity network | N | Isolated vertices | S (%) | k | Finite paths (%) | N* | Isolated vertices* | S (%)* | k* | Finite paths (%)*
dist≤2.45 | 311 | 19 (all orphans) | 54.6 | 169.3 | 88.1 | 292 | 0 | 62 | 180.4 | 100
dist≤2.1 | 311 | 34 (all orphans) | 34 | 105.4 | 79.3 | 277 | 0 | 42.9 | 118.3 | 100
dist≤1.9 | 311 | 47 (all orphans) | 22.5 | 69.8 | 72 | 264 | 0 | 31.3 | 82.3 | 100
dist≤1.6 | 311 | 97 (93 orphans and 2 isolated pairs) | 9.3 | 28.9 | 47.3 | 214 | 0 | 19.7 | 41.9 | 100
Z>0 | 296 | 2 (all orphans) | 36.3 | 107 | 98.7 | 294 | 0 | 36.8 | 107.8 | 100
Z≥1 | 296 | 9 (4 orphans and one group of 5) | 16.3 | 48.1 | 94.7 | 287 | 0 | 17.3 | 49.6 | 100
Z≥2 | 296 | 38 (13 orphans and 25 in isolated groups with up to 13 members) | 8.4 | 24.9 | 76.2 | 258 | 0 | 10.9 | 28.0 | 100

* Values for the fully connected version of the network, after removal of isolated vertices

Results

Interconnectivity keeps the fraction of finite paths mostly high, even in networks with low S and k

Consecutive similarity transitions connect even distant proteins in the networks

Interconnectivity is found more at the structural level

The stricter the similarity criterion, the sparser the network and the more isolated vertices are found

Fewer isolated vertices have to be removed at the structural level to obtain fully connected networks

Protein Similarity Networks Adjacency Matrices

Figure: adjacency matrices of 2 networks of almost the same sparsity (S ≈ 8-9%): (a) dist≤1.6, (b) Z≥2

Dense rectangles along the diagonal correspond to clustered proteins (classes and folds)

Clustering is more obvious at the structural level

High clustering within the α/β class is obvious in both networks

Single dots far from the diagonal correspond to random edges between proteins of different classes or folds

Are Protein Similarity Networks SWNs?

Are Protein Similarity Networks SWNs?

We calculated L and C values for protein similarity networks

λij is the length of the minimum path between vertices i and j

Ni is the number of neighbors of vertex i

Ki is the number of edges among neighbors of vertex i
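The expressions for L and C themselves are not reproduced in the text. A standard formulation consistent with the symbols defined above, given here as an assumption rather than as the authors' exact notation, is:

```latex
L = \frac{1}{N(N-1)} \sum_{i \neq j} \lambda_{ij},
\qquad
C = \frac{1}{N} \sum_{i=1}^{N} \frac{2K_i}{N_i (N_i - 1)}
```

Here N is the number of vertices, the sum for L runs over all ordered pairs of distinct vertices, and each term of C is the fraction of possible edges among the neighbors of vertex i that actually occur.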

L and C values were compared with the L and C values of random and regular networks of the same S and N (Vendruscolo et al., 2002):

Lrandom = ln(N)/ln(k)
Crandom = k/N
Lregular = N(N+k-2)/[2k(N-1)]
Cregular = 3(k-2)/[4(k-1)]
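The four reference formulas are easy to evaluate directly. The sketch below plugs in N = 214 and k = 41.9, the values reported for the fully connected dist≤1.6 network; that these were the inputs used for the reference values is an assumption. The `looks_small_world` check is a purely illustrative heuristic, not a criterion taken from the slides.

```python
import math


def reference_values(n, k):
    """Reference L and C for random and regular networks with n vertices, degree k."""
    return {
        "L_random": math.log(n) / math.log(k),
        "C_random": k / n,
        "L_regular": n * (n + k - 2) / (2 * k * (n - 1)),
        "C_regular": 3 * (k - 2) / (4 * (k - 1)),
    }


def looks_small_world(path_length, clustering, ref):
    """Illustrative heuristic (assumption): L lies between the random and regular
    references, and C is closer to the regular reference than to the random one."""
    l_between = ref["L_random"] <= path_length <= ref["L_regular"]
    c_high = abs(clustering - ref["C_regular"]) < abs(clustering - ref["C_random"])
    return l_between and c_high


ref = reference_values(214, 41.9)
print({name: round(value, 3) for name, value in ref.items()})
print(looks_small_world(2.414, 0.726, ref))  # True
```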

Results

Similarity networks have an intermediate L and a great C, and are therefore SWNs

Similarity network | L | C | Lrandom | Crandom | Lregular | Cregular
dist≤1.6 | 2.414 | 72.6% | 1.446 | 19.6% | 3.045 | 73.2%
Z≥2 | 3.339 | 72.6% | 1.685 | 10.9% | 5.089 | 72.2%

Interrelations between and within Folds and Classes

The level of interconnectivity between and within folds and classes is studied

Comparison between sequence and structural similarity networks of the same sparsity:

• Sequence similarity network (dist≤1.6)
• Structural similarity network (Z≥2)

The information of sequence and structure is assessed in terms of fold and class discrimination

Index used: FAPE (%), the Fraction of All Possible Edges that occur

• At fold level: Ni is the number of proteins in fold i (1≤i≤27) and Ei,j is the number of edges between folds i and j (1≤i,j≤27)
• At class level: the same quantities are defined over classes (1≤i,j≤4)
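The FAPE formula itself is not reproduced in the text. One plausible reading of "fraction of all possible edges that occur", assumed here, is Ei,j/(Ni·Nj) between two different groups and Ei,i/(Ni(Ni-1)/2) within one group, expressed as a percentage. The sketch below follows that reading; the function name and toy data are hypothetical.

```python
from collections import Counter


def fape(groups, edges):
    """FAPE (%) for every pair of groups, under the assumed formula.

    groups[v] is the fold (or class) label of vertex v;
    edges is a set of unique undirected pairs (a, b).
    """
    sizes = Counter(groups)
    edge_counts = Counter()
    for a, b in edges:
        key = tuple(sorted((groups[a], groups[b])))
        edge_counts[key] += 1
    labels = sorted(sizes)
    result = {}
    for pos, gi in enumerate(labels):
        for gj in labels[pos:]:
            if gi == gj:
                possible = sizes[gi] * (sizes[gi] - 1) // 2  # pairs within a group
            else:
                possible = sizes[gi] * sizes[gj]             # pairs across groups
            observed = edge_counts[(gi, gj)]
            result[(gi, gj)] = 100.0 * observed / possible if possible else 0.0
    return result


# Toy example: three proteins in fold 1, two in fold 2.
toy_groups = [1, 1, 1, 2, 2]
toy_fape = fape(toy_groups, {(0, 1), (1, 2), (2, 3)})
print({pair: round(value, 1) for pair, value in toy_fape.items()})
# {(1, 1): 66.7, (1, 2): 16.7, (2, 2): 0.0}
```

The same function serves at class level by passing class labels instead of fold labels.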

Interrelations between and within Folds and Classes FAPE values

Figure: FAPEfold values (a) and FAPEclass values (b) for the sequence similarity network (dist≤1.6); FAPEfold values (c) and FAPEclass values (d) for the structure similarity network (Z≥2)

• Structures autocorrelate very well at fold and class level
• Little structural similarity exists between different classes
• Sequences autocorrelate less than structures at fold level
• There is more similarity between sequences of different folds than between structures of different folds

(confirmed by sign test for pairwise comparison, p<0.01)

Interrelations between and within Folds and Classes: FAPE values (continued)

• Sequences include noise when used for fold and class assignment
• Class α/β causes much interconnectivity within sequences and structures
• Some folds offer more interconnectivity than others and may dominate the network activity
• Fold 27 (small inhibitors, toxins, lectins) shows the greatest dissimilarity, both internally and with other folds

Graphical Representation of Protein Similarity Networks

Networks drawn with Graphviz (Gansner and North, 2000)

Figure: (a) protein sequence similarity network (dist≤1.6), (b) protein structural similarity network (Z≥2). Light grey, dark grey, black and white spheres correspond to the α, β, α/β and α+β classes

• The clustering of structures is very obvious
• Even isolated structures are found in groups
• Folds are connected directly or indirectly at the structural level
• Similarity transition and some clustering appear within sequences as well
• The sequence similarity network is harder to interpret

The SWN hosts the similarity transition during evolution

Isolated vertices are the result of serious alteration during evolution

Are there any hubs in the Protein Similarity Networks?

Betweenness (%) was measured for each vertex i, where σst(i) is 1 if the shortest path between vertices s and t passes through vertex i, and 0 otherwise

High betweenness characterizes the vertices that dominate the network activity
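The betweenness formula is not reproduced in the text, so the following is only a sketch under stated assumptions: σst(i) is taken to be 1 when vertex i lies on at least one shortest path between s and t, and the count is normalized by the number of vertex pairs that exclude i, (N-1)(N-2)/2. The function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def betweenness_percent(adj):
    """Binary betweenness (%) per vertex of an undirected graph with >= 3 vertices."""
    vertices = list(adj)
    n = len(vertices)
    dist = {v: bfs_distances(adj, v) for v in vertices}
    pairs_per_vertex = (n - 1) * (n - 2) / 2  # assumed normalization
    scores = {}
    for i in vertices:
        count = 0
        for pos, s in enumerate(vertices):
            for t in vertices[pos + 1:]:
                if i in (s, t) or t not in dist[s]:
                    continue  # skip pairs containing i and disconnected pairs
                on_path = (i in dist[s] and t in dist[i]
                           and dist[s][i] + dist[i][t] == dist[s][t])
                if on_path:
                    count += 1
        scores[i] = 100.0 * count / pairs_per_vertex
    return scores


# Toy path graph 0 - 1 - 2 - 3: the two inner vertices carry all the traffic.
toy_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_scores = betweenness_percent(toy_adj)
print({v: round(score, 1) for v, score in toy_scores.items()})
# {0: 0.0, 1: 66.7, 2: 66.7, 3: 0.0}
```

Sorting the resulting scores in descending order gives the kind of ranked curve in which only a few vertices stand out as hubs.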

Figure: betweenness values plotted in descending order for all vertices in the protein sequence and structure similarity networks (solid line: structures; dashed line: sequences)

Only a few vertices dominate the network activity and function as hubs.

Which vertices are they, and what is their biological meaning? Let's see which folds they belong to.

But first, let us construct the "fold" similarity networks.

Fold Similarity Networks

Two folds are connected at the sequence or structural level when at least two vertices, one from each fold, satisfy the similarity criterion (dist≤1.6 or Z≥2)

Figure: (a) fold sequence similarity network (dist≤1.6), (b) fold structural similarity network (Z≥2). Light grey, dark grey, black and white spheres correspond to the α, β, α/β and α+β classes

• Clustering is more obvious at the structural level
• The α/β class contributes more to the interconnectivity
• Few folds are found totally isolated

The betweenness of folds was computed and combined with the betweenness of proteins
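Collapsing a protein network into a fold network can be sketched as follows. The sketch assumes that a single edge between two proteins of different folds is enough to connect those folds; the function name, the `fold_of` mapping and the toy data are hypothetical.

```python
def build_fold_network(fold_of, protein_edges):
    """Collapse a protein similarity network into a fold similarity network.

    fold_of maps each protein vertex to its fold number;
    protein_edges is a set of undirected protein pairs (a, b).
    """
    fold_edges = set()
    for a, b in protein_edges:
        fa, fb = fold_of[a], fold_of[b]
        if fa != fb:  # edges inside one fold do not link two folds
            fold_edges.add((min(fa, fb), max(fa, fb)))
    return fold_edges


# Toy data: proteins 0 and 1 in fold 1, protein 2 in fold 16, protein 3 in fold 27.
toy_fold_of = {0: 1, 1: 1, 2: 16, 3: 27}
toy_fold_edges = build_fold_network(toy_fold_of, {(0, 1), (1, 2)})
print(toy_fold_edges)  # {(1, 16)}
```

In this toy example fold 27 has no edge to any other fold, which corresponds to a totally isolated fold in the fold network.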

Betweenness Measurements in Similarity Networks

Columns 3-4: sequence-based analysis. Columns 5-6: structure-based analysis.

No. | Fold | Fold betweenness (%) | Appearance of fold in first 20% of protein vertices | Fold betweenness (%) | Appearance of fold in first 20% of protein vertices
α class
1 | Globin-like | 7.38 | 5 | 13.85 | 5
2 | Cytochrome c | 0 | 0 | 0 | 1
3 | DNA-binding 3-helical bundle | 0 | 0 | 0 | 4
4 | 4-helical up-and-down bundle | 5.85 | 1 | 0.92 | 2
5 | 4-helical cytokines | 8.62 | 2 | 0 | 2
6 | Alpha; EF-hand | 0.62 | 0 | 0 | 2
β class
7 | Immunoglobulin-like β-sandwich | 7.38 | 7 | 10.46 | 5
8 | Cupredoxins | 3.08 | 1 | 0 | 0
9 | Viral coat and capsid protein | 10.46 | 9 | 0 | 1
10 | ConA-like lectins/glucanases | 0 | 4 | 0 | 1
11 | SH3-like barrel | 0 | 0 | 0 | 0
12 | OB-fold | 0 | 2 | 1.23 | 4
13 | Trefoil | 0 | 0 | 0 | 0
14 | Trypsin-like proteases | 0 | 0 | 0 | 0
15 | Lipocalins | 0 | 2 | 11.08 | 4
α/β class
16 | (TIM)-barrel | 0.62 | 16 | 15.39 | 7
17 | FAD (also NAD)-binding motif | 0 | 0 | 11.39 | 6
18 | Flavodoxin-like | 0 | 0 | 0 | 3
19 | NAD(P)-binding Rossmann-fold | 0 | 3 | 0 | 1
20 | P-loop containing nucleotide | 0 | 1 | 0.62 | 0
21 | Thioredoxin-like | 0 | 1 | 0 | 0
22 | Ribonuclease H-like motif | 0 | 2 | 4.92 | 2
23 | Hydrolases | 0 | 1 | 1.54 | 3
24 | Periplasmic binding protein-like | 0 | 2 | 0 | 1
α+β class
25 | β-grasp | 0 | 0 | 0 | 1
26 | Ferredoxin-like | 0 | 1 | 1.23 | 5
27 | Small inhibitors, toxins, lectins | 0 | 0 | 0 | 0

Betweenness Measurements in Similarity Networks - Hubs

The α/β class contains more structural hubs than the other classes

• α/β proteins have many neighbors in the networks that may have evolved from them. This is in accordance with the α/β class being the most ancestral one (Caetano-Anollés et al., 2003)
• α/β proteins mediate the similarity transition, e.g. within the β–α/β–α evolution pathway (Grishin, 2001)

Folds of high betweenness (Globin-like, OB-fold, FAD (also NAD)-binding motif, Ferredoxin-like) at the structural level have been reported as more ancestral (Caetano-Anollés et al., 2003)

Similar results on betweenness were obtained at the sequence level (α/β class, Globin-like fold)

Conclusions

High interconnectivity was found in both sequence and structural similarity networks

Similarity transition and interconnectivity are more obvious at the structural level and appear due to evolution

Both networks were classified as SWNs, like other real-world systems

Hubs were found and related to the ancestrality of proteins/folds

Comparison of the protein sequence and structure similarity networks at the same sparsity showed commonalities:

• Interconnectivity due to the α/β class and certain folds
• Clustering in folds and classes
• Isolated folds at both levels

and differences:

• More clustering at the structural level
• More interrelation of different folds and classes at the sequence level

The source of noise in sequences when used for fold and class assignment is obvious

The information quality of sequence-derived features is still to be studied and optimized

Future Work

Assess the quality of each subset of the sequence-derived features used here, e.g. using graph similarity metrics between the two networks

Extend the pool of sequence-derived features and optimize the feature selection on a global network basis

Build sequence and structure similarity networks that are as similar to each other as possible

Proceed to fold and class assignment using the sequence information in the optimized sequence similarity network, e.g. using graph-based classifiers

Literature

Vendruscolo,M., Dokholyan,N.V., Paci,E. and Karplus,M. (2002) Small-world view of the amino acids that play a key role in protein folding. Phys. Rev. E, 65, 061910.

Barabasi,A.L. (2003) Linked: How everything is connected to everything else and what it means. Plume Books, New York.

Greene,L.H. and Higman,V.A. (2003) Uncovering network systems within protein structures. J. Mol. Biol., 334, 781-791.

Amitai,G. et al. (2004) Network analysis of protein structures identifies functional residues. J. Mol. Biol., 344, 1135-1144.

Camoglu,O., Can,T. and Singh,A.K. (2006) Integrating multi-attribute similarity networks for robust representation of the protein space. Bioinformatics, 22, 1585-1592.

Sun,Z.B., Zou,X.W., Guan,W. and Jin,Z.Z. (2006) The architectonic fold similarity network in protein fold space. Eur. Phys. J. B, 49, 127-134.

Ding,C.H.Q. and Dubchak,I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349-358.

Dubchak,I. et al. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401-407.

Holm,L. and Sander,C. (1994) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123-138.

Strogatz,S.H. (2001) Exploring complex networks. Nature, 410, 268-276.

Gansner,E. and North,S. (2000) An open graph visualization system and its applications to software engineering. Softw. Pract. Exper., 30, 1203-1233.

Getz,G., Starovolsky,A. and Domany,E. (2004) F2CS: FSSP to CATH and SCOP prediction server. Bioinformatics, 20, 2150-2152.

Caetano-Anollés,G. and Caetano-Anollés,D. (2003) An evolutionarily structured universe of protein architecture. Genome Res., 13, 1563-1571.

Grishin,N.V. (2001) Fold change in evolution of protein structures. J. Struct. Biol., 134, 167-185.

THANK YOU!!!!

QUESTIONS?

COMMENTS?

Visit www.bibe2008.org for details on the 8th IEEE International Conference on BioInformatics and BioEngineering (Athens, 8-10 October 2008)

(Paper submission deadline: 15/6/2008)
