graph layout in cellular networks


Upload: kiayada-blanchard

Post on 03-Jan-2016



TRANSCRIPT

Page 1: Graph Layout in Cellular Networks

9. Lecture WS 2004/05

Bioinformatics III 1

Graph Layout in Cellular Networks

www.cytoscape.org

Page 2: Graph Layout in Cellular Networks


Task: visualize cellular interaction data

e.g.

- protein interaction data (undirected): nodes – proteins; edges – interactions

- metabolic pathways (directed): nodes – substances; edges – reactions

- regulatory networks (directed): nodes – transcription factors + regulated proteins; edges – regulatory interactions

- co-localization (undirected): nodes – proteins; edges – co-localization information

- homology (undirected/directed): nodes – proteins; edges – sequence similarity (BLAST score)

Page 3: Graph Layout in Cellular Networks


Visualisation: intuitive approach to understand graphs

http://www.it.usyd.edu.au/~aquigley/3dfade/

Graph-like structures are pervasive:

- route maps of airline companies

- infrastructure of computer networks

- the relationships between people who work in the same company, etc.

- cellular interactions ...

One way to understand the information coded in these graphs is to draw

graphical representations of them. Since drawing by hand is tedious and

error-prone, it is natural to expect computers to draw graphs automatically,

assigning spatial coordinates to nodes and connecting them with edges.

Graphs, such as the flight route maps, are not hard to draw since the

precise locations of the nodes (cities) are already given.

For other graphs, such information is not available and computers need to

determine where to plot the nodes and how to draw the edges that connect

the nodes.

Page 4: Graph Layout in Cellular Networks


Force-directed algorithm for graph layout

http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

Various graph layout algorithms have been

developed to solve this visualisation task.

In 1984, Peter Eades proposed a graph layout heuristic [A heuristic for graph drawing. Congressus Numerantium, 42:149-160, 1984], known as the "Spring Embedder" algorithm.

Edges are replaced by springs and vertices are replaced by rings that connect the springs. A layout is found by simulating the dynamics of this physical system until it settles.

This method, and other methods that compute the layout with similar simulations, are called "force-directed" algorithms.
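A minimal sketch of a spring-embedder style layout in Python may make the simulation concrete. It assumes the commonly quoted form of the forces (a logarithmic spring `c1 * log(d / c2)` between adjacent nodes and a `c3 / d**2` repulsion between non-adjacent nodes); the function name, the default constants and the 2D setting are illustrative assumptions, not taken from the slides.

```python
import math
import random


def spring_embedder(nodes, edges, pos=None, iterations=100,
                    c1=2.0, c2=1.0, c3=1.0, c4=0.1, seed=0):
    """Return a dict node -> (x, y) after a simple force simulation.

    nodes: list of node labels; edges: iterable of (u, v) pairs.
    The constants c1..c4 are assumed defaults, not values from the lecture.
    """
    rng = random.Random(seed)
    if pos is None:
        pos = {v: (rng.random(), rng.random()) for v in nodes}
    else:
        pos = dict(pos)
    adjacent = {frozenset(e) for e in edges}
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d = max(math.hypot(dx, dy), 1e-9)
                if frozenset((u, v)) in adjacent:
                    f = c1 * math.log(d / c2)  # positive: pulls u towards v
                else:
                    f = -c3 / d ** 2           # negative: pushes u away from v
                fx, fy = f * dx / d, f * dy / d
                force[u][0] += fx
                force[u][1] += fy
                force[v][0] -= fx
                force[v][1] -= fy
        # Move every node a small step (c4) along its net force.
        pos = {v: (pos[v][0] + c4 * force[v][0],
                   pos[v][1] + c4 * force[v][1]) for v in nodes}
    return pos


layout = spring_embedder(["a", "b", "c", "d"],
                         [("a", "b"), ("b", "c"), ("c", "d")])
```

Each iteration visits every pair of nodes, which is the quadratic cost that limits this family of methods on large graphs.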

Page 5: Graph Layout in Cellular Networks


Force-directed algorithm

http://www.it.usyd.edu.au/~aquigley/3dfade/

The edges can be modeled as gravitational (or electrostatic) attraction and

all nodes have an electrical repulsion between them.

It is also possible for the system to simulate unnatural forces acting on the

bodies, which have no direct physical analogy, for example the use of a

logarithmic distance measure rather than Euclidean.

Page 6: Graph Layout in Cellular Networks


Force-directed algorithm

http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

Because of the underlying analogy to a physical system, the force directed graph

layout methods tend to meet various aesthetic standards, such as

- efficient space filling,

- uniform edge length (when equal weights and repulsions are used)

- symmetry and the

- capability of rendering the layout process with smooth animation (visual

continuity).

Having these nice features, the force-directed graph layout has become the "workhorse" of layout algorithms.

It has been successfully adapted to many domains with variations of

implementation.

Page 7: Graph Layout in Cellular Networks


Scaling

http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

Force-directed layout methods commonly scale poorly with graph size.

When the graph has more than a few thousand vertices, the running time of the layout computation can become unacceptable.

The reason is that in each step of the simulation the repulsive force between each pair of unconnected vertices must be computed, at a cost of $O(0.5\,V^2 - E)$, i.e. quadratic in the number of vertices.

Here V is the number of vertices and E is the number of edges in the graph.

This complexity is hard to escape for general graphs without hierarchical structure.
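To get a feeling for the numbers, the pair count can be evaluated directly. The sketch below uses the exact count V(V-1)/2 - E of unconnected pairs; the function name and the assumption of two edges per vertex are purely illustrative.

```python
def repulsive_pairs(num_vertices, num_edges):
    """Number of unconnected vertex pairs: V(V-1)/2 - E."""
    return num_vertices * (num_vertices - 1) // 2 - num_edges


# Hypothetical sparse graphs with E = 2V edges.
for v in (1_000, 5_000, 25_000):
    print(v, repulsive_pairs(v, 2 * v))
```

A fivefold increase in the number of vertices raises the work per simulation step roughly 25-fold.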

Page 8: Graph Layout in Cellular Networks


Protein interaction graphs

Ju et al. Bioinformatics 19, 317 (2003)

Most protein interaction data have the following characteristics:

(1) When visualized as a graph, the data yields a disconnected graph with many

connected components

(2) The data yields a nonplanar graph with a large number of edge crossings that

cannot be removed in a 2D drawing

(3) The number of interactions per protein varies widely within the same data set, as described by the degree distribution p(k)

(4) The data often contain self-interactions, which correspond to self loops

These characteristics demand a robust layout algorithm.

Page 9: Graph Layout in Cellular Networks


InterViewer: Example of force-directed layout algorithm

Ju et al. Bioinformatics 19, 317 (2003)

InterViewer does not place the initial nodes randomly, but on the surface of a sphere. It runs a fixed number of iterations.

The original algorithm has complexity $O(N^2)$ per time step, where N is the number of nodes. With multipole methods, this can be reduced to $O(N \log N)$.

Time may also be saved by introducing a cut-off, e.g. by computing only the interactions with nodes in neighboring cells and updating the neighbor list infrequently.

Page 10: Graph Layout in Cellular Networks


Application for protein interaction graphs

Ju et al. Bioinformatics 19, 317 (2003)

Visualisation of the MIPS interaction data.

In 3D, this graph contains no edge crossings.

Page 11: Graph Layout in Cellular Networks


Aim: analyze and visualize homologies across the protein universe :-)

50 genomes → 145,579 proteins → $21 \times 10^9$ BLASTP pairwise sequence comparisons.

Expectation: fusion proteins ("Rosetta Stone proteins") will link proteins of related function.

This requires visualizing an extremely large network, so a stepwise scheme is developed.

Page 12: Graph Layout in Cellular Networks


LGL

Adai et al. J. Mol. Biol. 340, 179 (2004)

(1) separate original network into connected sets

(2) generate coordinates for each node in each connected set

(using a force-directed layout algorithm and a recipe for the sequential layout of nodes guided by a minimum spanning tree of the network).

(3) integrate connected sets into one coordinate system via a funnel process:

the connected sets are sorted in descending size by the number of vertices.

The first connected set is placed at the bottom of a potential funnel and other

sets are placed one at a time on the rim of the potential funnel and allowed to

fall towards the bottom where they are frozen in space upon collision with the

previous sets.
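Step (1) and the size ordering used in step (3) can be sketched with a breadth-first search. This is a generic illustration, not the LGL implementation; the function name and the edge-list input format are assumptions.

```python
from collections import defaultdict, deque


def connected_sets(edges, nodes=()):
    """Return the connected sets of an undirected graph, largest first.

    edges: iterable of (u, v) pairs; nodes: optional extra isolated nodes.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for n in nodes:
        adj.setdefault(n, set())
    seen, components = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    # Descending size, as required for the funnel placement.
    components.sort(key=len, reverse=True)
    return components


sets_by_size = connected_sets([(1, 2), (2, 3), (4, 5)], nodes=[6])
```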

We concentrate on step (2) in the following

Page 13: Graph Layout in Cellular Networks


Minimum Spanning Tree

Given: an undirected graph G = (V, E) where each edge (u,v) ∈ E has a weight w(u,v) specifying the cost to connect u and v.

Find an acyclic subset T ⊆ E that connects all of the nodes and whose total weight

$$w(T) = \sum_{(u,v) \in T} w(u,v)$$

is minimized.

Popular algorithms are those of Kruskal and Prim. Both are greedy algorithms that make the best choice available at each step. Greedy strategies in general carry no guarantee of finding the globally best solution, but for the minimum spanning tree problem both algorithms provably find one.

[Cormen]

Page 14: Graph Layout in Cellular Networks


Kruskal’s algorithm

Consider edges in sorted order by weight.

The arrow points to the edge under consideration at each step.

[Cormen]

Page 15: Graph Layout in Cellular Networks


Kruskal’s algorithm (II)

Running time O(E log V)

[Cormen]
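Since the figures are not reproduced in this transcript, a compact sketch of Kruskal's algorithm may help. It assumes nodes are numbered 0..n-1 and edges are given as (weight, u, v) tuples; the disjoint-set structure with path halving is one common way to test whether an edge would close a cycle.

```python
def kruskal(num_nodes, weighted_edges):
    """Return the edges (u, v, w) of a minimum spanning forest.

    weighted_edges: iterable of (w, u, v) with 0 <= u, v < num_nodes.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the set representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(weighted_edges):  # consider edges by weight
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two different trees
            parent[ru] = rv
            tree.append((u, v, w))
    return tree


mst = kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)])
```

In this example the edge of weight 3 is skipped because nodes 0 and 2 are already connected through node 1.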

Page 16: Graph Layout in Cellular Networks


Intuitive description of LGL

Adai et al. J. Mol. Biol. 340, 179 (2004)

Successive iterations of the layout. The MST determines the order of placement of

the nodes. The root node could be chosen randomly or based on its centrality in the

network (e.g. minimizing the sum of distances to all other nodes). All other nodes

are assigned a level according to their edge-based distance in the MST from the

root node.

Level one vertices (red circles) are placed randomly on a sphere around the root

node (black circle). The system is allowed to iterate through time satisfying attractive

and repulsive forces until at rest.

Level two nodes (blue circles) are placed randomly on spheres directed away from

the current layout. Again, the system is allowed to evolve through time till at rest.

This process is iterated for the entire graph.
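The level assignment described above amounts to a breadth-first traversal of the MST from the root. A minimal sketch, assuming the tree is given as a list of (u, v) edges; the function name is hypothetical and the force iterations between levels are omitted.

```python
from collections import defaultdict, deque


def placement_levels(tree_edges, root):
    """Group nodes by their edge-based distance from the root in the MST."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    by_level = defaultdict(list)
    for node, depth in level.items():
        by_level[depth].append(node)
    # Level 0 is the root, level 1 is placed first around it, and so on.
    return [by_level[d] for d in range(len(by_level))]


levels = placement_levels([("r", "a"), ("r", "b"), ("a", "c")], "r")
```

A layout driver would then add one level at a time and relax the forces before adding the next.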

Page 17: Graph Layout in Cellular Networks


What is the role of fusion proteins?

Adai et al. J. Mol. Biol. 340, 179 (2004)

A protein homology map summarizes the results of billions of sequence comparisons by modeling

the proteins as vertices in a network, and the statistically significant sequence similarities as edges

connecting the relevant proteins. In this manner, proteins within a sequence family (such as A, A′, A″, and AB; or B, B′ and AB) are all or mostly connected to each other, forming a cluster in the map.

Fusion proteins (such as AB) serve to connect their component proteins' families. The structure of

the resulting map reflects historic genetic events, such as gene fusions, fissions, and duplications,

which are responsible for producing the modern-day genes. The map simultaneously represents

homology relationships (edges), remote homologies (proteins not directly connected but in the same

cluster), and non-homologous functional relationships (adjacent clusters and clusters linked by

fusion proteins).

Page 18: Graph Layout in Cellular Networks


LGL Algorithm for very large biological networks

Adai et al. J. Mol. Biol. 340, 179 (2004)

The complete protein homology map. A layout of the entire protein homology

map; a total of 11,516 connected sets containing 111,604 proteins (vertices)

with 1,912,684 edges. The largest connected set is shown more clearly in the

inset and is enlarged further on the right side.

Page 19: Graph Layout in Cellular Networks


Map of gene function

Adai et al. J. Mol. Biol. 340, 179 (2004)

The map emerges from ~21 billion gene sequence

comparisons. Proteins are drawn as points, with

lines connecting proteins with similar sequences,

and are arranged so that homologous proteins

are adjacent in the Figure.

The size of each cluster is proportional to the

number of proteins in that sequence family.

Fusion proteins force their component proteins'

respective families to be close together in the

Figure, and thereby serve to organize the

proteins in the map according to their functions.

The resulting broad trends of protein function are

labeled, as are several of the most extensive

sequence families. A–C indicate specific regions

that are magnified later.

Only the largest connected network

component is drawn, containing 30,727

proteins (vertices) and 1,206,654

significant sequence similarities (edges),

and representing ~4 billion sequence

comparisons.

Page 20: Graph Layout in Cellular Networks


Functionally related gene families form adjacent clusters

Adai et al. J. Mol. Biol. 340, 179 (2004)

Three examples illustrate spatial

localization of protein function in the map,

specifically

A, the linkage of the tryptophan synthase α family to the functionally coupled but non-homologous β family by the yeast tryptophan synthase αβ fusion protein,

B, protein subunits of the pyruvate synthase and alpha-ketoglutarate ferredoxin oxidoreductase complexes,

C, metabolic enzymes, particularly those of acetyl CoA and amino acid metabolism.

Page 21: Graph Layout in Cellular Networks


Colocalization

Adai et al. J. Mol. Biol. 340, 179 (2004)

Neighboring proteins tend to be in the

same cellular system. The tendency

for proteins to operate in the same

cellular system, as defined by the

percentage of matching assignments

into the 18 COG database pathways,

is plotted against the spatial

separation in multiples of a typical

cluster size.

The functional similarity decays exponentially with distance, proportional to $e^{-0.26 d}$, where d is the separation in multiples of a typical cluster diameter.

Page 22: Graph Layout in Cellular Networks


Comparison with other layout maps

Adai et al. J. Mol. Biol. 340, 179 (2004)

A comparison of LGL with map layouts

produced by other algorithms. The layout of

the protein homology map by LGL (A) is

contrasted with the layout of the same

network by the spring-force algorithm only,

lacking the minimal spanning tree

calculation and iterative layout procedure

(B), and with the layout by the approach of

InterViewer (C). InterViewer

collapses equivalent nodes into single

nodes, thereby simplifying the graph, and is

one of the few available graph layout

programs that scales to such large

networks. The layout from LGL reveals

more of the internal graph structure than

the other approaches tested.

Page 23: Graph Layout in Cellular Networks


Modularity in molecular networks?

A functional module is, by definition, a discrete entity whose function is

separable from those of other modules.

This separation depends on chemical isolation, which can originate from

spatial localization or from chemical specificity.

E.g. a ribosome concentrates the reactions involved in making a polypeptide

into a single particle, thus spatially isolating its function.

A signal transduction system is an extended module that achieves its isolation

through the specificity of the initial binding of the chemical signal to receptor

proteins, and of the interactions between signalling proteins within the cell.

Hartwell et al. Nature 402, C47 (1999)

Page 24: Graph Layout in Cellular Networks


Modularity in molecular networks

Modules can be insulated from or connected to each other.

Insulation allows the cell to carry out many diverse reactions without cross-talk

that would harm the cell.

Connectivity allows one function to influence another.

The higher-level properties of cells, such as their ability to integrate information

from multiple sources, will be described by the pattern of connections among their

functional modules.

Hartwell et al. Nature 402, C47 (1999)

Page 25: Graph Layout in Cellular Networks


Organization of large-scale molecular networks

Organization of molecular networks revealed by large-scale experiments:

- power-law distribution, $P(k) \propto k^{-\gamma}$

- similar distribution of the node degree k (i.e. the number of edges of a node)

- small-world property (i.e. a high clustering coefficient and a small shortest path

between every pair of nodes)

- anticorrelation in the node degree of connected nodes (i.e. highly interacting

nodes tend to be connected to low-interacting ones)

These properties become evident when hundreds or thousands of molecules and

their interactions are studied together.

On the other end of the spectrum: recently discovered motifs that consist of 3-4

nodes.

Page 26: Graph Layout in Cellular Networks


Mesoscale properties of networks

Most relevant processes in biological networks correspond to the mesoscale

(5-25 genes or proteins) not to the entire network.

However, it is computationally enormously expensive to study mesoscale

properties of biological networks.

e.g. a network of 1000 nodes contains on the order of $10^{23}$ possible 10-node sets.
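The number of 10-node sets is the binomial coefficient "1000 choose 10", which can be checked directly (`math.comb` requires Python 3.8 or later):

```python
import math

# Number of distinct 10-node subsets of a 1000-node network.
n_sets = math.comb(1000, 10)
print(f"{n_sets:.2e}")
```

Exhaustive enumeration of all mesoscale node sets is therefore out of the question, which motivates the search heuristics on the following slides.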

Spirin & Mirny analyzed a combined network of protein interactions with data from CELLZOME, MIPS, and BIND: 6500 interactions.

Page 27: Graph Layout in Cellular Networks


Identify connected subgraphs

The network of protein interactions is typically presented as an undirected graph

with proteins as nodes and protein interactions as undirected edges.

Aim: identify highly connected subgraphs (clusters) that have more interactions

within themselves and fewer with the rest of the graph.

A fully connected subgraph, or clique, that is not a part of any other clique is an

example of such a cluster.

In general, clusters need not be fully connected.

Measure the density of connections by

$$Q = \frac{2m}{n(n-1)}$$

where n is the number of proteins in the cluster and m is the number of interactions between them.

Spirin, Mirny, PNAS 100, 12123 (2003)
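A small sketch of the density measure, taking Q as the fraction of the n(n-1)/2 possible pairs that actually interact, i.e. Q = 2m / (n(n-1)); the function name is an illustrative choice.

```python
def cluster_density(n, m):
    """Density Q of a cluster with n proteins and m internal interactions."""
    if n < 2:
        raise ValueError("a cluster needs at least two proteins")
    return 2 * m / (n * (n - 1))


print(cluster_density(4, 6))  # a clique on 4 nodes has all 6 possible edges
print(cluster_density(5, 5))  # 5 of the 10 possible edges are present
```

A clique has Q = 1; sparser clusters have values between 0 and 1.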

Page 28: Graph Layout in Cellular Networks


(method I) Identify all fully connected subgraphs (cliques)

Generally, finding all cliques of a graph is an NP-hard problem.

Because the protein interaction graph is so far very sparse (the number of interactions (edges) is similar to the number of proteins (nodes)), this can nevertheless be done quickly.

To find cliques of size n one needs to enumerate only the cliques of size n-1.

The search for cliques starts with n = 4: pick all pairs of known edges (6500 × 6500 pairs of protein interactions) successively.

For every pair A-B and C-D, check whether there are edges between A and C, A and D, B and C, and B and D. If these edges are present, ABCD is a clique.

For every clique ABCD identified, pick all known proteins successively.

For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known, then ABCDE is a clique of size 5.

Continue for n = 6, 7, ... The largest clique found in the protein-interaction network has size 14.

Spirin, Mirny, PNAS 100, 12123 (2003)
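The size-by-size extension can be sketched as follows. As a simplifying assumption, this version starts from single edges (cliques of size 2) rather than from pairs of edges at n = 4, and it finds the extending proteins by intersecting neighbor sets; the function name is hypothetical.

```python
def cliques_by_size(edges, max_size):
    """Return {size: set of cliques}, each clique a frozenset of nodes."""
    adj = {}
    for u, v in edges:
        if u == v:  # self loops do not contribute to cliques
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    level = {frozenset((u, v)) for u in adj for v in adj[u]}
    result = {2: level}
    for size in range(3, max_size + 1):
        extended = set()
        for clique in level:
            # Nodes that interact with every member of the clique.
            common = set.intersection(*(adj[p] for p in clique))
            for extra in common:
                extended.add(clique | {extra})
        if not extended:
            break
        result[size] = extended
        level = extended
    return result


# A fully connected set {1, 2, 3, 4} with one extra protein attached to 4.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
cliques = cliques_by_size(edges, 6)
```

On a sparse graph the neighbor sets are small, so each extension step stays cheap even though the general problem is hard.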

Page 29: Graph Layout in Cellular Networks


(I) Identify all fully connected subgraphs (cliques)

These results include, however, many redundant cliques.

For example, the clique with size 14 contains 14 cliques with size 13.

To find all nonredundant subgraphs, mark all proteins comprising the clique of size

14, and out of all subgraphs of size 13 pick those that have at least one protein

other than marked.

After all redundant cliques of size 13 are removed, proceed to remove redundant cliques of size 12, etc.

In total, only 41 nonredundant cliques with sizes 4 - 14 were found.

Spirin, Mirny, PNAS 100, 12123 (2003)
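The redundancy filter can be sketched as keeping only those cliques that are not contained in a larger clique already kept. The function name and the frozenset representation are assumptions.

```python
def nonredundant(cliques):
    """Keep only cliques that are not subsets of a larger kept clique."""
    kept = []
    for clique in sorted(cliques, key=len, reverse=True):
        # A clique survives if it has at least one node outside every
        # clique kept so far.
        if not any(clique <= bigger for bigger in kept):
            kept.append(clique)
    return kept


found = [frozenset({1, 2, 3}), frozenset({1, 2}),
         frozenset({2, 3}), frozenset({3, 4})]
print(nonredundant(found))
```

Here the two pairs inside {1, 2, 3} are dropped, while {3, 4} survives because node 4 lies outside the larger clique.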

Page 30: Graph Layout in Cellular Networks


(method II) Superparamagnetic Clustering (SPC)

SPC uses an analogy to the physical properties of an inhomogeneous ferromagnetic model to find tightly connected clusters on a large graph.

Every node on the graph is assigned a Potts spin variable Si = 1, 2, ..., q.

The value of this spin variable Si undergoes thermal fluctuations, which are determined by the temperature T and the spin values on the neighboring nodes.

Energetically, 2 nodes connected by an edge are favored to have the same spin

value. Therefore, the spin at each node tends to align itself with the majority of its

neighbors.

When such a Potts spin system reaches equilibrium for a given temperature T,

high correlation between fluctuating Si and Sj at nodes i and j would indicate that

nodes i and j belong to the same cluster.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 31: Graph Layout in Cellular Networks


(II) Superparamagnetic Clustering (SPC)

The protein-interaction network is represented by a graph where every pair of

interacting proteins is an edge of length 1.

The simulations are run for temperatures ranging from 0 to 1 in units of the

coupling strength.

The network splits into monomers at temperatures between 0.7 and 0.8, whereas larger clusters only exist for temperatures between 0.1 and 0.7.

Clusters are recorded at all values of the temperature.

The overlapping clusters are then merged and redundant ones are removed.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 32: Graph Layout in Cellular Networks


(method III) Monte Carlo Simulation

Use MC to find a tight subgraph of a predetermined number of nodes M.

At time t = 0, a random set of M nodes is selected.

For each pair of nodes i,j from this set, the shortest path Lij between i and j on the

graph is calculated.

Denote the sum of all shortest paths Lij from this set as L0.

At every time step one of M nodes is picked at random, and one node is picked at

random out of all its neighbors.

The new sum of all shortest paths, L1, is calculated if the original node were to be

replaced by this neighbor.

If L1 < L0, accept the replacement with probability 1.

If L1 > L0, accept the replacement with probability

$$\exp\left(-\frac{L_1 - L_0}{T}\right)$$

where T is the effective temperature.

Spirin, Mirny, PNAS 100, 12123 (2003)
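The search can be sketched as a Metropolis simulation in which a move that raises the path sum from L0 to L1 is accepted with probability exp(-(L1 - L0)/T). The sketch makes several assumptions that are not in the slides: the graph is a dict mapping each node to a set of neighbors, neighbors already in the current set are not proposed, unreachable pairs are penalized with the number of nodes, and all function names are hypothetical.

```python
import math
import random
from collections import deque


def bfs_dist(adj, src):
    """Shortest path lengths (in edges) from src to all reachable nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def path_sum(adj, nodes):
    """Sum of shortest paths over all pairs in nodes."""
    nodes = list(nodes)
    total = 0
    for i, u in enumerate(nodes):
        dist = bfs_dist(adj, u)
        for v in nodes[i + 1:]:
            total += dist.get(v, len(adj))  # assumed penalty if unreachable
    return total


def mc_tight_subgraph(adj, m, temperature, steps, seed=0):
    """Return (best node set, its path sum) found by the MC search."""
    rng = random.Random(seed)
    current = set(rng.sample(sorted(adj), m))
    l0 = path_sum(adj, current)
    best, best_l = set(current), l0
    for _ in range(steps):
        old = rng.choice(sorted(current))
        candidates = sorted(adj[old] - current)
        if not candidates:
            continue
        new = rng.choice(candidates)
        trial = (current - {old}) | {new}
        l1 = path_sum(adj, trial)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if l1 <= l0 or rng.random() < math.exp(-(l1 - l0) / temperature):
            current, l0 = trial, l1
            if l0 < best_l:
                best, best_l = set(current), l0
    return best, best_l


# A clique {0, 1, 2, 3} with a tail 3-4-5-6.
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
         4: {3, 5}, 5: {4, 6}, 6: {5}}
best_set, best_sum = mc_tight_subgraph(graph, m=4, temperature=4.0, steps=200)
```

A complete subgraph of M nodes reaches the minimum possible sum M(M-1)/2, which gives a natural stopping criterion.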

Page 33: Graph Layout in Cellular Networks


(III) Monte Carlo Simulation

Every tenth time step an attempt is made to replace one of the nodes from

the current set with a node that has no edges to the current set to avoid

getting caught in an isolated disconnected subgraph.

This process is repeated

(i) until the original set converges to a complete subgraph, or

(ii) for a predetermined number of steps,

after which the tightest subgraph (the subgraph corresponding to the smallest

L0) is recorded.

The recorded clusters are merged and redundant clusters are removed.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 34: Graph Layout in Cellular Networks


Optimal temperature in MC simulation

For every cluster size there is an

optimal temperature that gives the

fastest convergence to the tightest

subgraph.

Spirin, Mirny, PNAS 100, 12123 (2003)

Time to find a clique with size 7 in MC steps

per site as a function of temperature T.

The region with optimal temperature is

shown in Inset.

The required time increases sharply as the

temperature goes to 0, but has a relatively

wide plateau in the region 3 < T < 7.

Simulations suggest that the choice of temperature $T \approx M$ would be safe for any cluster size M.

Page 35: Graph Layout in Cellular Networks


Comparison of SPC and Monte Carlo methods

Comparison of clusters found with SPC (blue) and MC simulation (red).

Reasonable overlap (ca. one third of all clusters are found by both methods), but the two methods appear to be complementary.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 36: Graph Layout in Cellular Networks


Comparison of SPC and Monte Carlo methods

The SPC method is best at detecting clusters with a high Q value and relatively few links to the outside world. An example is the TRAPP complex, a fully connected clique of size 10 with just 7 links to outside proteins.

This cluster was detected perfectly by SPC, whereas the MC simulation found only smaller pieces of this cluster separately rather than the whole cluster.

By contrast, MC simulations are better suited for finding very "outgoing" cliques. The Lsm complex, a clique of size 11, includes 3 proteins with more interactions outside the complex than inside. This complex was easily found by MC, but was not detected as a stand-alone cluster by SPC.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 37: Graph Layout in Cellular Networks


Merging Overlapping Clusters

A simple statistical test shows that nodes which have only one link to a cluster are statistically insignificant. Remove such statistically insignificant members first.

Then merge overlapping clusters:

For every cluster Ai, find all clusters Ak that overlap with it by at least one protein.

For every such cluster, calculate the Q value of the possible merged cluster Ai ∪ Ak. Record the cluster Abest(i) that gives the highest Q value when merged with Ai.

After the best match is found for every cluster, every cluster Ai is replaced by the merged cluster Ai ∪ Abest(i), unless the Q value of Ai ∪ Abest(i) is below a certain threshold value Qc.

This process continues until there are no more overlapping clusters or until merging any of the remaining clusters would produce a cluster with a Q value lower than Qc.

Spirin, Mirny, PNAS 100, 12123 (2003)
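The merging loop can be sketched as below. Note the simplification: this variant merges one best-matching pair at a time and restarts, rather than replacing all clusters simultaneously as described above. The function names, the edge-list input (unique undirected pairs, no self loops) and the threshold value in the example are assumptions.

```python
def q_value(cluster, edges):
    """Density Q = 2m / (n(n-1)) of a cluster given the network edges."""
    n = len(cluster)
    m = sum(1 for u, v in edges if u in cluster and v in cluster)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def merge_clusters(clusters, edges, q_c):
    """Greedily merge overlapping clusters while the union keeps Q >= q_c."""
    clusters = {frozenset(c) for c in clusters}
    changed = True
    while changed:
        changed = False
        for a in sorted(clusters, key=sorted):
            partners = [b for b in clusters if b != a and a & b]
            if not partners:
                continue
            # Partner whose union with a has the highest density.
            best = max(partners, key=lambda b: q_value(a | b, edges))
            merged = a | best
            if q_value(merged, edges) >= q_c:
                clusters -= {a, best}
                clusters.add(merged)
                changed = True
                break  # cluster set changed, restart the scan
    return clusters


# Fully connected proteins 1-4 plus one extra link 4-5.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
merged = merge_clusters([{1, 2, 3}, {2, 3, 4}, {4, 5}], edges, q_c=0.9)
```

In this example the two overlapping triangles merge into one clique of size 4, while merging in {4, 5} would lower Q to 0.7 and is rejected.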

Page 38: Graph Layout in Cellular Networks


Statistical significance of complexes and modules

Number of complete cliques (Q = 1) as

a function of clique size enumerated in

the network of protein interactions

(red) and in randomly rewired graphs

(blue, averaged over >1,000 graphs where

number of interactions for each protein

is preserved).

Inset shows the same plot in log-normal scale. Note the dramatic

enrichment in the number of cliques in

the protein-interaction graph

compared with the random graphs.

Most of these cliques are parts of

bigger complexes and modules.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 39: Graph Layout in Cellular Networks


Statistical significance of complexes and modules

Spirin, Mirny, PNAS 100, 12123 (2003)

Distribution of Q of clusters found by the MC search

method.

Red bars: original network of protein interactions.

Blue curves: randomly rewired graphs.

Clusters in the protein network have many more

interactions than their counterparts in the random

graphs.

Page 40: Graph Layout in Cellular Networks


Architecture of protein network

Fragment of the protein network. Nodes

and interactions in discovered clusters

are shown in bold. Nodes are colored by

functional categories in MIPS:

red, transcription regulation;

blue, cell-cycle/cell-fate control;

green, RNA processing; and

yellow, protein transport.

Complexes shown are the SAGA/TFIID

complex (red), the anaphase-promoting

complex (blue), and the TRAPP complex

(yellow).

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 41: Graph Layout in Cellular Networks


Discovered functional modules

Spirin, Mirny, PNAS 100, 12123 (2003)

Examples of discovered functional modules.

(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and

CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).

Although they have many interactions, these proteins are not present in the cell at the same

time.

(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This

module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-activated protein kinase kinase) kinases, as well as other proteins involved in signal

transduction. These proteins do not form a single complex; rather, they interact in a specific

order.

Page 42: Graph Layout in Cellular Networks


Architecture of protein network

Comparison of discovered complexes and

modules with complexes derived

experimentally (BIND and Cellzome) and

complexes catalogued in MIPS.

Discovered complexes are sorted by the

overlap with the best-matching experimental

complex. The overlap is defined as the

number of common proteins divided by the

number of proteins in the best-matching

experimental complex.

The first 31 complexes match exactly, and

another 11 have overlap above 65%.

Inset shows the overlap as a function of the

size of the discovered complex. Note that

discovered complexes of all sizes match very

well with known experimental complexes.

Discovered complexes that do not match with

experimental ones constitute our predictions.

Spirin, Mirny, PNAS 100, 12123 (2003)

Page 43: Graph Layout in Cellular Networks


Robustness of clusters found

Model effect of false positives in

experimental data: randomly reconnect,

remove or add 10-50% of interactions

in network.

Cluster recovery probability as a

function of the fraction of altered links.

Black curves correspond to the case

when a fraction of links are rewired.

Red, removed;

green, added.

Circles represent the probability to

recover 75% of the original cluster;

triangles represent the probability to

recover 50%.

Spirin, Mirny, PNAS 100, 12123 (2003)

Noise in the form of removal or addition of links has a less detrimental effect than

random rewiring. About 75% of clusters

can still be found when 10% of links are

rewired.

Page 44: Graph Layout in Cellular Networks


Summary

Here: analysis of meso-scale properties demonstrated the presence of highly

connected clusters of proteins in a network of protein interactions. Strong support

for suggested modular architecture of biological networks.

Distinguish 2 types of clusters: protein complexes and dynamic functional modules.

Both complexes and modules have more interactions among their members than

with the rest of the network.

Dynamic modules are elusive to experimental purification because they are not

assembled as a complex at any single point in time.

Computational analysis allows detection of such modules by integrating pairwise

molecular interactions that occur at different times and places.

However, computational analysis alone does not allow one to distinguish between

complexes and modules or between transient and simultaneous interactions.

Page 45: Graph Layout in Cellular Networks


Summary

Most of the discovered complexes and modules come from traditional studies,

rather than from large-scale experiments.

This suggests that although large-scale proteomic studies provide a wealth of

protein interaction data, the scarcity of the data (and its contamination with false

positives) makes such studies less valuable for identification of functional modules.