v8: structure of cellular networks

8. Lecture SS 2005

Cell Simulations 1

V8: Structure of Cellular Networks

Today: Dynamic Simulation of Protein Complex Formation on a Genomic Scale

- Most cellular functions are conducted or regulated by protein complexes of varying size.

- Organization into complexes may contribute substantially to an organism's complexity.

E.g. 6000 different proteins (yeast) may form $18 \cdot 10^6$ different pairs of interacting proteins, but already $10^{11}$ different complexes of size 3.

This is a mechanism by which evolution could significantly increase the regulatory and metabolic complexity of organisms without substantially increasing the genome size.

- Only a very small subset of the many possible complexes is actually realized.

Beyer, Wilhelm, Bioinformatics


Review: Yeast protein interaction network: first example of a scale-free network

A map of protein–protein interactions in

Saccharomyces cerevisiae, which is

based on early yeast two-hybrid

measurements, illustrates that a few

highly connected nodes (which are also

known as hubs) hold the network

together.

The largest cluster, which contains

78% of all proteins, is shown. The colour

of a node indicates the phenotypic effect

of removing the corresponding protein

(red = lethal, green = non-lethal, orange

= slow growth, yellow = unknown).

Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)


Review: Systematic identification of large protein complexes by tandem affinity purification

The yeast two-hybrid method can only identify binary complexes.

Cellzome company: attach an additional protein P to a particular protein Pi; P binds to the matrix of the purification column.

This yields Pi and the proteins Pk bound to Pi.

Gavin et al. Nature 415, 141 (2002)

Identify proteins by mass spectrometry (MALDI-TOF).


BDIM models: birth, death and innovation

Manually curated set of 229 biologically meaningful ‘TAP complexes’ from yeast with sizes ranging from 2 to 88 different proteins per complex.

‘Cumulative’ means that there are 229 complexes of size 2 that may also be parts of larger complexes.


Frequency of complexes in experiments

In large-scale experiments the size-frequency distribution of complexes has a common characteristic:

the number of complexes of a given size decreases exponentially with complex size.

Does the shape of this distribution reflect the nature of the underlying cellular dynamics that creates the protein complexes?

This is tested with a simulation model.


Protein Abundance Data

Abundance of 6200 yeast proteins:

....

Beyer et al. (2004) compiled a protein abundance data set for yeast under standard conditions in YPD-medium. Based on this data set we derived a distribution of protein abundances that resembles the characteristics of the measured data in the upper range (Figure S2). For approximately 2000 proteins no abundance values are available. We assume that the undetected proteins primarily belong to the low-abundance classes, which gives rise to the hypothetical distribution.


Dynamic Complex Formation Model

3 variants of the protein complex association-dissociation model (PAD-model) are

tested with the following features:

(i) In all 3 versions the composition of the proteome does not change with time.

Degradation of proteins is always balanced by an equal production of the same

kind of proteins.

(ii) The cell consists of either one (PAD A & B) or several (PAD C) compartments

in which proteins and protein complexes can freely interact with each other. Thus,

all proteins can potentially bind to all other proteins in their compartment.

(iii) Association and dissociation rate constants are the same for all proteins.

In PAD-models A and C association and dissociation are independent of complex

size and complex structure.


Dynamic Complex Formation Model

(iv) At each time step a set of complexes is randomly selected to undergo

association and dissociation.

Association is simulated as the creation of new complexes by the binding of two

smaller complexes and dissociation is simulated as the reverse process, i.e. it is

the decay of a complex into two smaller complexes.

The number of associations and dissociations per time step is

$k_a \cdot N_C^2$ and $k_d \cdot N_C$, respectively,

$N_C$: total number of complexes in the cell

$k_a$ [1/(#complexes · time)]: association rate constant

$k_d$ [1/time]: dissociation rate constant.

$k_a$ and $k_d$ are mathematically equivalent to the biochemical rate constants of a reversible reaction.
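One such time step can be sketched as follows for PAD A. This is only an illustration under stated assumptions: complexes are plain lists of protein ids, the per-step counts are rounded to integers, a selected monomer simply does not dissociate, and every split of a complex into two non-empty parts is equally likely. The function name `pad_step` and the parameter values are hypothetical and not taken from the original study.

```python
import random


def pad_step(complexes, ka, kd, rng):
    """Perform one association/dissociation time step in place.

    complexes: list of complexes, each a list of protein ids.
    ka, kd: association and dissociation rate constants per time step.
    """
    n_c = len(complexes)
    n_assoc = round(ka * n_c ** 2)   # ka * NC^2 associations per step
    n_dissoc = round(kd * n_c)       # kd * NC dissociations per step

    # Association: two randomly chosen complexes bind to form one.
    for _ in range(n_assoc):
        if len(complexes) < 2:
            break
        i, j = rng.sample(range(len(complexes)), 2)
        merged = complexes[i] + complexes[j]
        for idx in sorted((i, j), reverse=True):  # pop higher index first
            complexes.pop(idx)
        complexes.append(merged)

    # Dissociation: a randomly chosen complex decays into two parts.
    for _ in range(n_dissoc):
        idx = rng.randrange(len(complexes))
        members = complexes[idx]
        if len(members) < 2:
            continue  # assumption: a selected monomer stays intact
        # Draw a random split, rejecting the two trivial ones, so that
        # all splits into two non-empty parts are equally likely.
        while True:
            mask = [rng.random() < 0.5 for _ in members]
            if any(mask) and not all(mask):
                break
        complexes[idx] = [p for p, m in zip(members, mask) if m]
        complexes.append([p for p, m in zip(members, mask) if not m])
    return complexes


if __name__ == "__main__":
    rng = random.Random(42)
    pool = [[p] for p in range(1000)]  # start from free monomers
    for _ in range(100):
        pad_step(pool, ka=2e-4, kd=0.05, rng=rng)
    print(len(pool), "complexes, largest size", max(len(c) for c in pool))
```

Because both processes only regroup existing protein ids, the proteome composition stays constant, in line with feature (i) of the model.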


Protein Association/Dissociation Models

PAD A: the simplest model, where all proteins can interact with each other (no partitioning); it assumes that association and dissociation are independent of complex size.

PAD B : is equivalent to PAD A, except for the assumption that larger complexes

are more likely to bind (preferential attachment). In this case we assume that

the binding probability is proportional to i·j, where i and j are the sizes of two

potentially interacting complexes.

PAD C : extends PAD A by assuming that proteins can interact only within

groups of proteins (with partitioning). The sizes of these protein groups are

based on the sizes of first level functional modules according to the yeast data

base. PAD C assumes 16 modules each containing between 100 and 1000

different ORFs. Hence, the protein groups do not represent physical

compartments, but rather resemble functional modules of interacting proteins.
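The difference between PAD A and PAD B lies only in how the two binding partners are picked. The sketch below contrasts uniform selection with selection proportional to i·j. It assumes that a complex cannot bind to itself and uses rejection sampling to enforce this; the function names are illustrative and not part of the original model description.

```python
import random


def pick_pair_pad_a(sizes, rng):
    """PAD A: every pair of distinct complexes is equally likely."""
    a, b = rng.sample(range(len(sizes)), 2)
    return a, b


def pick_pair_pad_b(sizes, rng):
    """PAD B: pair (a, b) is chosen with probability proportional to
    sizes[a] * sizes[b] (preferential attachment).

    Both partners are drawn independently with weight = size; draws that
    pick the same complex twice are rejected, which leaves the remaining
    pairs weighted exactly by the product of their sizes.
    """
    indices = range(len(sizes))
    while True:
        a, b = rng.choices(indices, weights=sizes, k=2)
        if a != b:
            return a, b


if __name__ == "__main__":
    rng = random.Random(1)
    sizes = [1, 1, 1, 1, 8]  # four monomers and one complex of size 8
    n = 10_000
    frac_a = sum(4 in pick_pair_pad_a(sizes, rng) for _ in range(n)) / n
    frac_b = sum(4 in pick_pair_pad_b(sizes, rng) for _ in range(n)) / n
    print(f"large complex involved: PAD A {frac_a:.2f}, PAD B {frac_b:.2f}")
```

For PAD C the same uniform selection as in PAD A would be applied separately within each protein group.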


Mathematical Description

Since explicit simulation of an entire cell (50 million protein molecules were

simulated) is too time consuming for many applications of the model, we also

developed a mathematical description of the PAD model, which allows us to more

quickly assess different scenarios and parameter combinations.

The change of the number of complexes of size i, $x_i$, during one time step $\Delta t$ can be described as

$$\frac{\Delta x_i}{\Delta t} = G_i^a + G_i^d - L_i^a - L_i^d \qquad (1)$$

$G_i^a$ and $G_i^d$: gains due to association and dissociation

$L_i^a$ and $L_i^d$: losses due to association and dissociation



Given a total number of $N_C$ complexes, the total numbers of associations and dissociations per time step are $k_a \cdot N_C^2$ and $k_d \cdot N_C$, respectively.

We assume throughout that we can calculate the mean number of associating or dissociating complexes of size i per time step as

$2 \cdot k_a \cdot x_i \cdot N_C$ and $k_d \cdot x_i$.

The probability that complexes of size j and i-j get selected for one association is

$$\frac{x_j}{N_C} \cdot \frac{x_{i-j}}{N_C - 1}$$

From this we deduce the number of complexes of size i that get created during each time step via association of smaller complexes simply by summing over all complex sizes that potentially create a complex of size i:

$$G_i^a = \frac{k_a N_C}{N_C - 1} \left( \sum_{j=1}^{i-1} x_j \, x_{i-j} - \delta_i \right)$$

where $\delta_i$ is a small correction term that applies only to even i.



When j is equal to i/2 (which is possible only for even i) both interaction partners have the same size. The size of the pool $x_{i-j}$ is therefore reduced by 1 after the first interaction partner has been selected, which yields a small reduction of the probability of selecting a second complex from that pool. We account for this effect with the correction $\delta_i$, which only applies to even i:

$$\delta_i = \begin{cases} x_{i/2} & \text{if } i \text{ even} \\ 0 & \text{else} \end{cases}$$

This correction is usually very small.

The loss of complexes of size i due to association is simply proportional to the probability of selecting them for association, i.e.

$$L_i^a = 2 \, k_a \, x_i \, N_C$$



Complexes of size i get created by dissociation of larger complexes. A complex of size j has

$$N_j = 2^{j-1} - 1$$

possible ways of dissociation and the number of possible fragments of size i is

$$N_i^j = \binom{j}{i}$$

The probability that a dissociating complex of size j > i creates a fragment of size i is hence $N_i^j / N_j$.

The number of new complexes follows by summing over all possible parent sizes

$$G_i^d = k_d \sum_{j>i} x_j \, \frac{N_i^j}{N_j}$$

The respective loss term becomes

$$L_i^d = k_d \, x_i$$
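The four terms of equation (1) can be evaluated numerically as sketched below. The sketch rests on a few assumptions that are not spelled out on the slides: a complex of size j has 2^(j-1) - 1 possible splits and C(j, i) possible fragments of size i, the even-i correction subtracts x[i/2] from the pair sum, monomers have no dissociation loss, and the size distribution is truncated at a maximum size. The function name `rate_of_change` is illustrative.

```python
from math import comb


def rate_of_change(x, ka, kd):
    """Return dx_i/dt for all sizes; x[i] is the number of complexes of
    size i, and x[0] is an unused placeholder."""
    n = len(x) - 1
    n_c = sum(x[1:])                      # total number of complexes
    dx = [0.0] * (n + 1)
    for i in range(1, n + 1):
        # Gain by association of sizes j and i - j.
        pairs = sum(x[j] * x[i - j] for j in range(1, i))
        if i % 2 == 0:
            pairs -= x[i // 2]            # correction for equal partners
        gain_a = ka * n_c * pairs / (n_c - 1)
        # Loss by association: size i is picked as one of two partners.
        loss_a = 2 * ka * x[i] * n_c
        # Gain by dissociation of larger complexes of size j > i.
        gain_d = kd * sum(
            x[j] * comb(j, i) / (2 ** (j - 1) - 1)
            for j in range(i + 1, n + 1)
        )
        # Loss by dissociation (assumption: monomers cannot dissociate).
        loss_d = kd * x[i] if i > 1 else 0.0
        dx[i] = gain_a + gain_d - loss_a - loss_d
    return dx


if __name__ == "__main__":
    # Sizes above 4 start empty so that no association product exceeds
    # the truncation size of 8.
    x = [0, 50, 20, 10, 5, 0, 0, 0, 0]
    dx = rate_of_change(x, ka=1e-3, kd=0.1)
    protein_balance = sum(i * d for i, d in enumerate(dx))
    print("change in total protein number:", protein_balance)
```

A useful sanity check is that the size-weighted sum of all changes vanishes, because association and dissociation only regroup proteins and never create or destroy them.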



Figure S1 shows a comparison of a numerical solution of equation (1) with a

stochastic simulation of the association-dissociation process.



After a transient period a steady-state is reached. We are mainly interested in this

steady-state distribution of frequencies xi.

We therefore need to find a set of $x_i$ solving $\Delta x_i / \Delta t = 0$.

The solution of this non-linear equation system is obtained by numerically minimizing all $|\Delta x_i / \Delta t|$.

By dividing equation (1) by kd it can be seen that the steady-state distribution is

independent of the absolute values of ka and kd, but it only depends on the ratio of

the two parameters Rad = ka / kd.

Hence, only two parameters affect the xi at steady-state:

- the total number of proteins NP (which indirectly determines NC) and

- the ratio of the two rate constants Rad.
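The division by kd can be written out in one line. The gain and loss terms for association are each proportional to ka, and those for dissociation are each proportional to kd. The tilde symbols below denote the same terms with the rate constant factored out; this shorthand is introduced here for illustration only.

```latex
0 = \frac{1}{k_d}\,\frac{\Delta x_i}{\Delta t}
  = R_{ad}\left(\tilde{G}_i^a - \tilde{L}_i^a\right)
  + \left(\tilde{G}_i^d - \tilde{L}_i^d\right),
\qquad
G_i^a = k_a\,\tilde{G}_i^a,\quad
L_i^a = k_a\,\tilde{L}_i^a,\quad
G_i^d = k_d\,\tilde{G}_i^d,\quad
L_i^d = k_d\,\tilde{L}_i^d
```

The tilde terms depend only on the $x_i$, so the steady-state condition contains the rate constants only through $R_{ad}$.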



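For PAD-model B the dissociation terms remain unchanged, whereas the association terms have to be modified.

We assume that association is proportional to the product of the sizes of the participating complexes. This assumption changes equation (2) to:

$$G_i^a - L_i^a = k_a N_C^2 \, c \left( \sum_{j=1}^{i-1} j \, (i-j) \, x_j \, x_{i-j} - \delta_i - 2 \, i \, x_i \sum_{k=1}^{n} k \, x_k \right)$$

with a normalization constant

$$c = \left( \sum_{k=1}^{n} \sum_{l=1}^{n} k \, l \, x_k \, x_l \right)^{-1}$$

where n is the maximum complex size and

$$\delta_i = \begin{cases} \frac{i^2}{4} \, x_{i/2} & \text{if } i \text{ even} \\ 0 & \text{else} \end{cases}$$

In case of PAD C we calculated weighted averages of results obtained with PAD A.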


Measurable Size Distribution and Bait Selection

Based on the distribution resulting from equation (1) at steady-state we derive two further

distributions: (i) the ‘measurable size distribution’ and (ii) the ‘bait distribution’. The former is

defined as the frequency distribution of the measurable complex sizes.

The measurable complex size is the

number of different proteins in a

protein complex (as opposed to the

total number of proteins).

For the measurable size-distribution

we only count the number of

complexes with distinct protein

compositions.

Measurable versus ‘actual’ complex size distribution. Diamonds show frequencies of actual complex sizes and triangles are frequencies of measurable complexes. Filled diamonds and triangles reflect simulation without partitioning (PAD A) and open diamonds and triangles are simulation results assuming binding only within certain modules (PAD C). The difference between the original and the measurable complex size distribution is comparably small, because most of the simulated complexes are unique. However, in case of PAD C smaller complexes occur at higher copy numbers and larger complexes are often counted as smaller measurable complexes because they contain some proteins more than once.
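The distinction between actual and measurable size can be made concrete with a short sketch. It assumes that a complex is represented as a list of protein type names, and that two complexes count as the same composition when they contain the same set of protein types (copy numbers are ignored, since they are not measured). Both function names are illustrative.

```python
from collections import Counter


def measurable_size(complex_proteins):
    """Number of different protein types in one complex."""
    return len(set(complex_proteins))


def size_distributions(complexes):
    """Return (actual, measurable) size-frequency distributions.

    actual: counts every complex by its total number of proteins.
    measurable: counts each distinct composition once, by its number
    of different protein types.
    """
    actual = Counter(len(c) for c in complexes)
    distinct = {frozenset(c) for c in complexes}
    measurable = Counter(len(s) for s in distinct)
    return actual, measurable


if __name__ == "__main__":
    complexes = [
        ["A", "B"],
        ["B", "A"],        # same composition as the first complex
        ["A", "A", "C"],   # actual size 3, measurable size 2
        ["D"],
    ]
    actual, measurable = size_distributions(complexes)
    print("actual:    ", dict(actual))
    print("measurable:", dict(measurable))
```

In this toy input the complex containing protein A twice moves from size 3 to measurable size 2, which is the effect described for PAD C above.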


Computation of a Dissociation Constant KD

Mathematically our model describes a reversible (bio-)chemical reaction.

We can therefore calculate an equilibrium dissociation constant KD, which quantifies the fraction of free subcomplexes A and B compared to the bound complex AB.

This equilibrium is complex size dependent, because a large complex AB is less

likely to randomly dissociate exactly into the two specific subunits A and B than a

small complex. (A and B can be ensembles of several proteins.)

We get for any given complex of size i the following $K_D$:

$$K_D(i) = \frac{[A][B]}{[AB]} = (R_{ad} \cdot N_i \cdot V)^{-1} \qquad (4)$$

where $N_i$ is the number of possible fragments of a complex of size i and V is the cell volume. Cell-wide averages of $K_D$-values are estimated by computing a weighted average

$$\overline{K_D} = \sum_i K_D(i) \cdot \frac{x_i}{N_C}$$

with $N_C$ being the total number of complexes and $x_i$ being the number of complexes of size i.
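Equation (4) and the weighted average can be evaluated as in the following sketch. It assumes that N_i = 2^(i-1) - 1, i.e. the number of ways to split a complex of i distinct proteins into two non-empty parts, and that only complexes of size 2 or larger contribute to the sum, since a monomer cannot dissociate. All quantities are in arbitrary but consistent units, and the function names are illustrative.

```python
def n_splits(i):
    """Number of ways to split a complex of i distinct proteins in two."""
    return 2 ** (i - 1) - 1


def kd_of_size(i, r_ad, volume):
    """Size-dependent dissociation constant K_D(i) = 1 / (Rad * N_i * V)."""
    return 1.0 / (r_ad * n_splits(i) * volume)


def mean_kd(x, r_ad, volume):
    """Weighted average of K_D(i) with weights x_i / N_C.

    x maps complex size to the number of complexes of that size.
    Assumption: sizes below 2 carry no K_D and are skipped in the sum,
    while N_C still counts all complexes.
    """
    n_c = sum(x.values())
    total = sum(
        kd_of_size(i, r_ad, volume) * count
        for i, count in x.items()
        if i >= 2
    )
    return total / n_c


if __name__ == "__main__":
    distribution = {1: 600, 2: 250, 3: 100, 4: 40, 5: 10}
    for size in (2, 3, 4, 5):
        print(size, kd_of_size(size, r_ad=1e-3, volume=1.0))
    print("average:", mean_kd(distribution, r_ad=1e-3, volume=1.0))
```

Since N_i roughly doubles with every additional protein, each step in size roughly halves K_D(i) in this sketch.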


Biochemical Interpretation of the Rate Constants

The process of forming a protein complex AB from the two subcomplexes A and

B, and its dissociation can be described as a reversible reaction:

$$A + B \rightleftharpoons AB$$

with constants $k_{on}$ [L/(mol s)] and $k_{off}$ [1/s] quantifying the forward and backward reactions:

$$\frac{d[AB]}{dt} = k_{on} [A][B] - k_{off} [AB]$$

In our model the concentration [A] can be calculated as

$$[A] = \frac{f_A \cdot N_C}{V}$$

with $f_A$ being the fraction of species A among all $N_C$ complexes in the system and V being the cell volume.



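The number of associations of two complex-species A and B per time step becomes

$$\frac{N^{assoc}_{A,B}}{V} = \frac{k_a N_C^2}{V} \cdot \frac{n_A \, n_B}{N_C (N_C - 1)} \approx \frac{k_a N_C^2}{V} \, f_A f_B = k_a V [A][B]$$

since we assume $k_a \cdot N_C^2$ many associations per time step.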

Here, nA and nB are the number of complexes of the respective species.

Division by the cell-volume V yields units of ‘concentration per time’.

Thus, kon in a biochemical reaction approximately equals ka ·V, since the total

number of complexes NC is very large in all scenarios that we have simulated.



When looking for an equivalent expression for koff we have to quantify the specific

dissociation of a complex AB into the subcomplexes A and B.

The unspecific dissociation of AB is simply kd ·[AB],

kd : dissociation rate constant.

Since AB may consist of > 2 proteins it can also be split into subcomplexes other

than A and B. For the specific dissociation rate, one has to know how often AB

actually dissociates into the subcomplexes A and B.

The total number of dissociations per time step is kd · NC. The probability that a

complex AB with size i breaks into the specific sub-complexes A and B is 1/Ni,

Ni : number of possible fragments of a complex of size i.

This holds under the assumption that all proteins in AB are distinct, which is

approximately true for the simulations conducted here.



$n_{AB}/N_C$: fraction of complexes AB among all complexes

Size-specific dissociation rate $N^{dissoc}_{A,B}(i)$:

$$\frac{N^{dissoc}_{A,B}(i)}{V} = \frac{k_d N_C}{V} \cdot \frac{n_{AB}}{N_C} \cdot \frac{1}{N_i} = \frac{k_d}{N_i} \, [AB]$$

from which the complex size dependent rate constant $k_{off}(i) = k_d / N_i$ results.

Taking into account that certain proteins may be in the complex more than once

we get koff = kd/Ni.

One can calculate an apparent equilibrium constant KD, which describes the

equilibrium between the independent species A and B and the bound species AB:

$$K_D(i) = \frac{[A][B]}{[AB]} = \frac{k_{off}}{k_{on}} = \frac{k_d}{k_a \cdot N_i \cdot V}$$

where i is the size of the complex AB.

Since Ni is exponentially increasing with i, KD is exponentially decreasing

with complex size.


Results

We dynamically simulated the association and

dissociation of 6200 different protein types

yielding a set of about 50 million protein

molecules. Subsequently we analyzed the

resulting steady-state size distribution of

protein complexes. This steady-state is

thought to reflect the log-growth conditions

under which the yeast cells were held when

TAP-measuring the protein complexes (Gavin

et al., 2002). Based on measured protein

complex data (Gavin et al., 2002) we

calculated a protein complex size distribution

to which we can compare the simulation

results (Figure 1).


Results

The TAP measurements do not provide concentrations of the measured

complexes, but they only demonstrate the presence of a certain protein complex

in yeast cells. In addition, also the number of proteins of a certain type inside

such a complex could not be measured. Hence, the complex size from Figure 1

does not represent real complex sizes (i.e. total number of proteins in the

complex), but it refers to the number of different proteins in a complex.

The measured data reflect the characteristics of only 229 different protein

complexes of size two and larger, which is just a small subset of the

‘complexosome’. These peculiarities have to be taken into account when

comparing simulation results to the observed complex size distribution. We refer

to the ‘measurable complex size’ as the number of distinct proteins in a protein

complex (Figure 2). When comparing our simulation results to the measurements,

we always select a random-subset of 229 different complexes from the simulated

pool of complexes. This results in a complex size distribution comparable to the

measured distribution from Figure 1 (‘bait distribution’).


Effect of preferential attachment

Cumulative number of distinct protein

complexes versus their size, resulting

from simulations without (diamonds) and

with (squares) preferential attachment to

larger complexes. Both simulations are

performed with the best fit parameters

for PAD A. In case of preferential

attachment the best regression result

(solid line) is obtained with a power-law,

while the simulation without preferential

attachment is best fitted assuming an

exponentially decreasing curve. The

original, measurable and bait

distributions are always close to

exponential in case of PAD A and

power-law like in case of PAD B,

independent of the parameters chosen.

The PAD B model gives a power-law distribution, which is not in agreement with the experimental observation.
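One simple way to decide between the two curve shapes is to compare straight-line fits in semi-logarithmic and double-logarithmic coordinates. The sketch below does this with ordinary least squares; it is not necessarily the regression procedure used for the figure, and the function names and synthetic data are purely illustrative.

```python
import math


def linear_fit_r2(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def compare_fits(sizes, counts):
    """R^2 of an exponential fit (log count vs size) and of a power-law
    fit (log count vs log size). All counts must be positive."""
    log_counts = [math.log(c) for c in counts]
    log_sizes = [math.log(s) for s in sizes]
    return {
        "exponential": linear_fit_r2(sizes, log_counts)[2],
        "power_law": linear_fit_r2(log_sizes, log_counts)[2],
    }


if __name__ == "__main__":
    sizes = list(range(2, 21))
    # Synthetic, exponentially decreasing cumulative counts.
    counts = [229 * math.exp(-0.3 * (s - 2)) for s in sizes]
    print(compare_fits(sizes, counts))
```

The model whose transformed data lie closer to a straight line has the higher R^2; for the synthetic exponential data above that is the exponential fit.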


Conclusions

- A very simple, dynamic model can reproduce the observed complex size distribution. Given the small number of input parameters, the very good fit of the observed data is astonishing.

Conclusion 1: Preferential attachment does not take place in yeast cells under the investigated conditions. This is biologically plausible: specific and strong binding can be just as important for small protein complexes as for large complexes.

Hence, the dissociation should on average be independent of the complex size. The interpretation of the simulated association and dissociation in terms of KD-values suggests that larger complexes bind more strongly than smaller complexes.

However, the size dependence of KD is compensated by the higher number of

possible dissociations in larger complexes.

We always assume that all possible dissociations happen with the same

probability. In reality large complexes may break into specific subcomplexes,

which subsequently can be re-used for a different purpose.

Improved versions of the model should account for specificity of association

and for specific dissociation.


Conclusions

Conclusion 2: The number of complexes that were missed during the TAP measurements is potentially large. Simulations give an upper limit of the number of different complexes in cells.

At first glance, the number of different complexes in PAD A (> 3.5 million) and PAD C (~ 2 million) may appear to be far too large. Even PAD C may overestimate the true number of different complexes, because association within the groups is unrestricted. However, the PAD-models do not only simulate functional, mature complexes, but they also consider all intermediate steps. Each of these steps is counted as a different protein complex. The large difference between the number of measured complexes and the (potential) number of existing complexes may partly explain the very small overlap that has been observed between different large scale measurements of protein complexes.

A correct interpretation of the kinetic parameters is important. First of all, ka and kd cannot be compared to experimentally measured values, because the model does not define a length of the time steps for interpreting ka and kd as actual rate constants. In addition, the association-to-dissociation ratio Rad is not identical to a physical KD-value obtained by in vitro measurements of protein binding in aqueous solutions.


Discussion

Several reasons do not allow for this simple interpretation:

(i) In vivo diffusion rates are below those in water due to the high concentration of

proteins and other large molecules in the cytosol.

(ii) Most proteins either are synthesized where they are needed or they get

transported directly to the site where the complex gets compiled. Hence,

transport to the site of action is on average faster than random diffusion.

(iii) Protein concentrations are often above the cell average due to the

compartmentalization of the cell. All these processes (protein production,

transport, and degradation) are not explicitly described in the PAD-model, but

they are lumped in our assumptions.

The Rad (and the KD derived from it) must therefore be interpreted as an

operationally defined property. It characterizes the overall, cell averaged complex

assembly process, which includes all steps necessary to synthesize a protein

complex.


Discussion

However, even the model-derived KD values allow for some conclusions regarding complex formation. We calculated weighted averages of the size-dependent KD values by using the steady-state complex size distribution of the best fit.

This yields average KD values of 4.7 nM and 0.18 nM for the best fits of PAD A and PAD C, respectively. First, the fact that the KD for PAD C is below that of PAD A underlines the notion that more specific binding is reflected by smaller KD values. Second, typical in vitro KD values are above 1 nM, thus the average KD of PAD C is comparatively low.

The model therefore confirms that protein complex formation in vivo gets accelerated due to directed protein transport and due to the compartmentalization of eukaryotes. It is a surprising finding, though, that important aspects of these highly regulated protein synthesis and transport processes can on average be described by a simple compartment model assuming random association and dissociation.

Large scale protein-protein interaction data sets are subject to substantial error, resulting in a potentially large number of false positives and false negatives.


Possible Limitations

In order to get a correct picture of the protein complex size distribution it is

necessary to have an unbiased, random subset of all complexes in the cells.

TAP data are biased, e.g. contain too few membrane proteins. However, if

compared to other data sets such as MIPS complexes, the TAP complexes

constitute a fairly random selection of all protein complexes in yeast.

Uncertainties in the TAP data do not affect our conclusions as long as they are

not strongly biased with respect to the resulting complex size distribution.

Since Gavin et al. (2002) have measured long-term interactions, our results apply

to permanent complexes. Yet the model is applicable to future protein complex

data that take account of transient binding.


Discussion

The simulated complex size distribution is almost independent of the assumed

protein abundance distribution.

PP is a valuable summarizing property that can be used to characterize

proteomes of different species. A decreasing PP increases the number of different

large complexes (the slope in Table 1 gets more shallow), because it is less likely

that a large complex contains the same protein twice.

Thus, PP is a measure of complexity that not only relates to the diversity of the

proteome but also to the composition of protein complexes.

Probably the most severe simplification in our model is the assumption that all

proteins can potentially interact with each other. The PAD-model C is a first step

towards more biological realism. By restricting the number of potential interaction

partners it more closely maps functional modules and cell compartments, which

both restrict the interaction among proteins.


Further improvements

The partitioning in PAD C implies that proteins within one group exhibit very strong binding, whereas binding between protein groups is set to zero. This again

is a simplification, since cross-talk between different modules or compartments is

possible.

Future extensions of the model could incorporate more and more detailed

information about the binding specificity of proteins. Assuming even more specific

binding will further reduce the number of different complexes, whereas the

frequency of the complexes will increase. High binding specificity potentially

lowers the complex sizes, so Rad has to be increased in order to fit the

experimentally observed protein complex size distribution.

On the other hand, cross talk gives rise to larger complexes. Taking both

counteracting refinements into account, it is impossible to generally predict the

best-fit Rad, since it depends on the quantitative details.


Further improvements

A first additional refinement of PAD C could account for the observed clustering

of protein interaction networks.

In a second step one could simulate protein associations and dissociations

according to predefined binary protein interactions.

The most detailed model could additionally account for individual association/dissociation rates between individual proteins. Such extensions will yield more realistic figures for the number of different protein complexes created in yeast cells.

However, starting model development with the simplest assumptions reveals the most important characteristics of the system for reproducing the observations. The very good match that we already obtain with the simplest model, PAD A, is striking.


Measurable Size Distribution and Bait Selection

In order to determine the distribution of measurable complex sizes corresponding

to the steady-state distribution, we create a set of complexes according to the

original steady-state size distribution by randomly ‘filling’ the complexes with

proteins from the protein abundance distribution. We then compute the resulting

measurable size distribution. Results shown are averages over several such random sets.

The bait distribution is used to compare the simulation to the actual TAP

measurements. The bait distribution is obtained by randomly selecting and

subsequently analyzing a subset from all simulated complexes. We call that

distribution ‘bait distribution’, because the process of selecting a subset of all

complexes corresponds to selecting bait proteins for pulling out the complexes

during the measurements. We always select 229 different complexes, which is

the number of TAP complexes to which we

compare the simulations.
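The two steps described above, filling complexes with proteins and then selecting a subset of 229 distinct complexes, can be sketched as follows. The sketch makes several assumptions: proteins are drawn with replacement in proportion to their abundance (a good approximation only when the pool is much larger than any complex), a complex is identified by its set of protein types, only complexes with at least two different proteins are eligible, and "cumulative" means the number of complexes of a given size or larger. All function names and the toy abundance values are hypothetical.

```python
import random
from collections import Counter


def fill_complexes(sizes, abundances, rng):
    """Fill one complex per entry of `sizes` with protein types drawn
    in proportion to their abundance."""
    types = list(abundances)
    weights = [abundances[t] for t in types]
    return [rng.choices(types, weights=weights, k=s) for s in sizes]


def bait_distribution(complexes, n_baits, rng):
    """Pick n_baits distinct complexes at random and return the
    frequency of their measurable sizes."""
    distinct = list({frozenset(c) for c in complexes if len(set(c)) >= 2})
    chosen = rng.sample(distinct, min(n_baits, len(distinct)))
    return Counter(len(c) for c in chosen)


def cumulative_counts(size_counts):
    """Number of complexes with measurable size >= s, for each size s."""
    running = 0
    result = {}
    for s in sorted(size_counts, reverse=True):
        running += size_counts[s]
        result[s] = running
    return dict(sorted(result.items()))


if __name__ == "__main__":
    rng = random.Random(7)
    abundances = {p: 1 + p % 5 for p in range(500)}   # toy abundance data
    sizes = [2] * 300 + [3] * 100 + [5] * 50          # toy size distribution
    complexes = fill_complexes(sizes, abundances, rng)
    bait = bait_distribution(complexes, 229, rng)
    print(cumulative_counts(bait))
```

Averaging the result over several random fills and selections would correspond to the averaging over random sets mentioned above.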


Only the Probability of Selecting the Same Protein Twice Matters

In a real protein abundance distribution each gene can potentially have a different

expression level. However, for our model it is sufficient to use averaged

properties of the protein abundance distribution.

Each protein complex can be viewed as a set of proteins that are independently

drawn from the pool of available proteins, because we assume that association is

independent of the particular proteins. The measurable complex size is the actual

complex size reduced by the number of proteins that are drawn twice or more

often. The measurable complex size is on average only affected by the probability

PP of pulling the same protein twice out of the pool of available

proteins.

This probability can be calculated by using the fact that $(n_i - 1)/(N_P - 1)$ is the probability of selecting a second protein of the specific type i:

$$P_P = \sum_{i=1}^{k} \frac{n_i}{N_P} \cdot \frac{n_i - 1}{N_P - 1} \approx \frac{1}{N_P^2} \sum_{i=1}^{k} n_i^2$$

where $N_P$ is the total number of proteins in the cell, k is the number of protein types (or genes), and $n_i$ is the number of molecules of the respective protein type.
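As a small worked example, PP can be computed directly from a list of copy numbers. The sketch gives the exact expression for drawing two molecules without replacement and the large-pool approximation that replaces (n_i - 1)/(N_P - 1) by n_i/N_P. The function names and the example copy numbers are illustrative.

```python
def prob_same_protein_twice(counts):
    """Exact P_P: probability that two molecules drawn without
    replacement from the pool belong to the same protein type.

    counts: number of molecules n_i for each protein type.
    """
    n_p = sum(counts)
    return sum(n * (n - 1) for n in counts) / (n_p * (n_p - 1))


def approx_prob_same_protein_twice(counts):
    """Large-pool approximation: sum of n_i^2 divided by N_P^2."""
    n_p = sum(counts)
    return sum(n * n for n in counts) / n_p ** 2


if __name__ == "__main__":
    uniform = [1000] * 50          # 50 types with equal copy numbers
    skewed = [40000] + [204] * 49  # one dominant type, similar pool size
    for name, counts in (("uniform", uniform), ("skewed", skewed)):
        print(name,
              prob_same_protein_twice(counts),
              approx_prob_same_protein_twice(counts))
```

A more uneven abundance distribution gives a larger PP, which is why PP can summarize the abundance distribution for the purposes of this model.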


Computation of a Dissociation Constant KD

Theoretically one would have to separately account for proteins that are present three or more times in one complex. However, the likelihood that a protein gets selected for a third time is orders of magnitude below the probability of selecting a protein twice. Therefore, the few cases in which a protein gets selected more than twice are negligible. We have verified this statement by corresponding simulations.

Additionally, the measurable complex size distribution counts only distinct

complexes, i.e. we have to reduce the complex frequency by those complexes

that appear more than once. In model PAD A, which assumes one large

compartment, it is very unlikely that a complex of size 3 appears more than once

due to the heterogeneity of the protein pool. Hence, in this case mainly the

frequency of complexes with size 2 (x2) is affected by counting only distinct

complexes. This x2 in turn depends on PP. Since model PAD C assumes more

homogenous interacting groups, the likelihood to find larger complexes more than

once increases. However, the larger homogeneity is reflected by a larger PP. In

summary, for the scenarios that we have simulated PP proved to be a very good

descriptor of the protein abundance distribution.
