v8: structure of cellular networks
8. Lecture SS 20005
Cell Simulations 1
V8: Structure of Cellular Networks
Today: Dynamic Simulation of Protein Complex Formation on a Genomic Scale
- Most cellular functions are conducted or regulated by protein complexes of
varying size.
- Organization into complexes may contribute substantially to an organism's
complexity.
E.g. 6000 different proteins (yeast) may form 18 · 10^6 different pairs of
interacting proteins, but already 10^11 different complexes of size 3.
This is a mechanism by which evolution could significantly increase the regulatory and
metabolic complexity of organisms without substantially increasing the genome
size.
- Only a very small subset of the many possible complexes is actually realized.
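The combinatorial argument above can be checked with binomial coefficients. The sketch below assumes that a "pair" or "complex of size 3" means an unordered set of distinct proteins; under that assumption the pair count reproduces the 18 · 10^6 on the slide, while the count for size 3 comes out at a few times 10^10, so the slide's figure is best read as a rounded order-of-magnitude estimate.

```python
from math import comb

n_proteins = 6000  # approximate number of yeast proteins quoted on the slide

# Unordered sets of distinct proteins (an assumption about how the slide counts).
pairs = comb(n_proteins, 2)
triples = comb(n_proteins, 3)

print(f"possible pairs:   {pairs:.2e}")
print(f"possible triples: {triples:.2e}")
```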
Beyer, Wilhelm, Bioinformatics
Review: Yeast protein interaction network: first example of a scale-free network
A map of protein–protein interactions in
Saccharomyces cerevisiae, which is
based on early yeast two-hybrid
measurements, illustrates that a few
highly connected nodes (which are also
known as hubs) hold the network
together.
The largest cluster, which contains
78% of all proteins, is shown. The colour
of a node indicates the phenotypic effect
of removing the corresponding protein
(red = lethal, green = non-lethal, orange
= slow growth, yellow = unknown).
Barabasi & Oltvai, Nature Rev Gen 5, 101 (2004)
Review: Systematic identification of large protein complexes by tandem affinity purification
The yeast two-hybrid method can only identify binary complexes.
Cellzome company: attach an additional protein P to a particular protein Pi;
P binds to the matrix of the purification column.
This yields Pi and the proteins Pk bound to Pi.
Identify proteins by mass spectrometry (MALDI-TOF).
Gavin et al. Nature 415, 141 (2002)
BDIM models: birth, death and innovation
Manually curated set of 229 biologically meaningful ‘TAP complexes’ from yeast
with sizes ranging from 2 to 88 different proteins per complex.
‘Cumulative’ means that
there are 229 complexes
of size 2 that may also be
parts of larger complexes.
Frequency of complexes in experiments
In large-scale experiments, the size-frequency distribution of complexes has a common
characteristic:
the number of complexes of a given size decreases exponentially with complex
size.
Does the shape of this distribution reflect the nature of the underlying
cellular dynamics that create the protein complexes?
We test this with a simulation model.
Protein Abundance Data
Abundance of 6200 yeast proteins:
Beyer et al. (2004) compiled a protein abundance data set for yeast under standard conditions in YPD-medium. Based on this data set we derived a distribution of protein abundances that resembles the characteristics of the measured data in the upper range (Figure S2). For approximately 2000 proteins no abundance values are available. We assume that the undetected proteins primarily belong to the low-abundance classes, which gives rise to the hypothetical distribution.
Dynamic Complex Formation Model
3 variants of the protein complex association-dissociation model (PAD-model) are
tested with the following features:
(i) In all 3 versions the composition of the proteome does not change with time.
Degradation of proteins is always balanced by an equal production of the same
kind of proteins.
(ii) The cell consists of either one (PAD A & B) or several (PAD C) compartments
in which proteins and protein complexes can freely interact with each other. Thus,
all proteins can potentially bind to all other proteins in their compartment.
(iii) Association and dissociation rate constants are the same for all proteins.
In PAD-models A and C association and dissociation are independent of complex
size and complex structure.
Dynamic Complex Formation Model
(iv) At each time step a set of complexes is randomly selected to undergo
association and dissociation.
Association is simulated as the creation of new complexes by the binding of two
smaller complexes and dissociation is simulated as the reverse process, i.e. it is
the decay of a complex into two smaller complexes.
The number of associations and dissociations per time step is
ka · NC^2 and kd · NC, respectively, where
NC : total number of complexes in the cell
ka [1/(#complexes · time)] : association rate constant
kd [1/time] : dissociation rate constant.
ka and kd are mathematically equivalent to biochemical rates of a reversible
reaction.
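A minimal sketch of one such time step, tracking only complex sizes rather than protein identities. The function name `pad_step`, the rounding of the event counts, and the rule that free proteins (size 1) are skipped when picking a complex to dissociate are assumptions made for illustration; the parameter values are arbitrary. A dissociating complex is split by assigning each of its proteins to one of two fragments at random, which treats all possible splits as equally likely.

```python
import random

def pad_step(sizes, ka, kd, rng):
    """Advance a list of complex sizes by one time step (size-only sketch)."""
    n_c = len(sizes)
    n_assoc = round(ka * n_c ** 2)   # ka * NC^2 associations per step
    n_dissoc = round(kd * n_c)       # kd * NC dissociations per step

    for _ in range(n_assoc):
        if len(sizes) < 2:
            break
        i, j = rng.sample(range(len(sizes)), 2)    # two distinct complexes
        merged = sizes[i] + sizes[j]
        for idx in sorted((i, j), reverse=True):   # pop the higher index first
            sizes.pop(idx)
        sizes.append(merged)

    for _ in range(n_dissoc):
        # Assumption: only complexes with at least two proteins can dissociate.
        splittable = [idx for idx, s in enumerate(sizes) if s > 1]
        if not splittable:
            break
        s = sizes.pop(rng.choice(splittable))
        # Assign each protein to one of two fragments; reject empty fragments.
        while True:
            a = sum(rng.random() < 0.5 for _ in range(s))
            if 0 < a < s:
                break
        sizes.extend([a, s - a])
    return sizes

rng = random.Random(0)
sizes = [1] * 1000               # start from 1000 free proteins
for _ in range(20):
    pad_step(sizes, ka=1e-4, kd=0.05, rng=rng)
```

Because association and dissociation only merge or split complexes, the total number of proteins `sum(sizes)` stays constant, which mirrors feature (i) of the model.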
Protein Association/Dissociation Models
PAD A : the simplest model, in which all proteins can interact with each other
(no partitioning) and association and dissociation are independent
of complex size.
PAD B : is equivalent to PAD A, except for the assumption that larger complexes
are more likely to bind (preferential attachment). In this case we assume that
the binding probability is proportional to i·j, where i and j are the sizes of two
potentially interacting complexes.
PAD C : extends PAD A by assuming that proteins can interact only within
groups of proteins (with partitioning). The sizes of these protein groups are
based on the sizes of first level functional modules according to the yeast data
base. PAD C assumes 16 modules each containing between 100 and 1000
different ORFs. Hence, the protein groups do not represent physical
compartments, but rather resemble functional modules of interacting proteins.
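The preferential attachment rule of PAD B can be sketched as a weighted choice of two binding partners. The helper `pick_pair_preferential` is hypothetical: it draws two complexes independently with probability proportional to their size and rejects draws that pick the same complex twice, so that a pair of sizes i and j is selected with probability proportional to i·j.

```python
import random

def pick_pair_preferential(sizes, rng):
    """Return indices of two distinct complexes, chosen with weight size_a * size_b."""
    indices = range(len(sizes))
    while True:
        a, b = rng.choices(indices, weights=sizes, k=2)
        if a != b:           # a complex cannot bind to itself
            return a, b

rng = random.Random(42)
sizes = [1, 1, 10, 10]       # two free proteins and two large complexes
picks = [pick_pair_preferential(sizes, rng) for _ in range(2000)]
large_pair_hits = sum(set(p) == {2, 3} for p in picks)
```

In this toy pool the two large complexes are paired in most draws, whereas a size-independent choice as in PAD A would select every pair equally often.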
Mathematical Description
Since explicit simulation of an entire cell (50 million protein molecules were
simulated) is too time consuming for many applications of the model, we also
developed a mathematical description of the PAD model, which allows us to more
quickly assess different scenarios and parameter combinations.
The change of the number of complexes of size i, xi, during one time step Δt can
be described as
\[ \frac{\Delta x_i}{\Delta t} = G_i^a + G_i^d - L_i^a - L_i^d \qquad (1) \]
G_i^a and G_i^d : gains due to association and dissociation
L_i^a and L_i^d : losses due to association and dissociation
Mathematical Description
Given a total number of NC complexes, the total number of associations and
dissociations per time step are ka · NC^2 and kd · NC, respectively.
We assume throughout that we can calculate the mean number of associating or
dissociating complexes of size i per time step as
2 · ka · xi · NC and kd · xi.
The probability that complexes of size j and i-j get selected for one association is
\[ \frac{x_j}{N_C} \cdot \frac{x_{i-j}}{N_C - 1} \]
From this we deduce the number of complexes of size i that get created during each time
step via association of smaller complexes simply by summing over all complex
sizes that potentially create a complex of size i:
\[ G_i^a = k_a \, \frac{N_C}{N_C - 1} \Big( \sum_{j=1}^{i-1} x_j \, x_{i-j} - \delta_i \Big) \qquad (2) \]
where \delta_i is a small correction that applies only to even i.
Mathematical Description
When j is equal to i/2 (which is possible only for even i) both interaction partners
have the same size. The size of the pool x_{i-j} is therefore reduced by 1 after the
first interaction partner has been selected, which yields a small reduction of the
probability of selecting a second complex from that pool. We account for this
effect with the correction \delta_i, which only applies to even i:
\[ \delta_i = \begin{cases} x_{i/2} & \text{if } i \text{ is even} \\ 0 & \text{else} \end{cases} \]
This correction is usually very small.
The loss of complexes of size i due to association is simply proportional to the
probability of selecting them for association, i.e.
\[ L_i^a = 2 \, k_a \, x_i \, N_C \]
Mathematical Description
Complexes of size i get created by dissociation of larger complexes. A complex
of size j has
\[ N_j = 2^{j-1} - 1 \]
possible ways of dissociation and the number of possible fragments of size i is
\[ \binom{j}{i} \]
The probability that a dissociating complex of size j > i creates a fragment of size
i is hence
\[ \binom{j}{i} \Big/ N_j \]
The number of new complexes follows by summing over all possible parent sizes
\[ G_i^d = k_d \sum_{j>i} \frac{x_j}{N_j} \binom{j}{i} \]
The respective loss term becomes
\[ L_i^d = k_d \, x_i \]
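The two counts used in the dissociation terms can be checked by brute force for small complexes. The sketch below assumes a complex of j distinct proteins and enumerates every way of splitting it into two non-empty groups; the function names are hypothetical. Under that assumption the number of splits equals 2^(j-1) - 1 and the number of fragments of size i, summed over all splits, equals the binomial coefficient "j choose i".

```python
from itertools import combinations
from math import comb

def enumerate_splits(j):
    """All ways to split proteins 0..j-1 into two non-empty, unordered groups."""
    proteins = frozenset(range(j))
    splits = set()
    for size in range(1, j):
        for part in combinations(sorted(proteins), size):
            first = frozenset(part)
            splits.add(frozenset((first, proteins - first)))
    return splits

def count_fragments(splits, i):
    """Number of fragments of size i, summed over all splits."""
    return sum(1 for split in splits for fragment in split if len(fragment) == i)

splits_of_5 = enumerate_splits(5)
n_ways = len(splits_of_5)
fragment_counts = {i: count_fragments(splits_of_5, i) for i in range(1, 5)}
```

Each split produces exactly two fragments, so the fragment counts for all sizes add up to twice the number of splits.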
Mathematical Description
Figure S1 shows a comparison of a numerical solution of equation (1) with a
stochastic simulation of the association-dissociation process.
Mathematical Description
After a transient period a steady-state is reached. We are mainly interested in this
steady-state distribution of frequencies xi.
We therefore look for a set of xi that solves Δxi/Δt = 0 for all i.
The solution of this non-linear equation system is obtained by numerically
minimizing all |Δxi/Δt|.
By dividing equation (1) by kd it can be seen that the steady-state distribution is
independent of the absolute values of ka and kd, but it only depends on the ratio of
the two parameters Rad = ka / kd.
Hence, only two parameters affect the xi at steady-state:
- the total number of proteins NP (which indirectly determines NC) and
- the ratio of the two rate constants Rad.
Mathematical Description
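For PAD-model B the dissociation terms remain unchanged, whereas the association
terms have to be modified.
In case of PAD C we calculated weighted averages of results obtained with PAD A.
Assume that association is proportional to the product of the sizes of the
participating complexes. This assumption changes equation (2) to:
\[ G_i^a = k_a N_C^2 \, c \Big( \sum_{j=1}^{i-1} j \, (i-j) \, x_j \, x_{i-j} - \delta_i \Big), \qquad L_i^a = 2 \, k_a N_C^2 \, c \; i \, x_i \sum_{k=1}^{n} k \, x_k \]
with a normalization constant
\[ c = \Big( \sum_{k=1}^{n} \sum_{l=1}^{n} k \, l \, x_k \, x_l \Big)^{-1} \]
where n is the maximum complex size and
\[ \delta_i = \begin{cases} \frac{i^2}{4} \, x_{i/2} & \text{if } i \text{ is even} \\ 0 & \text{else} \end{cases} \]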
Measurable Size Distribution and Bait Selection
Based on the distribution resulting from equation (1) at steady-state we derive two further
distributions: (i) the ‘measurable size distribution’ and (ii) the ‘bait distribution’. The former is
defined as the frequency distribution of the measurable complex sizes.
The measurable complex size is the
number of different proteins in a
protein complex (as opposed to the
total number of proteins).
For the measurable size-distribution
we only count the number of
complexes with distinct protein
compositions.
Measurable versus ‘actual’ complex size distribution. Diamonds show frequencies of actual complex sizes and triangles are frequencies of measurable complexes. Filled diamonds and triangles reflect simulation without partitioning (PAD A) and open diamonds and triangles are simulation results assuming binding only within certain modules (PAD C). The difference between the original and the measurable complex size distribution is comparably small, because most of the simulated complexes are unique. However, in case of PAD C smaller complexes occur at higher copy numbers and larger complexes are often counted as smaller measurable complexes because they contain some proteins more than once.
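The distinction between actual and measurable size can be sketched in a few lines. The representation is an assumption: each complex is a tuple of protein names, a protein may occur more than once, and two complexes count as the same measurable complex when they contain the same set of different proteins.

```python
from collections import Counter

def measurable_size(complex_members):
    """Number of different proteins in a complex (copy numbers are ignored)."""
    return len(set(complex_members))

def measurable_size_distribution(complexes):
    """Frequency of measurable sizes, counting each distinct composition once."""
    distinct = {frozenset(members) for members in complexes}
    return Counter(len(composition) for composition in distinct)

# Hypothetical toy complexes: the first two have the same composition, and the
# third contains protein A twice.
complexes = [("A", "B"), ("B", "A"), ("A", "A", "C"), ("D", "E", "F")]
distribution = measurable_size_distribution(complexes)
```

Here the complex ("A", "A", "C") has actual size 3 but measurable size 2, which is the effect described for PAD C above.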
Computation of a Dissociation Constant KD
Mathematically our model describes a reversible (bio-)chemical reaction.
We can therefore calculate an equilibrium dissociation constant KD, which quantifies the fraction
of free subcomplexes A and B compared to the bound complex AB.
This equilibrium is complex size dependent, because a large complex AB is less
likely to randomly dissociate exactly into the two specific subunits A and B than a
small complex. (A and B can be ensembles of several proteins.)
We get for any given complex of size i the following KD:
KD(i) = [A][B] / [AB] = (Rad · Ni · V)^-1 (4)
where Ni is the number of possible fragments of a complex of size i and V is the
cell volume. Cell-wide averages of KD-values are estimated by computing a
weighted average
\[ \bar{K}_D = \sum_i K_D(i) \, \frac{x_i}{N_C} \]
with NC being the total number of complexes and xi being the number of
complexes of size i.
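A small sketch of equation (4) and of the weighted average. It assumes that Ni = 2^(i-1) - 1, the number of ways to split a complex of i distinct proteins into two parts, and it skips free proteins (size 1) in the average because they cannot dissociate. The function names and the unit choices (Rad = 1, V = 1) are illustrative, so the numbers are in arbitrary units rather than molar concentrations.

```python
def n_fragmentations(i):
    """Assumed number of ways a complex of i distinct proteins can split in two."""
    return 2 ** (i - 1) - 1

def kd_of_size(i, r_ad, volume):
    """Size-dependent dissociation constant KD(i) = 1 / (Rad * Ni * V)."""
    return 1.0 / (r_ad * n_fragmentations(i) * volume)

def mean_kd(x, r_ad, volume):
    """Weighted average of KD(i); x maps complex size -> number of complexes."""
    n_c = sum(x.values())
    return sum(kd_of_size(i, r_ad, volume) * x_i
               for i, x_i in x.items() if i >= 2) / n_c

kd_values = [kd_of_size(i, r_ad=1.0, volume=1.0) for i in range(2, 8)]
average = mean_kd({2: 10, 3: 10}, r_ad=1.0, volume=1.0)
```

The list `kd_values` decreases with size, roughly halving with every added protein, which is the size dependence discussed in the text.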
Biochemical Interpretation of the Rate Constants
The process of forming a protein complex AB from the two subcomplexes A and
B, and its dissociation can be described as a reversible reaction:
\[ A + B \rightleftharpoons AB \]
with constants kon [L/(mol s)] and koff [1/s] quantifying the forward and backward
reactions:
\[ \frac{d[AB]}{dt} = k_{on} [A][B] - k_{off} [AB] \]
In our model the concentration [A] can be calculated as
\[ [A] = f_A \, \frac{N_C}{V} \]
with fA being the fraction of species A among all NC complexes in the system and
V being the cell volume.
Biochemical Interpretation of the Rate Constants
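The number of associations of two complex-species A and B per time step
becomes
\[ \frac{N^{assoc}_{A,B}}{V} = \frac{k_a N_C^2}{V} \cdot \frac{n_A \, n_B}{N_C \, (N_C - 1)} \approx \frac{k_a N_C^2}{V} \, f_A f_B = k_a V \, [A][B] \]
since we assume ka·NC^2 many associations per time step.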
Here, nA and nB are the number of complexes of the respective species.
Division by the cell-volume V yields units of ‘concentration per time’.
Thus, kon in a biochemical reaction approximately equals ka ·V, since the total
number of complexes NC is very large in all scenarios that we have simulated.
Biochemical Interpretation of the Rate Constants
When looking for an equivalent expression for koff we have to quantify the specific
dissociation of a complex AB into the subcomplexes A and B.
The unspecific dissociation of AB is simply kd ·[AB],
kd : dissociation rate constant.
Since AB may consist of > 2 proteins it can also be split into subcomplexes other
than A and B. For the specific dissociation rate, one has to know how often AB
actually dissociates into the subcomplexes A and B.
The total number of dissociations per time step is kd · NC. The probability that a
complex AB with size i breaks into the specific sub-complexes A and B is 1/Ni,
Ni : number of possible fragments of a complex of size i.
This holds under the assumption that all proteins in AB are distinct, which is
approximately true for the simulations conducted here.
Biochemical Interpretation of the Rate Constants
nAB/NC : fraction of complexes AB among all complexes
size specific dissociation rate N^{dissoc}_{A,B}(i):
\[ \frac{N^{dissoc}_{A,B}(i)}{V} = \frac{k_d N_C}{V} \cdot \frac{n_{AB}}{N_C} \cdot \frac{1}{N_i} = \frac{k_d}{N_i} \, [AB] \]
from which the complex size dependent rate constant koff(i) = kd/Ni results.
Taking into account that certain proteins may be in the complex more than once
we get koff ≥ kd/Ni.
One can calculate an apparent equilibrium constant KD, which describes the
equilibrium between the independent species A and B and the bound species AB:
\[ K_D(i) = \frac{[A][B]}{[AB]} = \frac{k_{off}}{k_{on}} = \frac{k_d}{k_a \, N_i \, V} \]
where i is the size of the complex AB.
Since Ni is exponentially increasing with i, KD is exponentially decreasing
with complex size.
Results
We dynamically simulated the association and
dissociation of 6200 different protein types
yielding a set of about 50 million protein
molecules. Subsequently we analyzed the
resulting steady-state size distribution of
protein complexes. This steady-state is
thought to reflect the log-growth conditions
under which the yeast cells were held when
TAP-measuring the protein complexes (Gavin
et al., 2002). Based on measured protein
complex data (Gavin et al., 2002) we
calculated a protein complex size distribution
to which we can compare the simulation
results (Figure 1).
Results
The TAP measurements do not provide concentrations of the measured
complexes, but they only demonstrate the presence of a certain protein complex
in yeast cells. In addition, also the number of proteins of a certain type inside
such a complex could not be measured. Hence, the complex size from Figure 1
does not represent real complex sizes (i.e. total number of proteins in the
complex), but it refers to the number of different proteins in a complex.
The measured data reflect the characteristics of only 229 different protein
complexes of size two and larger, which is just a small subset of the
‘complexosome’. These peculiarities have to be taken into account when
comparing simulation results to the observed complex size distribution. We refer
to the ‘measurable complex size’ as the number of distinct proteins in a protein
complex (Figure 2). When comparing our simulation results to the measurements,
we always select a random-subset of 229 different complexes from the simulated
pool of complexes. This results in a complex size distribution comparable to the
measured distribution from Figure 1 (‘bait distribution’).
Effect of preferential attachment
Cumulative number of distinct protein
complexes versus their size, resulting
from simulations without (diamonds) and
with (squares) preferential attachment to
larger complexes. Both simulations are
performed with the best fit parameters
for PAD A. In case of preferential
attachment the best regression result
(solid line) is obtained with a power-law,
while the simulation without preferential
attachment is best fitted assuming an
exponentially decreasing curve. The
original, measurable and bait
distributions are always close to
exponential in case of PAD A and
power-law like in case of PAD B,
independent of the parameters chosen.
The PAD B model gives a power-law distribution, which is not in agreement with experimental observation.
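One simple way to tell the two curve shapes apart is to compare straight-line fits on transformed axes: an exponential is linear in (size, log count), a power law is linear in (log size, log count). The sketch below uses this log-transform comparison as an assumption; the regression method behind the figure is not stated on the slide. The two data series are synthetic and only mimic the shapes, they are not simulation output.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def compare_fits(sizes, counts):
    """Return (R^2 of the exponential fit, R^2 of the power-law fit)."""
    log_counts = [math.log(c) for c in counts]
    r2_exponential = r_squared(sizes, log_counts)
    r2_power_law = r_squared([math.log(s) for s in sizes], log_counts)
    return r2_exponential, r2_power_law

sizes = list(range(2, 21))
exp_like = [229 * math.exp(-0.3 * (s - 2)) for s in sizes]   # synthetic, PAD A-like shape
pow_like = [229 * (s / 2) ** -2.0 for s in sizes]            # synthetic, PAD B-like shape

fit_exp_like = compare_fits(sizes, exp_like)
fit_pow_like = compare_fits(sizes, pow_like)
```

For the exponential-like series the first R^2 is the larger one, for the power-law-like series the second.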
Conclusions
- A very simple, dynamic model can reproduce the observed complex size
distribution. Given the small number of input parameters the very good fit of the
observed data is astonishing.
Conclusion 1: preferential attachment does not take place in yeast cells under
the investigated conditions. This is biologically plausible: Specific and strong
binding can be just as important for small protein complexes as for large
complexes.
Moreover, the dissociation should on average be independent of the complex size. The
interpretation of the simulated association and dissociation in terms of KD-values
suggests that larger complexes bind more strongly than smaller complexes.
However, the size dependence of KD is compensated by the higher number of
possible dissociations in larger complexes.
We always assume that all possible dissociations happen with the same
probability. In reality large complexes may break into specific subcomplexes,
which subsequently can be re-used for a different purpose.
Improved versions of the model should account for specificity of association
and for specific dissociation.
Conclusions
Conclusion 2: the number of complexes that were missed during the TAP
measurements is potentially large. Simulations give an upper limit of the number
of different complexes in cells.
At first glance, the number of different complexes in PAD A (> 3.5 million) and
PAD C (~ 2 million) may appear to be far too large. Even PAD C may overestimate
the true number of different complexes, because association within the groups is
unrestricted. However, the PAD-models do not only simulate functional, mature
complexes, but they also consider all intermediate steps. Each of these steps is
counted as a different protein complex. The large difference between the number
of measured complexes and the (potential) number of existing complexes may
partly explain the very small overlap that has been observed between different
large scale measurements of protein complexes.
A correct interpretation of the kinetic parameters is important. First of all, ka and kd
cannot be compared to measured values, because the model does not define a length of
the time steps for interpreting ka and kd as actual rate constants. In addition, the
association-to-dissociation ratio Rad is not identical to a physical KD-value obtained by in vitro
measurements of protein binding in aqueous solution.
Discussion
Several reasons do not allow for this simple interpretation:
(i) In vivo diffusion rates are below those in water due to the high concentration of
proteins and other large molecules in the cytosol.
(ii) Most proteins either are synthesized where they are needed or they get
transported directly to the site where the complex gets compiled. Hence,
transport to the site of action is on average faster than random diffusion.
(iii) Protein concentrations are often above the cell average due to the
compartmentalization of the cell.
All these processes (protein production,
transport, and degradation) are not explicitly described in the PAD-model, but
they are lumped in our assumptions.
The Rad (and the KD derived from it) must therefore be interpreted as an
operationally defined property. It characterizes the overall, cell averaged complex
assembly process, which includes all steps necessary to synthesize a protein
complex.
Discussion
However, even the model-derived KD values allow for some conclusions regarding
complex formation. We calculated weighted averages of the size-dependent
KD values by using the steady-state complex size distribution of the best fit.
This yields average KD values of 4.7 nM and 0.18 nM for the best fits of PAD A and
PAD C, respectively. First, the fact that the KD for PAD C is below that of PAD A
underlines the notion that more specific binding is reflected by smaller KD values.
Second, typical in vitro KD values are above 1 nM, thus the average KD of PAD C
is comparatively low.
The model therefore confirms that protein complex formation in vivo gets
accelerated due to directed protein transport and due to the compartmentalization
of eukaryotes. It is a surprising finding though, that important aspects of these
highly regulated protein synthesis and transport processes can on average be
described by a simple compartment model assuming random association and
dissociation.
Large scale protein-protein interaction data sets are subject to
substantial error, resulting in a potentially large number of false positives and false
negatives.
Possible Limitations
In order to get a correct picture of the protein complex size distribution it is
necessary to have an unbiased, random subset of all complexes in the cells.
TAP data are biased, e.g. contain too few membrane proteins. However, if
compared to other data sets such as MIPS complexes, the TAP complexes
constitute a fairly random selection of all protein complexes in yeast.
Uncertainties in the TAP data do not affect our conclusions as long as they are
not strongly biased with respect to the resulting complex size distribution.
Since Gavin et al. (2002) have measured long-term interactions, our results apply
to permanent complexes. Yet the model is applicable to future protein complex
data that take account of transient binding.
Discussion
The simulated complex size distribution is almost independent of the assumed
protein abundance distribution.
PP is a valuable summarizing property that can be used to characterize
proteomes of different species. A decreasing PP increases the number of different
large complexes (the slope in Table 1 gets more shallow), because it is less likely
that a large complex contains the same protein twice.
Thus, PP is a measure of complexity that not only relates to the diversity of the
proteome but also to the composition of protein complexes.
Probably the most severe simplification in our model is the assumption that all
proteins can potentially interact with each other. The PAD-model C is a first step
towards more biological realism. By restricting the number of potential interaction
partners it more closely maps functional modules and cell compartments, which
both restrict the interaction among proteins.
Further improvements
The partitioning in PAD C implies that proteins within one group exhibit very
strong binding, whereas binding between protein groups is set to zero. This again
is a simplification, since cross-talk between different modules or compartments is
possible.
Future extensions of the model could incorporate more and more detailed
information about the binding specificity of proteins. Assuming even more specific
binding will further reduce the number of different complexes, whereas the
frequency of the complexes will increase. High binding specificity potentially
lowers the complex sizes, so Rad has to be increased in order to fit the
experimentally observed protein complex size distribution.
On the other hand, cross talk gives rise to larger complexes. Taking both
counteracting refinements into account, it is impossible to generally predict the
best-fit Rad, since it depends on the quantitative details.
Further improvements
A first additional refinement of PAD C could account for the observed clustering
of protein interaction networks.
In a second step one could simulate protein associations and dissociations
according to predefined binary protein interactions.
A most detailed model could additionally account for individual
association/dissociation rates between individual proteins. Such extensions will
yield more realistic figures about the number of different protein complexes
created in yeast cells.
However, starting model development with the simplest assumptions reveals
the most important characteristics of the system for reproducing the observations.
The very good match that we already obtain with the simplest model PAD A is
striking.
Measurable Size Distribution and Bait Selection
In order to determine the distribution of measurable complex sizes corresponding
to the steady-state distribution, we create a set of complexes according to the
original steady-state size distribution by randomly ‘filling’ the complexes with
proteins from the protein abundance distribution. We then compute the resulting
measurable size distribution. Results shown are averages of several of such
random sets.
The bait distribution is used to compare the simulation to the actual TAP
measurements. The bait distribution is obtained by randomly selecting and
subsequently analyzing a subset from all simulated complexes. We call that
distribution ‘bait distribution’, because the process of selecting a subset of all
complexes corresponds to selecting bait proteins for pulling out the complexes
during the measurements. We always select 229 different complexes, which is
the number of TAP complexes to which we
compare the simulations.
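The bait selection can be sketched as a random sample of distinct complexes followed by a cumulative size count. The data layout (complexes as tuples of integer protein ids), the function name and the randomly generated toy pool are assumptions for illustration; only the sample size of 229 comes from the text. "Cumulative" is taken to mean the number of sampled complexes with at least the given measurable size.

```python
import random
from collections import Counter

def bait_distribution(complexes, n_baits, rng):
    """Cumulative measurable-size counts for a random subset of distinct complexes."""
    distinct = list({frozenset(members) for members in complexes})
    chosen = rng.sample(distinct, min(n_baits, len(distinct)))
    counts = Counter(len(composition) for composition in chosen)
    cumulative = {}
    running = 0
    for size in range(max(counts), 1, -1):   # accumulate from the largest size down
        running += counts.get(size, 0)
        cumulative[size] = running
    return cumulative

# Toy pool of simulated complexes with 2 to 6 different proteins each.
rng = random.Random(1)
pool = [tuple(rng.sample(range(500), rng.randint(2, 6))) for _ in range(1000)]
cumulative = bait_distribution(pool, 229, rng)
```

By construction the cumulative count at size 2 equals the number of sampled complexes, which is how the measured distribution starts at 229.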
Only the Probability of Selecting the Same Protein Twice Matters
In a real protein abundance distribution each gene can potentially have a different
expression level. However, for our model it is sufficient to use averaged
properties of the protein abundance distribution.
Each protein complex can be viewed as a set of proteins that are independently
drawn from the pool of available proteins, because we assume that association is
independent of the particular proteins. The measurable complex size is the actual
complex size reduced by the number of proteins that are drawn twice or more
often. The measurable complex size is on average only affected by the probability
PP of pulling the same protein twice out of the pool of available
proteins.
This probability can be calculated by using the fact that (ni – 1)/ (NP – 1) is the
probability of selecting a second protein of the specific type i:
\[ P_P = \sum_{i=1}^{k} \frac{n_i}{N_P} \cdot \frac{n_i - 1}{N_P - 1} \approx \frac{1}{N_P^2} \sum_{i=1}^{k} n_i^2 \]
where NP is the total number of proteins in cell, k is the number of protein types
(or genes), and ni is the number of molecules of the respective protein type.
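As a worked example, PP can be computed directly from a list of copy numbers. The function name and the toy abundance lists are assumptions; the formula is the exact sum over protein types of (ni / NP) · (ni - 1) / (NP - 1).

```python
def same_protein_probability(abundances):
    """Probability PP that two molecules drawn without replacement have the same type."""
    n_p = sum(abundances)   # total number of protein molecules NP
    return sum(n * (n - 1) for n in abundances) / (n_p * (n_p - 1))

pp_two_types = same_protein_probability([2, 2])      # two types, two copies each
pp_all_unique = same_protein_probability([1] * 10)   # every molecule is a different type
pp_skewed = same_protein_probability([97, 1, 1, 1])  # one dominant type
```

With two copies each of two protein types, the second draw matches the first in one of the three remaining molecules, so PP is 1/3. A pool of unique molecules gives PP = 0, and a pool dominated by one type gives a PP close to 1.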
Computation of a Dissociation Constant KD
Theoretically one would have to separately account for proteins that are present 3
times or more often in one complex. However, the likelihood that a protein gets
selected for a third time is orders of magnitude below the probability to select a
protein twice. Therefore, the few cases when a protein gets selected more often
than twice are negligible. We have verified this statement by corresponding
simulations.
Additionally, the measurable complex size distribution counts only distinct
complexes, i.e. we have to reduce the complex frequency by those complexes
that appear more than once. In model PAD A, which assumes one large
compartment, it is very unlikely that a complex of size 3 appears more than once
due to the heterogeneity of the protein pool. Hence, in this case mainly the
frequency of complexes with size 2 (x2) is affected by counting only distinct
complexes. This x2 in turn depends on PP. Since model PAD C assumes more
homogeneous interacting groups, the likelihood to find larger complexes more than
once increases. However, the larger homogeneity is reflected by a larger PP. In
summary, for the scenarios that we have simulated PP proved to be a very good
descriptor of the protein abundance distribution.