Post on 19-Dec-2015
Random sequence-matching model for emergent gene-regulatory
networks
Ayşe Erzan
Istanbul Technical University, Gürsey Institute,
Collegium Budapest
Duygu Balcan (İTÜ) Muhittin Mungan (BÜ)
Alkan Kabakçıoğlu (Padova) Ayşe H. Bilge (İTÜ)
Yasemin Şengün (İTÜ)
Outline
• Random and “real” networks
• “central dogma” of gene regulation
• RNA interference and more
• sequence matching model for gene regulatory networks
• simulations and analytical results
• comparison with experiments
• outcomes of similar models
“Classical” random networks
Erdős and Rényi, Publ. Math. Inst. Hung. Acad. Sci. 5, 17 (1960)
$N$ vertices
$N(N-1)/2$ possible connections, each present with probability $p$
• “degree distribution”
Poissonian for large $N$
$P(k) \sim e^{-z} z^{k} / k!$
$z = \langle k \rangle = pN$, $z_c = 1$
• Average minimal path length
$l_{ER} \sim \ln N / \ln z = (1 + \ln p / \ln N)^{-1}$
• Clustering coefficient
$C_{ER} = z/N = p$
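The two relations $z = pN$ and $C_{ER} = p$ can be checked numerically. The following is a minimal sketch, not taken from the talk: the function names, the seed and the values of `N` and `P` are illustrative assumptions.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Adjacency sets of an undirected graph where each pair is linked with probability p."""
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def mean_degree(adj):
    return sum(len(nb) for nb in adj) / len(adj)

def mean_clustering(adj):
    """Average of C_i = 2 E_i / [k_i (k_i - 1)] over nodes with k_i >= 2."""
    values = []
    for nb in adj:
        k = len(nb)
        if k < 2:
            continue
        # E_i: number of links among the neighbours of node i
        links = sum(1 for u, v in combinations(nb, 2) if v in adj[u])
        values.append(2 * links / (k * (k - 1)))
    return sum(values) / len(values)

rng = random.Random(0)          # fixed seed, illustrative
N, P = 400, 0.05                # illustrative values
graph = erdos_renyi(N, P, rng)
z = mean_degree(graph)
c = mean_clustering(graph)
print(f"<k> = {z:.2f} (pN = {P * N:.2f}), C = {c:.3f} (p = {P})")
```

For a single realisation the measured values fluctuate around $pN$ and $p$; the agreement improves with $N$.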
Random Networks
Probability of a connection between any two nodes is the same, $p$
Each of the $N$ nodes has an average number $Np$ of connections
“Small world” property
Distances between nodes grow very weakly with $N$
Most highly connected nodes directly reach 25% of the rest
Naturally occurring networks
• Social and economic networks
• Citation and collaborative networks
• Technological networks
• WWW, communications networks
• Biological networks: neural networks, food networks, co-evolutionary networks, genomic networks
R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002); S.N. Dorogovtsev and J.F.F. Mendes, Adv. Phys. 51, 1079 (2002)
“Real Networks”
A considerable number of very highly connected nodes
Their first neighbors make up 60% of the total
Most frequent are nodes with very few connections (1)
Small world!
“Small world” / scale-free networks
• High clustering coefficient
$\langle C_i \rangle = \langle 2E_i / [k_i (k_i - 1)] \rangle > C_{ER} = z/N$
• Short average minimum path length $\langle l_{min} \rangle$
(comparable to an ER network for the same $C$ and $N$; differs from regular lattices)
• Scale-free degree distribution
$P(k) \sim k^{-\gamma}$, cutoff $k_c$
A realisation:
Barabási-Albert model of “preferential attachment”
growing network in which the probability of attaching a new edge to vertex $i$ is $\sim k_i$
$P(k) \sim k^{-3}$ (exact)
(Models with preferential attachment 2)
gene regulation networks - transcription regulatory networks
the “central dogma”
DNA
promoter1 gene1 promoter2 gene2 promoter3 gene3
transcription
RNA mRNA chain
amino acid Transcription Factors
translation
Proteins (structural and regulatory)
Ribosome tRNA rRNA
Adapted from Alvis Brazma, www.ebi.ac.uk/microarray/research/networks/genetics
Transcription regulatory
network of protein interactions in E. coli
(from S. Maslov)
Data from Regulon Database
606 interactions
424 operons
Out degree $1 < k_{out} < 85$, broader
In degree $1 < k_{in} < 6$
Transcription regulatory
network of protein interactions
in Homo Sapiens
data obtained from literature search
1449 regulations
689 proteins
$k_{out} < 96$
$k_{in} < 40$
(from S. Maslov)
$n(k_{out}) \sim k_{out}^{-2.5}$
$\gamma = 2.5$
•A. Wagner, Mol. Bio. Evol. 18, 1283 (2001).
Duplication and divergence of genes - interaction
between their regulatory proteins
for a review of properties see:
R.V. Sole and R. Pastor Satorras, in Handbook of Graphs and Networks (Bornholdt and Schuster eds., Wiley-VCH, Berlin 2002)
Previous “wisdom”
• out-degree distribution scale free with $\gamma = 2.5$ !!?
• A. Wagner, Mol. Bio. Evol. 18, 1283 (2001); Jeong, Mason, Barabasi, Oltvai, Nature 411, 41 (2001); Maslov and Sneppen, Science 296, 910 (2002)
• narrower in-degree distribution than out-degree distribution
• small world with non-classical clustering
Transcription Regulatory genomic and
Protein Interaction Networks (interactions between regulatory proteins)
RNA interference
New paradigm? Post Transcriptional Gene Suppression (PTGS)
“RNA can bind directly on similar DNA sequences
and silence genes at the transcriptional stage”
Watson-Crick base pairing between nucleic acids
DNA – Adenine, Guanine , Thymine, Cytosine A-T C-G
RNA - Adenine, Guanine , Uracil, Cytosine A-U C-G
stabilisation, replication and transcription of DNA
RNA interference (siRNA binding to mRNA or chr. DNA)
binding of regulatory proteins on to mRNA
Basic mechanism of (lock-and-key combinations) :
sequence matching
• D. Balcan, AE, Eur. Phys. J. B 38, 253 (2004);
• M. Mungan, A. Kabakcioglu, D. Balcan, AE, q-bio.MN/0406049
• three-dimensional architecture (secondary structure) also sequence dependent
- amino-acid recognition by tRNA
- amino-acid binding by rRNA in Ribosome
- binding of transcription factors to promoter regions
Greater generality for modeling genomic interactions?
Stay tuned!
10010101112110111012100121020011000010101000210110101011010201120010111011022
Modeling the “chromosome”: a random sequence of 0, 1, 2 (genes $G_i$, $G_j$ marked on the sequence)
$x_i = 0, 1$: coding regions, each with probability $(1-p)/2$
$x_i = 2$: start/stop signs for “genes,” probability $p$
string $G_i = \{x_{i1}, x_{i2}, \ldots, x_{il}\}$ with all $x_i \ne 2$: a gene
$l$ = length of gene, $\langle l \rangle = (1-p)/p$, $\langle n(l) \rangle = L p^2 (1-p)^{l}$
$L$ = length of chromosome; $l = 0$: a null gene
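A minimal sketch of this construction, assuming a Python string representation of the chromosome; `random_chromosome`, `extract_genes`, the seed and the values of `L` and `p` are illustrative names and choices, not the authors' code. It compares the measured mean gene length with $(1-p)/p$.

```python
import random

def random_chromosome(length, p, rng):
    """Random string over {0,1,2}: '2' with probability p, '0' or '1' with (1-p)/2 each."""
    return "".join(
        "2" if rng.random() < p else rng.choice("01") for _ in range(length)
    )

def extract_genes(chromosome):
    """Runs of 0/1 delimited by a '2' on both sides; empty runs are null genes."""
    parts = chromosome.split("2")
    return parts[1:-1]  # drop the two end pieces, which lack a delimiter on one side

rng = random.Random(1)        # fixed seed, illustrative
L, p = 200_000, 0.05          # illustrative values
genes = extract_genes(random_chromosome(L, p, rng))
mean_len = sum(map(len, genes)) / len(genes)
print(len(genes), "genes; mean length", round(mean_len, 2), "vs (1-p)/p =", (1 - p) / p)
```

The number of genes comes out close to $Lp$, which is what summing $L p^2 (1-p)^l$ over all $l$ gives.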
emergent gene expression networks?
sequence matching → gene regulation? Model connectivity matrix of genomic network:
$w_{ij} = 1$ iff the string $G_i$ is embedded inside the string $G_j$ ($G_i \subset G_j$, $l_i \le l_j$)
$w_{ij} = 0$ otherwise.
1101
interference(suppression)
kout = 2
1101 2011000101201000110211 11011101 201010 1122
kin=1
directed
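The connectivity rule can be sketched in a few lines. This is an illustration under stated assumptions: genes are Python strings, self-links are excluded, and null genes (length 0) are given no out-edges; the helper names and the four example genes are hypothetical.

```python
def build_network(genes):
    """w[i][j] = 1 iff gene i occurs as a contiguous substring of gene j (i != j)."""
    n = len(genes)
    w = [[0] * n for _ in range(n)]
    for i, gi in enumerate(genes):
        for j, gj in enumerate(genes):
            # assumption: null genes and self-matches do not create edges
            if i != j and gi and len(gi) <= len(gj) and gi in gj:
                w[i][j] = 1
    return w

def degrees(w):
    """Out-degree = row sums, in-degree = column sums of the directed matrix."""
    k_out = [sum(row) for row in w]
    k_in = [sum(col) for col in zip(*w)]
    return k_out, k_in

genes = ["1", "101", "0101", "11"]   # hypothetical example genes
w = build_network(genes)
k_out, k_in = degrees(w)
print(k_out)  # [3, 1, 0, 0]
print(k_in)   # [0, 1, 2, 1]
```

Short genes collect many out-edges and long genes many in-edges, which is the asymmetry the directed network is built on.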
Simulations: clustering coefficient
$C_i = 2E(i) / \{k(i)[k(i)-1]\}$
= number of edges connecting nearest neighbours / total number of possible connections,
for incoming or outgoing bonds to the site $i$
$\langle C_{out} \rangle = 0.034$
$\langle C_{in} \rangle = 0.648$
$\langle C \rangle = 0.534 = \langle z \rangle / \langle s \rangle$
non-classical behaviour
giant cluster breaks up for $p < p_c(L)$
($Lp$ = frequency of stop-start signs): $N$ (number of genes) too small, genes too long
exponent ~ -3/4 (preliminary)
“percolation threshold” pc :
“extremely small world” networks!
cluster radius =average minimum path length
directed edges (in or out) lmin=1 (transitivity!)
undirected edges
$l_{min} = 1$, $l_{max} \le 4$: 11111 - 1 - 001101 - 0 - 00000
$\langle l_{min} \rangle$ depends very weakly on $p$ for fixed $L$
$p_c < p < 1/2$: most genes of length unity
$l_{min}$ undefined for $p < p_c(L)$
$L = 15000$: $\langle l_{min} \rangle \sim 1.66$; $\langle l_{min} \rangle \to 1.87$ as $p \to p_c$
random point mutations
• $x \in \{0, 1\}$: $x \to (x + 1) \bmod 2$
• $x = 2$: $x_i \leftrightarrow x_{i \pm 1}$, random walk steps taken by STOP and GO signs
long range modifications due to change in reading frame
simulations: network robust under random mutations
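A single point mutation could be sketched as follows. The bit flip for coding elements is as stated on the slide; the rule for a start/stop sign (swapping places with a randomly chosen neighbour) is an assumed reading of the "random walk" remark, and `mutate` is a hypothetical helper name.

```python
import random

def mutate(chromosome, rng):
    """Apply one random point mutation and return the new list of symbols."""
    seq = list(chromosome)
    i = rng.randrange(len(seq))
    if seq[i] in (0, 1):
        seq[i] = (seq[i] + 1) % 2            # flip a coding bit
    else:
        # assumed rule: the start/stop sign steps left or right by swapping
        # places with a neighbour (no move if that would leave the sequence)
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(seq):
            seq[i], seq[j] = seq[j], seq[i]
    return seq

rng = random.Random(3)                        # fixed seed, illustrative
chromosome = [1, 1, 0, 1, 2, 0, 1, 1, 2, 1]   # hypothetical example
mutant = mutate(chromosome, rng)
print(chromosome, "->", mutant)
```

Under this rule the number of start/stop signs is conserved, but moving one of them changes the lengths of both adjacent genes, which is the long-range effect of a shifted reading frame.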
peaks ~ geometrically spaced for kout small (log-periodic) ~ periodic for kout large
last peak - the size distribution of the giant cluster (single bit genes connect to almost all others)
Degree distribution: preliminary simulation results
Maxima of the peaks: $n_m \sim k_{out}^{-\gamma}$
$\gamma \approx 0.9$ for small $k$
$\gamma \approx 0.4$ for large $k$
no double scaling for $p = 0.05$
0.45
0
$n(k) \sim k^{-\gamma}$
(1.1 , 1.8)
Simulation results: Crossover in the scaling behaviour
of the degree distribution
dc
__ analytical
° simulation
1. The matching probabilities
Probability of a given string of length $l$ to be reproduced in a randomly chosen string of length $k$, for an alphabet of $r$ letters:
$p(l, k) = 1 - (1 - r^{-l})^{k-l+1} \approx r^{-l}(k-l+1)$ for $l$ large, neglecting correlations between overlaps
$r^{-l}$: inverse of the number of $l$-strings with $r$ letters
$(k-l+1)$: number of shifted $l$-substrings in a $k$-string (1 for $k = l$)
very good approximation for $r^{-l}(k-l) < 1$
Analytical calculations
Computing the matching probabilities
strings $x$ of length $l$ and $y$ of length $k \ge l$
$y_{a,l}$ = substring of $y$ of length $l$ that has been shifted by $a$
$U(x, y_{a,l})$ = Hamming distance between $x$ and $y_{a,l}$ ($U = 0$: match; $U \ne 0$: no match)
$1 - f_a(x, y; \beta) = 1 - \exp[-\beta U(x, y_{a,l})] \to 0$ or $1$ for $\beta \to \infty$ (counts no-matches)
$p(l, k; x, \beta) = 1 - (\text{number of no-matches}) / r^{k}$, summed over $y$
$p(l, k; x, \beta) = 1 - r^{-k} \sum_{y} \prod_{a=0}^{k-l} [1 - f_a(x, y; \beta)]$ (the product survives only if there is no match for any shift $a$)
Cluster expansion. Do the $x$ averages; 2-point averages over the $f$ factorise; approximate all higher orders with factorised ones.
For $k \ge l$ get
$p(l, k) = 1 - (1 - z^{l})^{k-l+1}$, $z = [1 + (r-1)e^{-\beta}] / r$
$p(l, k) \approx z^{l}(k-l+1)$ for $l$ large
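The factor $z$ can be traced to a single-letter average. The following worked step is a sketch, assuming the Hamming distance is written as a sum over positions and that the letters of $y$ are independent and uniformly distributed over the $r$-letter alphabet:

```latex
U(x, y_{a,l}) = \sum_{i=1}^{l} \bigl(1 - \delta_{x_i,\, y_{a+i}}\bigr)
\quad\Longrightarrow\quad
\bigl\langle e^{-\beta U(x, y_{a,l})} \bigr\rangle_{y}
= \prod_{i=1}^{l} \Bigl[ \tfrac{1}{r}\cdot 1 + \tfrac{r-1}{r}\, e^{-\beta} \Bigr]
= z^{l},
\qquad
z = \frac{1 + (r-1)\,e^{-\beta}}{r}.
```

Each position matches with probability $1/r$ (weight 1) and mismatches with probability $(r-1)/r$ (weight $e^{-\beta}$). Treating the $k-l+1$ shifts as independent then gives $1 - (1 - z^{l})^{k-l+1}$, and $\beta \to \infty$ recovers $z = 1/r$, the exact-matching case.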
Matching probability for $r = 2$
$p(l, k) = 1 - (1 - 2^{-l})^{k-l+1} \approx 2^{-l}(k-l+1)$ for $l \le k$; $0$ otherwise
[Figure: $p(l, k)$ versus $l$; curves for embedding strings of length $k = 16, 14, 12, 10, 8, 6, 4, 2$ from top to bottom, $k \ge l$; points: exact enumeration; line: the above expression]
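The comparison between exact enumeration and the closed expression can be reproduced for small $k$ by brute force. A minimal sketch, assuming the exact value is the fraction of all (short string, long string) pairs in which the short string occurs; `p_formula` and `p_exact` are illustrative names.

```python
from itertools import product

def p_formula(l, k, r=2):
    """Independent-shift approximation 1 - (1 - r^-l)^(k-l+1); zero if l > k."""
    if l > k:
        return 0.0
    return 1.0 - (1.0 - r ** (-l)) ** (k - l + 1)

def p_exact(l, k, r=2):
    """Exact enumeration over all x of length l and all y of length k (r <= 10)."""
    if l > k:
        return 0.0
    alphabet = "0123456789"[:r]
    xs = ["".join(t) for t in product(alphabet, repeat=l)]
    hits = 0
    for t in product(alphabet, repeat=k):
        y = "".join(t)
        hits += sum(1 for x in xs if x in y)   # x counted once per y, however often it occurs
    return hits / (r ** l * r ** k)

for l in range(1, 9):
    print(l, round(p_exact(l, 8), 4), round(p_formula(l, 8), 4))
```

The two agree exactly for $l = 1$ and $l = k$; in between, the small differences come from the neglected correlations between overlapping shifts.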
2. Understanding the sequence matching data
Matching $l$ with $d$: long “genes” have small degree
number of out-edges from a randomly chosen gene of length $l$ to genes of length $k$:
$X_{lk} = \sum_{\alpha} X_{lk}(\alpha)$
$\alpha$: different realisations of genes of length $k$
$X_{lk}(\alpha)$: independent random variables, binomially distributed with probability $p(l, k)$;
Poisson for small $p(l, k)$, i.e. large $l$
total number of out-edges from a randomly chosen gene of length $l$:
$X_l = \sum_k X_{lk}$, Gaussian distributed via the Central Limit Theorem,
with mean $\langle X_l \rangle$ and variance $\langle X_l^2 \rangle - \langle X_l \rangle^2$
$X_l$ Poisson for large $l$
3. Calculation of the out-degree distribution
• mean out-degree for genes of length $l$, for the model with exponential gene length distribution
$\langle n(k) \rangle = L p^2 q^{k}$, $q = 1 - p$: probability of a coding element
$d_l = \langle X_l \rangle = \sum_{k \ge l} \langle X_{lk} \rangle = \sum_{k \ge l} p(l, k) \langle n(k) \rangle$
$= Lp\,(qz)^{l} / (p + q z^{l}) \sim (qz)^{l}$
• variance of the out-degree distribution for length $l$
$\sigma_l^2 = \langle X_l^2 \rangle - \langle X_l \rangle^2$
$= d_l\, p\,(1 - z^{l}) / [1 - q(1 - z^{l})^2] \sim d_l$ for large $l$
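Both closed forms can be checked against the sums they come from. The sketch below assumes exact matching ($z = 1/r$ with $r = 2$), treats $\langle n(k) \rangle$ as a fixed number of binomial trials, and truncates the sums at a large `k_max`; the function names and the values of `L` and `p` are illustrative.

```python
import math

def match_prob(l, k, z):
    """p(l, k) = 1 - (1 - z^l)^(k-l+1) for k >= l, else 0."""
    return 1.0 - (1.0 - z ** l) ** (k - l + 1) if k >= l else 0.0

def mean_gene_count(k, L, p):
    """<n(k)> = L p^2 q^k with q = 1 - p."""
    return L * p * p * (1.0 - p) ** k

def moments_by_summation(l, L, p, z, k_max=5000):
    """Mean and variance of the out-degree as explicit sums over k >= l."""
    mean = var = 0.0
    for k in range(l, k_max):
        pr = match_prob(l, k, z)
        nk = mean_gene_count(k, L, p)
        mean += nk * pr
        var += nk * pr * (1.0 - pr)      # binomial variance with n(k) trials
    return mean, var

def moments_closed_form(l, L, p, z):
    """d_l and sigma_l^2 from the closed expressions."""
    q = 1.0 - p
    zl = z ** l
    mean = L * p * (q * z) ** l / (p + q * zl)
    var = mean * p * (1.0 - zl) / (1.0 - q * (1.0 - zl) ** 2)
    return mean, var

L, p, z = 15000, 0.05, 0.5               # illustrative values, r = 2
for l in (1, 5, 12):
    print(l, moments_by_summation(l, L, p, z), moments_closed_form(l, L, p, z))
```

For large $l$ the ratio $\sigma_l^2 / d_l$ tends to 1, which is the Poisson limit used for the long genes.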
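for large $l$, $d_l \approx \sigma_l^2$: Poissonian
Out-degree distribution for small $l$ (large $d$): scaling behaviour of the envelope
peak height $h_l$ and width $\sigma_l$: $h_l \sigma_l \sim n(l)$
$h_l \sim n(l) / \sigma_l \sim L p^2 q^{l} / d_l^{1/2}$
$d_l \sim (qz)^{l}$, $h_l \sim (q/z)^{l/2}$
$h(d) \sim d^{-\gamma_1}$: $(qz)^{-\gamma_1} = (q/z)^{1/2}$ gives
$\gamma_1 = \tfrac12 (\ln z - \ln q) / (\ln z + \ln q) \approx \tfrac12 - p / \ln r$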
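Out-degree distribution for large $l$ (small $d$)
$P(X_l = d) = (d_l)^{d} \exp(-d_l) / d!$ (Poisson)
$P(d) = \sum_l n(l)\, P(X_l = d)$
$= Lp \sum_l p\, q^{l} (d_l)^{d} \exp(-d_l) / d!$
$\approx \int_0^{\infty} dx\, x^{d - \gamma_1 - 1/2} e^{-x} / d!$ for large $l$
$P(d) \propto \Gamma(d + \tfrac12 - \gamma_1) / \Gamma(d + 1) \sim d^{-\gamma_1 - 1/2}$ (Gamma functions)
where $\gamma_1 = \tfrac12 (\ln z - \ln q) / (\ln z + \ln q) \approx \tfrac12 - p / \ln r$
Scaling exponent $\gamma_1 + \tfrac12 \approx 1 - p / \ln r$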
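A quick numerical look at the two exponents. This sketch assumes the envelope exponent has the form $\tfrac12 (\ln z - \ln q)/(\ln z + \ln q)$, which is what $(qz)^{-\gamma_1} = (q/z)^{1/2}$ gives and what reduces to $\tfrac12 - p/\ln r$ for small $p$; the name `gamma_1`, the helper `local_slope` and the parameter values are illustrative.

```python
import math

def gamma_1(p, r, beta=float("inf")):
    """Envelope exponent; beta = inf corresponds to exact matching, z = 1/r."""
    q = 1.0 - p
    z = (1.0 + (r - 1) * math.exp(-beta)) / r
    return 0.5 * (math.log(z) - math.log(q)) / (math.log(z) + math.log(q))

def local_slope(d1, d2, g1):
    """Slope of ln[Gamma(d + 1/2 - g1) / Gamma(d + 1)] against ln d between d1 and d2."""
    def log_ratio(d):
        return math.lgamma(d + 0.5 - g1) - math.lgamma(d + 1.0)
    return (log_ratio(d2) - log_ratio(d1)) / (math.log(d2) - math.log(d1))

p, r = 0.05, 2                            # illustrative values
g1 = gamma_1(p, r)
print("gamma_1 =", round(g1, 4), " small-p estimate =", round(0.5 - p / math.log(r), 4))
print("slope   =", round(local_slope(1000, 2000, g1), 4), " expected =", round(-(g1 + 0.5), 4))
```

For $p = 0.05$ and $r = 2$ this gives an envelope exponent near 0.43 and a Gamma-function ratio decaying with exponent near 0.93, close to the small-$p$ estimates $0.5 - p/\ln r$ and $1 - p/\ln r$.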
Out-degree distribution: finite size effects
dotted: full Gaussian distribution taken for $P(X_l = d)$
solid lines: finite size correction $d_l^{out} = (\sigma_l^{out})^2$, $P(X_l = d)$ Poisson
Thus both for large and small $l$,
$P(d) = Lp \sum_l p\, q^{l} (d_l)^{d} \exp(-d_l) / d!$ provides a good representation
4. Crossover in the scaling behaviour
peaks well separated for $l < l_c \sim 8$
$d_l \sim (q/2)^{l}$;
$\sigma_l \sim d_l^{1/2} \to 0$ slower than $d_l$
crossover occurs where
$d_l - d_{l+1} \sim \sigma_l$
More precisely:
$(d_l - d_{l+1}) / 2 = \sigma_l \Rightarrow d_c \approx 6.6$
(From requiring that the minimum between the two Gaussian peaks centered at $d_l$ and $d_{l+1}$ vanish)
5. Simulation and analytical results:
The in-degree distribution
Solid line: finite size effect taken care of by inserting $d_l^{in} = (\sigma_l^{in})^2$
The second peak can be obtained accurately from
$d_l^{in} = \sum_{k \le l} n(k)\, p(k, l)$
$(\sigma_l^{in})^2 = \sum_{k \le l} n(k)\, p(k, l) [1 - p(k, l)]$
$p(d^{in}) = p\, q^{l} [2\pi (\sigma_l^{in})^2]^{-1/2} \exp[-(d - d_l^{in})^2 / 2(\sigma_l^{in})^2]$
The in-degree distribution
modelling gene interactions
A. Kabakcioglu, M. Mungan, D. Balcan, AE, preprint: does sequence matching also operate in the case of transcriptional gene interaction?
claim: secondary structures (conformations) of transcription factors are determined by their amino acid sequence, coded for by the corresponding DNA sequence; the different folds expose precise regulatory sites, which are recognized by regulatory sequences on the genome?
Experimental data from expression of mRNA in DNA array
M. Gustafsson, M. Hörnquist, A. Lombardi, “Large-scale reverse engineering by Lasso,” q-bio.MN/040312. On data from P.T. Spellman et al., Mol. Bio. Cell 9, 3273 (1998) from microarray experiments
Yeast data
(Saccharomyces cerevisiae)
Expected model out-degree distribution, with
Gaussian RS length distribution
Model with a Gaussian RS length distribution
single realisation, adjustable parameters $\langle l \rangle$, $\sigma_l$
and Yeast data
Comparison of network of a single realisation of the
model chromosome and yeast microarray experiment
Consensus data (http://cgsigma.cshl/org )
for length distribution of Regulatory Segments
RS length Gaussian distribution with parameters fixed
by comparison with out-degree of yeast data
Single realisation for two independent sets of
Regulatory Sequences associated
with each node of the network Si, S’i
Connectivity rule: $S_i \subset S'_j$
Note expected distributions will not change
Thanks Chrisantha!
Advances in Artificial Life: 5th Eur. Conf. (ECAL99), Vol. 1674, LNAI, Springer
$N = L / 4^{p}$
promoter seq. of length $p = 4$
“2”
averaged over 20 genomes - “oscillatory behavior”
from superposition of Poisson peaks
Evolution of gene networks by gene duplication
Wagner, PNAS 91, 4387 (1994); Vazquez, Flammini, Maritan and Vespignani, cond-mat/0108043; Sole, Pastor-Satorras, Smith and Kepler, Adv. Complex Syst. 5, 43 (2002)
• take random network
• duplicate gene with connections
• take out the connections with a given probability and establish new connections to random nodes with a given probability
scale free proteomic model
$\gamma = 2.5$; $C$ and minimum path length compare well with data
Sequence similarity
Gaussian network, evolution
by duplication of randomly chosen RSs, mutation
(Yasemin Şengün)
Summary
• random gene interaction network model with sequence matching for
- arbitrary alphabet
- finite temperature (partial matching)
• out-degree distribution power law for small $d$, log-periodic for large $d$
• exponents $\gamma = 1 - p / \ln r$, $\gamma_1 = 0.5 - p / \ln r$, ~ universal for small $p$
• single realisations compare well with experiment
• not scale free - crossover behaviour?