
Random sequence-matching model for emergent gene-regulatory networks

Ayşe Erzan

Istanbul Technical University, Gürsey Institute,

Collegium Budapest

Duygu Balcan (İTÜ) Muhittin Mungan (BÜ)

Alkan Kabakçıoğlu (Padova) Ayşe H. Bilge (İTÜ)

Yasemin Şengün (İTÜ)

Outline

• Random and “real” networks

• “central dogma” of gene regulation

• RNA interference and more

• sequence matching model for gene regulatory networks

• simulations and analytical results

• comparison with experiments

• outcomes of similar models

“Classical” random networks

Erdős and Rényi, Publ. Math. Inst. Hung. Acad. Sci. 5, 17 (1960)

N vertices

N(N-1)/2 possible connections

with probability p

• “degree distribution”

Poissonian for large N

$P(k) \sim e^{-z} z^k / k!$

$z = \langle k \rangle = pN$, $z_c = 1$

• Average minimal path length

$l_{ER} \sim \ln N / \ln z = (1 + \ln p / \ln N)^{-1}$

• Clustering coefficient

$C_{ER} = z/N = p$
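The three quantities above can be checked on a sampled graph. The following is a minimal sketch, assuming a plain G(N, p) construction; the function names, the seed and the values N = 400, p = 0.02 are illustrative choices, not taken from the talk.

```python
import math
import random

def erdos_renyi(n, p, seed=0):
    """Adjacency sets of an undirected G(n, p) graph (each pair linked with probability p)."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def mean_degree(adj):
    return sum(len(nb) for nb in adj) / len(adj)

def mean_clustering(adj):
    """Average of C_i = 2 E_i / [k_i (k_i - 1)] over nodes with k_i >= 2."""
    values = []
    for nb in adj:
        k = len(nb)
        if k < 2:
            continue
        nodes = sorted(nb)
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nodes[b] in adj[nodes[a]])
        values.append(2 * links / (k * (k - 1)))
    return sum(values) / len(values) if values else 0.0

def poisson(k, z):
    """Poisson degree distribution P(k) = exp(-z) z^k / k!."""
    return math.exp(-z) * z ** k / math.factorial(k)

n, p = 400, 0.02          # illustrative values
adj = erdos_renyi(n, p)
z = p * n
print("mean degree:", mean_degree(adj), "vs z = pN =", z)
print("clustering :", mean_clustering(adj), "vs p =", p)
print("ln N / ln z:", math.log(n) / math.log(z))
```

For finite N the exact mean degree is p(N - 1), which is why the sampled value sits slightly below pN.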

Random Networks

The probability of a connection between any two nodes is the same, p

Each of the N nodes has, on average, Np connections

“Small world” property: distances between nodes grow very weakly with N

The most highly connected nodes directly reach 25% of the rest

Naturally occurring networks

• Social and economic networks

• Citation and collaborative networks

• Technological networks

• WWW, communications networks

• Biological networks: neural networks, food networks, co-evolutionary networks, genomic networks

R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002)

S.N. Dorogovtsev and J.F.F. Mendes, Adv. Phys. 51, 1079 (2002)

“Real Networks”

A considerable number of very highly connected nodes

Their first neighbors make up 60% of the total

The most frequent nodes are those with very few connections (1)

Small world!

“Small world” / scale-free networks

• High clustering coefficient

$C_i = 2E_i / [k_i(k_i - 1)]$, with $\langle C_i \rangle > C_{ER} = z/N$

• Short average minimum path length $\langle l_{min} \rangle$

(comparable to an ER network for the same C and N; differs from regular lattices)

• Scale-free degree distribution

$P(k) \sim k^{-\gamma}$, cutoff $k_c$

A realisation:

the Barabási-Albert model of “preferential attachment”

A growing network in which the probability of attaching a new edge to vertex i is $\sim k_i$

$P(k) \sim k^{-3}$

(exact)
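The growth rule can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the seed graph is a complete graph on m + 1 nodes and each new node brings m edges, neither of which is specified on the slide. The names `barabasi_albert` and `stubs` are hypothetical.

```python
import random
from collections import Counter

def barabasi_albert(n, m, seed=0):
    """Grow a network to n nodes; each new node attaches m edges with probability ~ k_i."""
    rng = random.Random(seed)
    # Assumed seed graph: complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Every node appears in `stubs` once per incident edge, so a uniform
    # draw from this list selects node i with probability proportional to k_i.
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = barabasi_albert(2000, 2)
degree = Counter(v for e in edges for v in e)
histogram = Counter(degree.values())
for k in sorted(histogram)[:8]:
    print(k, histogram[k])
```

Plotting `histogram` on log-log axes should show an approximately straight tail; the slope approaches -3 only for large networks.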

(Models with preferential attachment 2)

gene regulation networks - transcription regulatory networks

The “central dogma”

[Figure: DNA (promoter1 gene1, promoter2 gene2, promoter3 gene3) is transcribed into RNA (mRNA, tRNA, rRNA); mRNA is translated at the ribosome into amino acid chains, giving proteins (structural and regulatory, including transcription factors).]

Adapted from Alvis Brazma, www.ebi.ac.uk/microarray/research/networks/genetics

Transcription regulatory

network of protein interactions in E. coli

(from S. Maslov)

Data from Regulon Database

606 interactions

424 operons

Out-degree: $1 < k_{out} < 85$ (broader)

In-degree: $1 < k_{in} < 6$

Transcription regulatory network of protein interactions in Homo sapiens

data obtained from literature search

1449 regulations

689 proteins

$k_{out} < 96$

$k_{in} < 40$

(from S. Maslov)

$n(k_{out}) \sim k_{out}^{-2.5}$

$\gamma = 2.5$

• A. Wagner, Mol. Biol. Evol. 18, 1283 (2001).

Duplication and divergence of genes - interaction

between their regulatory proteins

for a review of properties see:

R.V. Sole and R. Pastor Satorras, in Handbook of Graphs and Networks (Bornholdt and Schuster eds., Wiley-VCH, Berlin 2002)

Previous “wisdom”

• out-degree distribution scale free with $\gamma = 2.5$ !!?

• A. Wagner, Mol. Biol. Evol. 18, 1283 (2001); Jeong, Mason, Barabási, Oltvai, Nature 411, 41 (2001); Maslov and Sneppen, Science 296, 910 (2002)

•narrower in-degree distribution than out-degree distribution

•small world with non-classical clustering

Transcription Regulatory genomic and

Protein Interaction Networks (interactions between regulatory proteins)

RNA interference

New paradigm? Post Transcriptional Gene Suppression (PTGS)

“RNA can bind directly on similar DNA sequences

and silence genes at the transcriptional stage”

Watson-Crick base pairing between nucleic acids

DNA – Adenine, Guanine , Thymine, Cytosine A-T C-G

RNA - Adenine, Guanine , Uracil, Cytosine A-U C-G

stabilisation, replication and transcription of DNA

RNA interference (siRNA binding to mRNA or chr. DNA)

binding of regulatory proteins on to mRNA

Basic mechanism of (lock-and-key combinations) :

sequence matching

• D. Balcan, AE, Eur. Phys. J. B 38, 253 (2004);

• M. Mungan, A. Kabakcioglu, D. Balcan, AE, q-bio.MN/0406049

• three-dimensional architecture (secondary structure) also sequence dependent:

- amino-acid recognition by tRNA

- amino-acid binding by rRNA in ribosome

- binding of transcription factors to promoter regions

Greater generality for modeling genomic interactions?

Stay tuned!

10010101112110111012100121020011000010101000210110101011010201120010111011022

Modeling the “chromosome”: a random sequence of 0, 1, 2, containing genes $G_i$, $G_j$, ...

$x_i = 0, 1$: coding regions, each with probability $(1-p)/2$

$x_i = 2$: start/stop signs for “genes,” probability $p$

string $G_i = \{x_{i1}, x_{i2}, \ldots, x_{il}\}$ with all $x_{ij} \neq 2$: a gene

$l$ = length of gene, $\langle l \rangle = (1-p)/p$, $\langle n(l) \rangle = L p^2 (1-p)^{l}$

$L$ = length of chromosome; $l = 0$: a null gene

Emergent gene expression networks?

Sequence matching → gene regulation? Model connectivity matrix of genomic network:

$w_{ij} = 1$ iff the string $G_i$ is embedded inside the string $G_j$ ($G_i \subset G_j$; $l_i \le l_j$)

$w_{ij} = 0$ otherwise.

[Figure: the gene 1101 embedded in other genes of the chromosome string; interference (suppression) edges are directed, with $k_{out} = 2$ and $k_{in} = 1$ in the example shown.]

Simulations: clustering coefficient

$C_i = 2E(i) / \{k(i)[k(i) - 1]\}$

number of edges connecting nearest neighbours / total number of possible connections

for incoming or outgoing bonds to the site i

$\langle C_{out} \rangle = 0.034$

$\langle C_{in} \rangle = 0.648$

$\langle C \rangle = 0.534 = \langle z \rangle / \langle s \rangle$

non-classical behaviour

“Percolation threshold” $p_c$:

giant cluster breaks up for $p < p_c(L)$

(Lp = frequency of stop-start signs): N (number of genes) too small, genes too long

exponent ~ -3/4 (preliminary)

“Extremely small world” networks!

cluster radius = average minimum path length

directed edges (in or out): $l_{min} = 1$ (transitivity!)

undirected edges: $l_{min} = 1$, $l_{max} \le 4$, e.g. 11111 - 1 - 001101 - 0 - 00000

$\langle l_{min} \rangle$ depends very weakly on p for fixed L

$p_c < p < 1/2$: most genes of length unity

$l_{min}$ undefined for $p \le p_c(L)$

L = 15000: $\langle l_{min} \rangle \sim 1.66$; $\langle l_{min} \rangle \to 1.87$ as $p \to p_c$

Random point mutations

• $x \in \{0, 1\}$: $x \to (x + 1) \bmod 2$

• $x = 2$: $x_i \leftrightarrow x_{i \pm 1}$, random walk steps taken by STOP and GO signs

long-range modifications due to change in reading frame
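A single mutation step might look as follows. This is a sketch under explicit assumptions: a coding bit is flipped (0 ↔ 1), and a 2 takes a random-walk step by exchanging places with a randomly chosen neighbouring site, with steps off either end of the chromosome rejected. The function name `mutate` and the boundary rule are illustrative choices, not taken from the talk.

```python
import random

def mutate(chromosome, rng):
    """Apply one random point mutation and return the new chromosome string."""
    seq = list(chromosome)
    i = rng.randrange(len(seq))
    if seq[i] in "01":
        # Coding element: x -> (x + 1) mod 2
        seq[i] = str((int(seq[i]) + 1) % 2)
    else:
        # Start/stop sign: swap with a neighbour (assumed random-walk rule).
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(seq):          # assumed: steps off the ends are rejected
            seq[i], seq[j] = seq[j], seq[i]
    return "".join(seq)

rng = random.Random(1)
chromosome = "0120110210"              # hand-made example
for _ in range(5):
    chromosome = mutate(chromosome, rng)
    print(chromosome)
```

Because a moving 2 shifts a gene boundary, one such step can lengthen one gene and shorten its neighbour, which is the long-range effect noted above; the number of 2s is conserved under this rule.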

simulations: network robust under random mutations

peaks ~ geometrically spaced for kout small (log-periodic) ~ periodic for kout large

last peak - the size distribution of the giant cluster (single bit genes connect to almost all others)

Degree distribution: preliminary simulation results

Maxima of the peaks: $n_m \sim k_{out}^{-\gamma}$

$\gamma \approx 0.9$ for small k

$\gamma \approx 0.4$ for large k

no double scaling for p = 0.05

0.45

$n(k) \sim k^{-\gamma}$, $\gamma \in (1.1, 1.8)$

Simulation results: crossover in the scaling behaviour of the degree distribution

[Figure: degree distribution with the crossover at $d_c$; line: analytical, circles: simulation.]

1. The matching probabilities

Probability of a given string of length l to be reproduced in a randomly chosen string of length k, for an alphabet of r letters:

$p(l, k) = 1 - (1 - r^{-l})^{k-l+1} \approx r^{-l}(k - l + 1)$ for l large, neglecting correlations between overlaps

$r^{-l}$: one over the number of l-strings with r letters

$(k - l + 1)$: number of shifted l-substrings in a k-string (1 for k = l)

very good approximation for $r^{-l}(k - l) < 1$
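The expression can be compared with brute-force enumeration for short strings. The sketch below is illustrative: `p_match` evaluates the formula above and `p_match_exact` counts, over all pairs of strings, how often the short string occurs inside the long one. Both names are hypothetical, and the enumeration is only practical for small l and k.

```python
from itertools import product

def p_match(l, k, r=2):
    """Approximate matching probability 1 - (1 - r^-l)^(k - l + 1); zero if l > k."""
    if l > k:
        return 0.0
    return 1.0 - (1.0 - r ** (-l)) ** (k - l + 1)

def p_match_exact(l, k, r=2):
    """Fraction of (x, y) pairs, |x| = l and |y| = k, with x a substring of y."""
    alphabet = "0123456789"[:r]
    xs = ["".join(t) for t in product(alphabet, repeat=l)]
    hits = 0
    for y in product(alphabet, repeat=k):
        ys = "".join(y)
        hits += sum(1 for x in xs if x in ys)
    return hits / (r ** l * r ** k)

for l, k in [(1, 5), (2, 6), (3, 8), (4, 8)]:
    print(l, k, p_match(l, k), p_match_exact(l, k))
```

For l = k and for l = 1 the two functions agree exactly; in between, the difference comes from the neglected correlations between overlapping substrings.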

Analytical calculations

Computing the matching probabilities

strings x and y, of lengths l and $k \ge l$

$y_{a,l}$ = substring of y, of length l, that has been shifted by a

$U(x, y_{a,l})$ = Hamming distance between x and $y_{a,l}$ (U = 0: match; U ≠ 0: no match)

$1 - f_a(x, y; \beta) = 1 - \exp[-\beta U(x, y_{a,l})]$ → 0 or 1 for $\beta \to \infty$ (counts no-matches)

$p(l, k; x, \beta) = 1 - (\text{number of no-matches} / r^k)$, summed over y

$p(l, k; x, \beta) = 1 - r^{-k} \sum_y \prod_{0 \le a \le k-l} [1 - f_a(x, y; \beta)]$ (all no-match for any shift a)

Cluster expansion. Do x averages; 2-pt averages over the f factorise;

approximate all higher orders with factorised ones

for $k \ge l$ get

$p(l, k) = 1 - (1 - z^l)^{k-l+1}$; $z = [1 + (r - 1)e^{-\beta}] / r$

$\approx z^l (k - l + 1)$ for l large
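The role of z can be illustrated numerically. The sketch below assumes that the inverse temperature is called beta and that z is the single-site average of the Boltzmann factor exp(-beta U), so that the average over all pairs of l-strings factorises into z^l. The names `z_factor` and `mean_boltzmann` are hypothetical.

```python
import math
from itertools import product

def z_factor(r, beta):
    """Single-site average of exp(-beta * mismatch): [1 + (r - 1) exp(-beta)] / r."""
    return (1 + (r - 1) * math.exp(-beta)) / r

def mean_boltzmann(l, r, beta):
    """Average of exp(-beta * Hamming(x, y)) over all pairs of l-strings with r letters."""
    total = 0.0
    for x in product(range(r), repeat=l):
        for y in product(range(r), repeat=l):
            u = sum(a != b for a, b in zip(x, y))   # Hamming distance
            total += math.exp(-beta * u)
    return total / r ** (2 * l)

r, l, beta = 3, 3, 0.7       # illustrative values
print(mean_boltzmann(l, r, beta), z_factor(r, beta) ** l)
print(z_factor(2, 50.0))     # approaches 1/r as beta grows (exact matching)
```

In the zero-temperature limit z reduces to 1/r, which recovers the exact-matching expression with $r^{-l}$.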

Matching probability for r = 2

$p(l, k) = 1 - (1 - 2^{-l})^{k-l+1} \approx 2^{-l}(k - l + 1)$ for $l \le k$; 0 otherwise

[Figure: p(l, k) versus l; curves for embedding-string lengths k = 16, 14, 12, 10, 8, 6, 4, 2 from top to bottom, $k \ge l$; symbols: exact enumeration, line: above expression.]

2. Understanding the sequence matching data

Matching l with d: long “genes”, small degree

number of out-edges from a randomly chosen gene of length l to genes of length k:

$X_{lk} = \sum_\alpha X_{lk}(\alpha)$

$\alpha$: different realisations of genes of length k

$X_{lk}(\alpha)$: independent random variables, so $X_{lk}$ is binomially distributed with probability p(l, k)

Poisson for small p(l, k), i.e. large l

total number of out-edges from a randomly chosen gene of length l:

$X_l = \sum_k X_{lk}$, Gaussian distributed via the Central Limit Theorem

with mean $\langle X_l \rangle$ and variance $\langle X_l^2 \rangle - \langle X_l \rangle^2$

$X_l$ Poisson for large l

3. Calculation of the out-degree distribution

• mean out-degree for genes of length l, for model with exponential gene length distribution

$\langle n(k) \rangle = L p^2 q^k$, with $q = 1 - p$ the probability of a coding element

$d_l = \langle X_l \rangle = \sum_{k \ge l} \langle X_{lk} \rangle = \sum_{k \ge l} p(l, k) \langle n(k) \rangle$

$= L p (qz)^l / (p + q z^l) \sim (qz)^l$

• variance of out-degree distribution for length l

$\sigma_l^2 = \langle X_l^2 \rangle - \langle X_l \rangle^2$

$= d_l\, p (1 - z^l) / [1 - q(1 - z^l)^2] \sim d_l$ for large l

for large l, $d_l \simeq \sigma_l^2$: Poissonian
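Both closed forms follow from geometric sums over k and can be checked numerically. The sketch below assumes that the number of genes of each length is held at its mean value L p^2 q^k, so that the variance is a sum of binomial variances. The function names and the parameter values are illustrative.

```python
def p_match(l, k, z):
    """Matching probability 1 - (1 - z^l)^(k - l + 1) for k >= l, else 0."""
    return 1.0 - (1.0 - z ** l) ** (k - l + 1) if k >= l else 0.0

def n_genes(k, L, p):
    """Mean number of genes of length k: L p^2 q^k with q = 1 - p."""
    return L * p * p * (1 - p) ** k

def d_sum(l, L, p, z, kmax=5000):
    """Mean out-degree by direct summation over target lengths k >= l."""
    return sum(p_match(l, k, z) * n_genes(k, L, p) for k in range(l, kmax))

def d_closed(l, L, p, z):
    """Closed form L p (q z)^l / (p + q z^l)."""
    q = 1 - p
    return L * p * (q * z) ** l / (p + q * z ** l)

def var_sum(l, L, p, z, kmax=5000):
    """Variance as a sum of binomial variances n(k) p(l,k) [1 - p(l,k)]."""
    return sum(n_genes(k, L, p) * p_match(l, k, z) * (1 - p_match(l, k, z))
               for k in range(l, kmax))

def var_closed(l, L, p, z):
    """Closed form d_l p (1 - z^l) / [1 - q (1 - z^l)^2]."""
    q = 1 - p
    a = 1 - z ** l
    return d_closed(l, L, p, z) * p * a / (1 - q * a * a)

L, p, z = 15000, 0.05, 0.5   # illustrative: r = 2, exact matching
for l in (1, 4, 8, 12):
    print(l, d_sum(l, L, p, z), d_closed(l, L, p, z),
          var_sum(l, L, p, z), var_closed(l, L, p, z))
```

The printed ratio of variance to mean approaches 1 as l grows, which is the Poissonian limit stated above.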

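Out-degree distribution for small l (large d): scaling behaviour of the envelope

$h_l$: height of the peak of width $\sigma_l$ centred at $d_l$

$h_l \sigma_l \sim n(l)$

$h_l \sim n(l) / \sigma_l \sim L p^2 q^l / d_l^{1/2}$

$d_l \sim (qz)^l$, $h_l \sim (q/z)^{l/2}$

$h(d) \sim d^{-\gamma_1}$: $(qz)^{-\gamma_1} = (q/z)^{1/2}$ gives

$\gamma_1 = \frac{1}{2} (\ln z - \ln q) / (\ln z + \ln q)$

$\approx \frac{1}{2} - p / \ln r$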

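Out-degree distribution for large l (small d)

$P(X_l = d) = d_l^{\,d} \exp(-d_l) / d!$ (Poisson)

$P(d) = \sum_l n(l)\, P(X_l = d)$

$= L p \sum_l p q^l\, d_l^{\,d} \exp(-d_l) / d!$

$\propto \int_0^\infty dx\, x^{d - \gamma_1 - 1/2} e^{-x} / d!$ for large l

$P(d) \propto \Gamma(d + \frac{1}{2} - \gamma_1) / \Gamma(d + 1) \sim d^{-\gamma_1 - 1/2}$ (Gamma functions)

where $\gamma_1 = \frac{1}{2} (\ln z - \ln q) / (\ln z + \ln q) \approx \frac{1}{2} - p / \ln r$

Scaling exponent $\gamma_1 + \frac{1}{2} \approx 1 - p / \ln r$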

Out-degree distribution: finite size effects

dotted: full Gaussian distribution taken for $P(X_l = d)$

solid lines: finite size correction $d_l^{out} = (\sigma_l^{out})^2$, $P(X_l = d)$ Poisson

Thus both for large and small l,

$P(d) = L p \sum_l p q^l\, d_l^{\,d} \exp(-d_l) / d!$ provides a good representation
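The Poisson superposition is easy to evaluate numerically. The sketch below is illustrative: it drops the overall factor Lp (so the result is a probability rather than a gene count), starts the sum at l = 1, and works in log space to avoid overflow in $d_l^{\,d}$. It also evaluates the small-d exponent as ln z / (ln z + ln q), which is what the change of variables x = d_l in the sum yields. Function names and parameter values are assumptions.

```python
import math

def d_mean(l, L, p, z):
    """Mean out-degree of a gene of length l: L p (q z)^l / (p + q z^l)."""
    q = 1 - p
    return L * p * (q * z) ** l / (p + q * z ** l)

def out_degree_pmf(d, L, p, z, lmax=200):
    """sum over l of p q^l * Poisson(d; d_l), evaluated term by term in log space."""
    q = 1 - p
    total = 0.0
    for l in range(1, lmax):
        dl = d_mean(l, L, p, z)
        log_term = (math.log(p) + l * math.log(q)
                    + d * math.log(dl) - dl - math.lgamma(d + 1))
        total += math.exp(log_term)
    return total

L, p, r = 15000, 0.05, 2     # illustrative values
z, q = 1.0 / r, 1 - p

small_d_exponent = math.log(z) / (math.log(z) + math.log(q))
print("small-d exponent:", small_d_exponent, "vs 1 - p/ln r =", 1 - p / math.log(r))

for d in (1, 2, 4, 8, 16, 32, 64):
    print(d, out_degree_pmf(d, L, p, z))
```

Summing `out_degree_pmf` over all d gives approximately q = 1 - p rather than 1, because genes of length zero are left out of the sum.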

4. Crossover in the scaling behaviour

peaks well separated for $l < l_c \sim 8$

$d_l \sim (q/2)^l$;

$\sigma_l \sim d_l^{1/2} \to 0$ slower than $d_l$

crossover occurs where

$d_l - d_{l+1} \sim \sigma_l$

More precisely:

$(d_l - d_{l+1}) / 2 = \sigma_l$, giving $d_c \approx 6.6$

(From requiring that the minimum between the two Gaussian peaks centered at $d_l$ and $d_{l+1}$ vanish)

5. Simulation and analytical results: the in-degree distribution

Solid line: finite size effect taken care of by inserting $d_l^{in} = (\sigma_l^{in})^2$

The second peak can be obtained accurately from

$d_l^{in} = \sum_{k \le l} n(k)\, p(k, l)$

$(\sigma_l^{in})^2 = \sum_{k \le l} n(k)\, p(k, l)\, [1 - p(k, l)]$

$p(d^{in}) = \sum_l p q^l\, [2\pi (\sigma_l^{in})^2]^{-1/2} \exp[-(d - d_l^{in})^2 / 2(\sigma_l^{in})^2]$

modelling gene interactions

A. Kabakcioglu, M. Mungan, D. Balcan, AE, preprint: does sequence matching also operate in the case of transcriptional gene interaction?

Claim: the secondary structures (conformations) of transcription factors are determined by their amino acid sequence, coded for by the corresponding DNA sequence. Do the different folds expose precise regulatory sites, which are recognized by regulatory sequences on the genome?

Experimental data from expression of mRNA in DNA array

M. Gustafsson, M. Hörnquist, A. Lombardi, “Large-scale reverse engineering by Lasso,” q-bio.MN/040312. On data from P.T. Spellman et al., Mol. Biol. Cell 9, 3273 (1998), from microarray experiments

Yeast data

(Saccharomyces cerevisiae)

Expected model out-degree distribution, with

Gaussian RS length distribution

Model with a Gaussian RS length distribution

single realisation, adjustable parameters $\langle l \rangle$, $\sigma_l$

and Yeast data

Comparison of network of a single realisation of the

model chromosome and yeast microarray experiment

Consensus data (http://cgsigma.cshl/org )

for length distribution of Regulatory Segments

RS length Gaussian distribution with parameters fixed

by comparison with out-degree of yeast data

Single realisation for two independent sets of

Regulatory Sequences associated

with each node of the network Si, S’i

Connectivity rule: $S_i \subset S'_j$

Note expected distributions will not change

Thanks Chrisantha!

Advances in Artificial Life: 5th Eur. Conf. (ECAL99), Vol. 1674, LNAI, Springer

$N = L/4^p$, promoter seq. of length $p = 4$

“2”

averaged over 20 genomes - “oscillatory behavior”

from superposition of Poisson peaks

Evolution of gene networks by gene duplication

Wagner, PNAS 91, 4387 (1994); Vazquez, Flammini, Maritan and Vespignani, cond-mat/0108043; Sole, Pastor-Satorras, Smith and Kepler, Adv. Complex Syst. 5, 43 (2002)

• take random network

• duplicate gene with connections

• take out the connections with a given probability and establish new connections to random nodes with a second probability

scale-free proteomic model: $\gamma = 2.5$; C and minimum path length compare well with data
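The duplication recipe can be sketched as follows. This is a loose illustration rather than any of the cited models: the seed network (a single edge), the parameter names `delta` and `alpha`, and the choice of adding at most one new link per duplication are all assumptions made here for brevity.

```python
import random
from collections import Counter

def duplicate_and_diverge(adj, delta, alpha, rng):
    """Duplicate a random node, drop each inherited link with probability delta,
    then with probability alpha add one link to a random pre-existing node."""
    n = len(adj)
    source = rng.randrange(n)
    new = n
    adj.append(set())
    for nb in sorted(adj[source]):
        if rng.random() >= delta:          # inherited link survives
            adj[new].add(nb)
            adj[nb].add(new)
    if rng.random() < alpha:
        target = rng.randrange(n)          # never the new node itself
        adj[new].add(target)
        adj[target].add(new)
    return new

def grow(n_final, delta, alpha, seed=0):
    """Grow an undirected network by repeated duplication (assumed seed: one edge)."""
    rng = random.Random(seed)
    adj = [{1}, {0}]
    while len(adj) < n_final:
        duplicate_and_diverge(adj, delta, alpha, rng)
    return adj

adj = grow(300, delta=0.5, alpha=0.2)      # illustrative parameter values
histogram = Counter(len(nb) for nb in adj)
for k in sorted(histogram)[:8]:
    print(k, histogram[k])
```

The resulting degree histogram depends strongly on the two probabilities, so reproducing a particular exponent would require tuning them as in the cited papers.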

Sequence similarity

Gaussian network, evolution

by duplication of randomly chosen RSs, mutation

(Yasemin Şengün)

Summary

• random gene interaction network model with sequence matching for

- arbitrary alphabet

- finite temperature (partial matching)

• out-degree distribution: power law for small d, log-periodic for large d

• exponents $\gamma = 1 - p / \ln r$, $\gamma_1 = 0.5 - p / \ln r$, ~ universal for small p

• single realisations compare well with experiment

• not scale free - crossover behaviour?
