TRANSCRIPT
Information Transfer in Biological Systems: Extracting Information from Biological Data∗
W. Szpankowski†
Department of Computer Science, Purdue University, W. Lafayette, IN 47907
June 21, 2011
NSF STC Center for Science of Information
Venice, 2011
Dedicated to PHILIPPE FLAJOLET
∗Research supported by NSF Science & Technology Center, and Humboldt Foundation. †With M. Aktulga, Y. Choi, A. Grama, M. Koyuturk, I. Kontoyiannis, L. Lyznik, M. Regnier, L. Szpankowski.
Outline
1. What is Information?
• Shannon Legacy
• Post-Shannon Information
2. Information Discovery in Massive Data: Pattern Matching
• Statistical Significance
• Finding Weak Signals in Biological Data
3. Information in Correlation: Finding Dependency
• Mutual Information
• Alternative Splicing
4. Information in Structures: Network Motifs
• Biological Networks
• Finding Biologically Significant Structures
• Information Embedded in Graphs
Biological Signals?
The biological world is highly stochastic and inhomogeneous (S. Salzberg).

[Figure: schematic gene structure — promoter, 5’ UTR, transcription start, start codon, exons and introns with donor/acceptor splice sites, stop codon, poly-A site, and 3’ UTR.]
Life is a delicate interplay of energy, entropy, and information; essential functions of living beings correspond to the generation, consumption, processing, preservation, and duplication of information.
Information Flow in Biology
Manfred Eigen (Nobel Prize, 1967):
“The differentiable characteristic of the living systems is Information. Information assures the controlled reproduction of all constituents, ensuring conservation of viability . . . Information theory, pioneered by Claude Shannon, cannot answer this question . . . in principle, the answer was formulated 130 years ago by Charles Darwin.”
P. Nurse (Nature, 2008, “Life, Logic, and Information”): Focusing on information flow will help us to better understand how cells and organisms work. “. . . the generation of spatial and temporal order, cell memory and reproduction are not fully understood.”
Some Fundamental Questions:
• how information is generated and transferred through underlying mechanisms of variation and selection;
• how information in biomolecules (sequences and structures) relates to the organization of the cell;
• whether there are error correcting mechanisms (codes) in biomolecules;
• and how organisms survive and thrive in noisy environments.
Shannon Information Theory and Its Challenges
The Information Revolution was started in 1948 by Claude Shannon.
Claude Shannon: Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty. “These semantic aspects of communication are irrelevant . . .”
Post-Shannon Challenges: Classical Information Theory needs a recharge to meet the new challenges of today’s applications in biology, modern communication, knowledge extraction, economics, physics, and beyond.
We need to extend Shannon information theory to include (“meaning”):

structure, time, space, and semantics,

and others, such as: dynamic information, limited resources, complexity, physical information, representation-invariant information, and cooperation & dependency.
What is Information?1
C. F. von Weizsäcker:
“Information is only that which produces information” (relativity).
“Information is only that which is understood” (rationality).
“Information has no absolute meaning.”
Informally Speaking: A piece of data carries information if it can impact a recipient’s ability to achieve the objective of some activity in a given context within limited available resources.
Event-Driven Paradigm: Systems, State, Event, Context, Attributes, Objective. The objective function objective(R, C) maps a system’s rules R and context C into an objective space.
Definition 1 (J. Konorski, W.S., 2004). The amount of information (in a faultless scenario) I(E) carried by the event E in the context C, as measured for a system with the rules of conduct R, is

IR,C(E) = cost[objectiveR(C(E)), objectiveR(C(E) + E)]

where cost(·,·) is a cost (weight, distance) function.
1Russell’s reply to Wittgenstein’s precept “whereof one cannot speak, thereof one must be silent” was: “. . . Mr. Wittgenstein manages to say a good deal about what cannot be said.”
Post-Shannon Challenges
Structure: Measures are needed for quantifying information embodied in structures (e.g., material structures, nanostructures, biomolecules, gene regulatory networks, protein interaction networks, social networks).

Time & Space: Classical Information Theory is at its weakest in dealing with problems of delay (e.g., information arriving late may be useless or have less value).

Limited Computational Resources: In many scenarios, information is limited by available computational resources (e.g., a cell phone, a living cell).
NSF Center for Science of Information
In 2010 the National Science Foundation established the $25M Science and Technology Center for Science of Information (http://soihub.org) to advance science and technology through a new quantitative understanding of the representation, communication and processing of information in biological, physical, social and engineering systems.

The center is located at Purdue University; partner institutions include Berkeley, MIT, Princeton, Stanford, UIUC, Bryn Mawr, and Howard U.
Specific Center’s Goals:
• define core theoretical principles governing transfer of information.
• develop meters and methods for information.
• apply to problems in physical and social sciences, and engineering.
• offer a venue for multi-disciplinary long-term collaborations.
• transfer advances in research to education and industry.
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
• Statistical Significance
• Finding Weak Signals in Biological Data
3. Information in Correlation: Finding Dependency
4. Information in Structures: Network Motifs
Pattern Matching
Let W = w1 . . . wm and T be strings over a finite alphabet A.

Basic question: how many times does W occur in T?

Define On(W) — the number of times W occurs in T, that is,

On(W) = #{i : T_{i−m+1} · · · T_i = W, m ≤ i ≤ n}.
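The occurrence count On(W) above includes overlapping matches, which a naive substring search handles directly. A minimal Python sketch (function name is illustrative):

```python
def count_occurrences(w: str, t: str) -> int:
    """Count (possibly overlapping) occurrences of pattern w in text t,
    i.e. O_n(W) = #{i : T_{i-m+1} ... T_i = W}."""
    m = len(w)
    return sum(1 for i in range(len(t) - m + 1) if t[i:i + m] == w)

# Overlapping matches are counted: AAA occurs twice in AAAA.
print(count_occurrences("AAA", "AAAA"))             # 2
print(count_occurrences("ATG", "CGCCATGCCCATGTT"))  # 2
```

Note that Python's built-in `str.count` would report only non-overlapping occurrences, which is why the explicit window scan is used here.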
Basic Thrust of our Approach
When searching for over-represented or under-represented patterns we must ensure that such a pattern is not generated by randomness itself (to avoid too many false positives).
Weak Signals and Artifacts
In a (biological) sequence, whenever a word is overrepresented, its subwords and proximate words are also likely to be overrepresented (the so-called artifacts). Example: if W1 = AATAAA is overrepresented, then W2 = ATAAAN is also overrepresented.
New Approach:
Once a dominating signal has been detected, we look for a weaker signal by comparing the number of observed occurrences of patterns to the conditional expectations, not the regular expectations.

In particular, using the methodology presented above, Denise and Régnier (2002) were able to prove that

E[On(W2) | On(W1) = k] ∼ αn

provided W1 is overrepresented, where α can be explicitly computed (often α = P(W2) if W1 and W2 do not overlap).
Polyadenylation Signals in Human Genes
Beaudoing et al. (2000) studied several variants of the well-known AAUAAA polyadenylation signal in mRNA of human genes. To avoid artifacts, Beaudoing et al. discarded all sequences in which the overrepresented hexamer was found.

Using our approach, Denise and Régnier (2002) detected and eliminated all artifacts and found new signals in a much simpler and more reliable way.
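For intuition about the Z-scores in the table that follows, here is a hedged sketch of how overrepresentation can be scored under a simple i.i.d. background model. This is not the exact method of Beaudoing et al. or Denise–Régnier: it ignores word-overlap autocorrelation, so it is only a first approximation, and `zscore_iid` is an illustrative name.

```python
import math

def zscore_iid(word: str, text: str, base_probs: dict) -> float:
    """Approximate z-score for the count of `word` in `text` under an
    i.i.d. background model (overlap autocorrelation is ignored)."""
    m, n = len(word), len(text)
    positions = n - m + 1
    # Probability of the word at a fixed position under independence.
    p_word = math.prod(base_probs[c] for c in word)
    observed = sum(1 for i in range(positions) if text[i:i + m] == word)
    expected = positions * p_word
    # Binomial variance approximation for the occurrence count.
    var = positions * p_word * (1 - p_word)
    return (observed - expected) / math.sqrt(var)
```

A large positive z-score flags a candidate overrepresented word; the conditional-expectation refinement on the previous slide is what separates genuine weak signals from artifacts of a dominant one.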
Hexamer  Obs.  Rk   Exp.    Z-sc.  Rk   Cd.Exp.  Cd.Z-sc.   Rk
AAUAAA   3456   1   363.16  167.03  1        —        —      1
AAAUAA   1721   2   363.16   71.25  2   1678.53     1.04   1300
AUAAAA   1530   3   363.16   61.23  3   1311.03     6.05    404
UUUUUU   1105   4   416.36   33.75  8    373.30    37.87      2
AUAAAU   1043   5   373.23   34.67  6   1529.15   -12.43   4078
AAAAUA   1019   6   363.16   34.41  7    848.76     5.84    420
UAAAAU   1017   7   373.23   33.32  9    780.18     8.48    211
AUUAAA   1013   8   373.23   33.12 10    385.85    31.93      3
AUAAAG    972   9   184.27   58.03  4    593.90    15.51     34
UAAUAA    922  10   373.23   28.41 13   1233.24    -8.86   4034
UAAAAA    922  11   363.16   29.32 12    922.67     9.79    155
UUAAAA    863  12   373.23   25.35 15    374.81    25.21      4
CAAUAA    847  13   185.59   48.55  5    613.24     9.44    167
AAAAAA    841  14   353.37   25.94 14    496.38    15.47     36
UAAAUA    805  15   373.23   22.35 21   1143.73   -10.02   4068
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
3. Information in Correlation: Finding Dependency
• Mutual Information
• Alternative Splicing
4. Information in Structures: Network Motifs
How to Measure Dependency: Theoretical Background

Question: How to measure dependency between two parts of DNA?

Suppose we have two strings of unequal lengths

X_1^n = X1, X2, . . . , Xn
Y_1^M = Y1, Y2, Y3, . . . , YM,

where M ≥ n, taking values in a common finite alphabet A.

We wish to differentiate between the following two scenarios:

(I) Independence: X_1^n and Y_1^M are independent.
(II) Dependence: Find 1 ≤ J ≤ M − n + 1 such that Y_J^{J+n−1} depends on X_1^n.
To distinguish (I) and (II), we compute the mutual information defined as

I(X; Y) = Σ_{x,y∈A} V(x, y) log [ V(x, y) / (P(x)Q(y)) ],

where V(x, y) is the joint distribution of X and Y. We conclude:
(I) Independence implies I(n) → 0 (since then V(x, y) = P(x)Q(y));
(II) Dependence implies I(n) > 0.
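The empirical version of this quantity is straightforward to compute; here V, P, and Q are estimated by frequencies from two aligned, equal-length sequences (a simplification of the slides' setup, where a window of length n slides along Y; the function name is illustrative).

```python
from collections import Counter
from math import log2

def mutual_information(x: str, y: str) -> float:
    """Empirical mutual information I(X;Y), in bits, between two aligned
    sequences of equal length, with V, P, Q estimated by frequencies."""
    n = len(x)
    assert n == len(y) and n > 0
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # I = sum over (a,b) of V(a,b) * log2( V(a,b) / (P(a) Q(b)) )
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

# Identical sequences give I(X;X) = H(X); here H = 2 bits for a
# uniform four-letter sequence.
print(mutual_information("ACGT" * 100, "ACGT" * 100))  # 2.0
```

For independent strings the estimate is close to, but not exactly, zero at finite n, which is precisely why the thresholding analysis on the next slides is needed.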
Scenario (I): Independence
The multivariate central limit theorem (applied to the empirical frequencies) shows that nI(n) converges in distribution to a (scaled) chi-squared distribution, that is,

(2 ln 2) n I(n) →_D Z ∼ χ²((|A| − 1)²),

where Z has a χ² distribution with k = (|A| − 1)² degrees of freedom.

Therefore,

Pe,1 = Pr{declare dependence | independent strings}
     = Pr{I(n) > θ | independent strings}
     ≈ Pr{Z > (2 ln 2)θn},

hence

Pe,1 ≈ 1 − γ(k, (θ ln 2)n)/Γ(k) ≈ exp(−(θ ln 2)n).
Scenario (II): Dependence
Here, I(n) → I with probability one at rate 1/√n, that is,

√n [I(n) − I] →_D V ∼ N(0, σ²).

Therefore:

Pe,2 = Pr{declare independence | dependent strings}
     = Pr{I(n) ≤ θ | dependent strings}
     ≈ Pr{V ≤ (θ − I)√n} ≈ exp(−n(I − θ)²/(2σ²)).
Practical Approach: In practice, declaring false positives is much more undesirable than overlooking potential dependence. In our experiments we decide on an acceptably small false-positive probability ε, and then select the threshold θ such that

Pe,1 = Pr{I(n) > θ | independent strings} ≈ ε.
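As an alternative to the closed-form χ² approximation, the threshold θ can also be estimated by simulation: generate many pairs of independent strings, compute their empirical mutual information, and take the (1 − ε) quantile. This is a sketch, not the method used in the experiments; function names are illustrative.

```python
import random
from collections import Counter
from math import log2

def empirical_mi(x, y):
    """Empirical mutual information (bits) of two equal-length sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mi_threshold(n, alphabet="ACGT", eps=0.01, trials=2000, seed=0):
    """Monte Carlo estimate of theta such that
    Pr{I(n) > theta | independent strings} is roughly eps."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        x = "".join(rng.choices(alphabet, k=n))
        y = "".join(rng.choices(alphabet, k=n))
        samples.append(empirical_mi(x, y))
    samples.sort()
    # (1 - eps) empirical quantile of the null distribution.
    return samples[min(trials - 1, int((1 - eps) * trials))]
```

Declaring dependence whenever the observed I(n) exceeds this θ then has a false-positive rate of roughly ε under the independence null.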
Alternative Splicing
Does dependency exist between alternative splicing locations? How can we measure and discover it?

Note: Only about 25,000 genes must produce 100,000 proteins.
Alternative Splicing in zmSRp32
Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number of exons (shaded boxes) and introns (lines) flanked by the 5’ and 3’ untranslated regions, UTR (white boxes). The gene zmSRp32 is coded by 4735 nucleotides and has four alternative splicing variants. Two of these four variants are due to different splicings of this gene, between positions 1–369 and 3243–4220.
Results: Finding Dependency
[Figure 2: four panels (a)–(d) plotting mutual information Ij against base position j on the zmSRp32 gene sequence.]

Figure 2: Ij versus j for various groupings of nucleotides. In (a) and (c), mutual information between the exon found between 1−78 and the intron located between 3243−4220, over: (a) the four-letter alphabet {A,C,G,T}; and (c) the Watson–Crick pairs {AT,CG}. In (b) and (d), mutual information between the exon located between 1−369 and the intron between 3243−4220, for (b) the four-letter alphabet and (d) the two-letter purine/pyrimidine grouping {AG,CT}.
Conclusions
These results strongly suggest that there is significant dependence between the bases in positions 1–369 and certain substrings of the bases in positions 3243–4220.

The 1–369 region contains the 5’ untranslated sequence (UTR), an intron, and the first protein-coding exon; the 3243–4220 sequence encodes an intron that undergoes alternative splicing.

We narrow down the mutual information calculations to the 5’ untranslated region (5’UTR) in positions 1–78 and the 5’UTR intron in positions 78–268.

A close inspection of the resulting mutual information graphs indicates that the dependency is restricted to the alternative exons embedded into the intron sequences in positions 3688–3800 and 3884–4254.
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
3. Information in Correlation
4. Information in Structures: Network Motifs
• Biological Networks
• Finding Biologically Significant Structures
• Information Embedded in Graphs
Modularity in Protein-Protein Interaction Networks
• A functionally modular group of proteins (e.g. a protein complex) islikely to induce a dense subgraph
• Algorithmic approaches target identification of dense subgraphs
• An important problem: How do we define dense?
– Statistical approach: What is significantly dense?
[Figure: the RNA Polymerase II complex and its corresponding induced subgraph in the PPI network.]
Significance of Dense Subgraphs
• A subnet of r proteins is said to be ρ-dense if the number of interactions (edges), F(r), between these r proteins satisfies

F(r) ≥ ρr²

• What is the expected size, Rρ, of the largest ρ-dense subgraph in a random graph?
  – Any ρ-dense subgraph of larger size is statistically significant!
  – Maximum clique is a special case of this problem (ρ = 1)
• G(n, p) model
– n proteins, each interaction occurs with probability p
– Simple enough to facilitate rigorous analysis
• Piecewise G(n, p) model
– Captures the basic characteristics of PPI networks
• Power-law model
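The ρ-density condition above can be checked directly. The sketch below takes one natural reading, treating ρ as the fraction of the r(r−1)/2 possible internal edges (so ρ = 1 corresponds to a clique, as stated on the slide); the slide's F(r) ≥ ρr² normalizes by r² instead, which differs only by a constant factor. The function name is illustrative.

```python
from itertools import combinations

def is_rho_dense(edges, nodes, rho):
    """Check whether the subgraph induced by `nodes` contains at least a
    fraction rho of its r*(r-1)/2 possible internal edges (rho = 1 is a
    clique). Note: the slide's definition uses r^2 as the normalizer."""
    s = set(nodes)
    r = len(s)
    if r < 2:
        return False
    edge_set = {frozenset(e) for e in edges}
    f = sum(1 for u, v in combinations(s, 2) if frozenset((u, v)) in edge_set)
    return f >= rho * r * (r - 1) / 2

# A triangle is 1-dense (a clique):
print(is_rho_dense([(1, 2), (2, 3), (1, 3)], [1, 2, 3], 1.0))  # True
```

Whether a discovered ρ-dense subgraph is *significant* is then settled by comparing its size against the random-graph baselines derived next.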
Largest Dense Subgraph on G(n, p)
Theorem 1 (Koyuturk, Grama, W.S., 2006). If G is a random graph with n nodes, where every edge exists with probability p, then

lim_{n→∞} Rρ / log n = 1 / Hp(ρ),

where

Hp(ρ) = ρ log(ρ/p) + (1 − ρ) log((1 − ρ)/(1 − p))

denotes divergence. More precisely,

P(Rρ ≥ r0) ≤ O( log n / n^{1/Hp(ρ)} ),

where

r0 = (log n − log log n + log Hp(ρ)) / Hp(ρ)

for large n.
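Theorem 1's leading term is easy to evaluate numerically. The sketch below uses base-2 logarithms throughout (the base cancels in the ratio Rρ/log n as long as it is consistent); function names are illustrative.

```python
from math import log2

def divergence(rho: float, p: float) -> float:
    """H_p(rho) = rho*log(rho/p) + (1-rho)*log((1-rho)/(1-p)), in bits."""
    return rho * log2(rho / p) + (1 - rho) * log2((1 - rho) / (1 - p))

def largest_dense_subgraph_size(n: int, p: float, rho: float) -> float:
    """Theorem 1's asymptotic prediction: R_rho ~ log(n) / H_p(rho)."""
    return log2(n) / divergence(rho, p)

# Example: a sparse G(n, p) network, looking for half-dense subgraphs.
# Any half-dense subgraph much larger than this handful of nodes is
# statistically significant.
print(round(largest_dense_subgraph_size(5000, 0.01, 0.5), 2))
```

Note that H_p(p) = 0, so the prediction diverges as ρ approaches the background density p: subgraphs no denser than the background can be arbitrarily large by chance.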
Piecewise G(n, p) Model
• Few proteins with many interacting partners, many proteins with few interacting partners. This captures the basic characteristics of PPI networks, where pl < pb < ph.

• G(V,E), V = Vh ∪ Vl, such that nh = |Vh| ≪ |Vl| = nl, and

P(uv ∈ E(G)) =
  ph if u, v ∈ Vh,
  pl if u, v ∈ Vl,
  pb if u ∈ Vh, v ∈ Vl or u ∈ Vl, v ∈ Vh.

• If nh = O(1), then

P(Rρ ≥ r1) ≤ O( log n / n^{1/H_{pl}(ρ)} ),

where

r1 = (log n − log log n + 2 nh log B + log H_{pl}(ρ) − log e + 1) / H_{pl}(ρ)

and B = pb(1 − pl)/pl + (1 − pb).
SIDES
• An algorithm for identification of Significantly Dense Subgraphs (SIDES)
– Based on the Highly Connected Subgraphs algorithm (Hartuv & Shamir, 2000)
– Recursive min-cut partitioning heuristic
– We use statistical significance as a stopping criterion
Behavior of Largest Dense Subgraph Across Species
[Figure: two scatter plots (ρ = 0.5 and ρ = 1.0) of the size of the largest dense subgraph versus the number of proteins (log-scale) for PPI networks labeled BT, AT, OS, RN, HP, EC, MM, CE, HS, SC, DM, with observed values compared against the G(n, p) and piecewise model predictions.]

Number of nodes vs. size of largest dense subgraph for PPI networks belonging to 9 Eukaryotic species.
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
3. Information in Correlation
4. Information in Structures: Network Motifs
• Biological Networks
• Finding Biologically Significant Structures
• Information Embedded in Graphs
Graph Information Content
The entropy of a random (labeled) graph process G is defined as

HG = E[− log P(G)] = − Σ_{G∈G} P(G) log P(G).

A random structure model is defined for an unlabeled version: some labeled graphs have the same structure.

[Figure: eight labeled graphs G1–G8 on vertex set {1, 2, 3} (random graph model) collapsing to four structures S1–S4 (random structure model).]

The probability of a structure S is P(S) = N(S) · P(G), where N(S) is the number of different labeled graphs having the same structure.

The entropy of a random structure S can be defined as

HS = E[− log P(S)] = − Σ_{S∈S} P(S) log P(S),

where the summation is over all distinct structures.
Automorphism and Erdos-Renyi Graph Model
Graph Automorphism: For a graph G, an automorphism is an adjacency-preserving permutation of the vertices of G.

[Figure: a small example graph on vertices a, b, c, d, e.]

The Erdos and Renyi model G(n, p) generates graphs with n vertices, where edges are chosen independently with probability p.

If G in G(n, p) has k edges, then P(G) = p^k q^{C(n,2)−k}, where q = 1 − p.
Theorem 2 (Y. Choi and W.S., 2008). For large n and all p satisfying (ln n)/n ≪ p and 1 − p ≫ (ln n)/n (i.e., the graph is connected w.h.p.),

HS = C(n,2) h(p) − log n! + o(1) = C(n,2) h(p) − n log n + n log e − (1/2) log n + O(1),

where h(p) = −p log p − (1 − p) log(1 − p) is the entropy rate.

AEP for structures: 2^{−C(n,2)(h(p)+ε)+log n!} ≤ P(S) ≤ 2^{−C(n,2)(h(p)−ε)+log n!}.
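Theorem 2's leading terms can be evaluated numerically: the structural (unlabeled) entropy sits below the labeled-graph entropy C(n,2)h(p) by roughly log2(n!) bits. The sketch below drops the o(1)/O(1) corrections, so it captures only the dominant behavior; function names are illustrative.

```python
from math import lgamma, log, log2

def h(p: float) -> float:
    """Binary entropy h(p) in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def structural_entropy(n: int, p: float) -> float:
    """Theorem 2's leading terms: H_S ~ C(n,2) h(p) - log2(n!)."""
    log2_factorial = lgamma(n + 1) / log(2)   # log2(n!) via the gamma function
    return n * (n - 1) / 2 * h(p) - log2_factorial

# For n = 100, p = 0.5: labeled entropy is C(100,2) = 4950 bits, while
# log2(100!) is about 525 bits, so H_S is roughly 4425 bits.
print(structural_entropy(100, 0.5))
```

This log2(n!) ≈ n log2 n saving is exactly what a structure-aware compressor such as SZIP can exploit over a naive adjacency-matrix encoding.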
Sketch of Proof:
1. HS = HG − log n! + Σ_{S∈S} P(S) log |Aut(S)|.
2. Σ_{S∈S} P(S) log |Aut(S)| = o(1), by the asymmetry of G(n, p).
SZIP Algorithm
Asymptotic Optimality of SZIP
Theorem 3 (Choi, W.S., 2008). Let L(S) be the length of the code generated by our algorithm for all graphs G from G(n, p) that are isomorphic to a structure S.

(i) For large n,

E[L(S)] ≤ C(n,2) h(p) − n log n + n(c + Φ(log n)) + o(n),

where h(p) = −p log p − (1 − p) log(1 − p), c is an explicitly computable constant, and Φ(x) is a fluctuating function with a small amplitude or zero.

(ii) Furthermore, for any ε > 0,

P(L(S) − E[L(S)] ≤ ε n log n) ≥ 1 − o(1).

(iii) Finally, our algorithm runs in O(n + e) time on average, where e is the number of edges.

Our algorithm is asymptotically optimal up to the second-largest term, and works quite well in practice.
THANK YOU