TRANSCRIPT
Information Transfer in Biological Systems: Extracting Information from Biological Data∗
W. Szpankowski†
Department of Computer Science, Purdue University, W. Lafayette, IN 47907
June 21, 2011
NSF STC Center for Science of Information
Venice, 2011
Dedicated to PHILIPPE FLAJOLET
∗Research supported by NSF Science & Technology Center, and Humboldt Foundation. †With M. Aktulga, Y. Choi, A. Grama, M. Koyuturk, I. Kontoyiannis, L. Lyznik, M. Regnier, L. Szpankowski.
Outline
1. What is Information?
• Shannon Legacy
• Post-Shannon Information
2. Information Discovery in Massive Data: Pattern Matching
• Statistical Significance
• Finding Weak Signals in Biological Data
3. Information in Correlation: Finding Dependency
• Mutual Information
• Alternative Splicing
4. Information in Structures: Network Motifs
• Biological Networks
• Finding Biologically Significant Structures
• Information Embedded in Graphs
Biological Signals?
The biological world is highly stochastic and inhomogeneous (S. Salzberg).

[Figure: schematic gene structure — promoter, 5’ UTR, transcription start, start codon, exons and introns with donor/acceptor splice sites, stop codon, poly-A site, and 3’ UTR.]
Life is a delicate interplay of energy, entropy, and information; essential functions of living beings correspond to the generation, consumption, processing, preservation, and duplication of information.
Information Flow in Biology
Manfred Eigen (Nobel Prize, 1967):
“The differentiable characteristic of the living systems is Information. Information assures the controlled reproduction of all constituents, ensuring conservation of viability . . . Information theory, pioneered by Claude Shannon, cannot answer this question . . . in principle, the answer was formulated 130 years ago by Charles Darwin.”
P. Nurse (Nature, 2008, “Life, Logic, and Information”): Focusing on information flow will help us to better understand how cells and organisms work. “. . . the generation of spatial and temporal order, cell memory and reproduction are not fully understood.”
Some Fundamental Questions:
• how information is generated and transferred through underlying mechanisms of variation and selection;
• how information in biomolecules (sequences and structures) relates to the organization of the cell;
• whether there are error correcting mechanisms (codes) in biomolecules;
• and how organisms survive and thrive in noisy environments.
Shannon Information Theory and Its Challenges
The Information Revolution was started in 1948 by Claude Shannon.
Claude Shannon: Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty. “These semantic aspects of communication are irrelevant . . .”
Post-Shannon Challenges: Classical Information Theory needs a recharge to meet the new challenges of today’s applications in biology, modern communication, knowledge extraction, economics, physics, and beyond.
We need to extend Shannon information theory to include (“meaning”):

structure, time, space, and semantics,

and others, such as: dynamic information, limited resources, complexity, physical information, representation-invariant information, and cooperation & dependency.
What is Information?1
C. F. von Weizsäcker:
“Information is only that which produces information” (relativity).
“Information is only that which is understood” (rationality).
“Information has no absolute meaning.”
Informally Speaking: A piece of data carries information if it can impact a recipient’s ability to achieve the objective of some activity in a given context within limited available resources.
Event-Driven Paradigm: Systems, State, Event, Context, Attributes, Objective. The objective function objective(R, C) maps a system’s rules R and context C into an objective space.
Definition 1 (J. Konorski, W.S., 2004). The amount of information (in a faultless scenario) I(E) carried by the event E in the context C, as measured for a system with the rules of conduct R, is

IR,C(E) = cost[objectiveR(C(E)), objectiveR(C(E) + E)]

where cost(·,·) is a cost (weight, distance) function.
1Russell’s reply to Wittgenstein’s precept “whereof one cannot speak, thereof one must be silent” was: “. . . Mr. Wittgenstein manages to say a good deal about what cannot be said.”
Post-Shannon Challenges
Structure: Measures are needed for quantifying information embodied in structures (e.g., material structures, nanostructures, biomolecules, gene regulatory networks, protein interaction networks, social networks).

Time & Space: Classical Information Theory is at its weakest in dealing with problems of delay (e.g., information arriving late may be useless or have less value).

Limited Computational Resources: In many scenarios, information is limited by available computational resources (e.g., a cell phone, a living cell).
NSF Center for Science of Information
In 2010 the National Science Foundation established the $25M Science and Technology Center for Science of Information (http://soihub.org) to advance science and technology through a new quantitative understanding of the representation, communication and processing of information in biological, physical, social and engineering systems.

The center is located at Purdue University; partner institutions include Berkeley, MIT, Princeton, Stanford, UIUC, Bryn Mawr, and Howard U.
Specific Center’s Goals:
• define core theoretical principles governing transfer of information.
• develop meters and methods for information.
• apply to problems in physical and social sciences, and engineering.
• offer a venue for multi-disciplinary long-term collaborations.
• transfer advances in research to education and industry.
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
• Statistical Significance
• Finding Weak Signals in Biological Data
3. Information in Correlation: Finding Dependency
4. Information in Structures: Network Motifs
Pattern Matching
Let W = w1 . . . wm and T be strings over a finite alphabet A.

Basic question: how many times does W occur in T?

Define On(W) — the number of times W occurs in T, that is,

On(W) = #{i : T_{i−m+1} · · · T_i = W, m ≤ i ≤ n}.
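The occurrence count On(W) above includes overlapping matches, which a naive substring search handles directly. A minimal Python sketch (function name is illustrative):

```python
def count_occurrences(w: str, t: str) -> int:
    """Count (possibly overlapping) occurrences of pattern w in text t,
    i.e. O_n(W) = #{i : T_{i-m+1} ... T_i = W}."""
    m = len(w)
    return sum(1 for i in range(len(t) - m + 1) if t[i:i + m] == w)

# Overlapping matches are counted: AAA occurs twice in AAAA.
print(count_occurrences("AAA", "AAAA"))             # 2
print(count_occurrences("ATG", "CGCCATGCCCATGTT"))  # 2
```

Note that Python's built-in `str.count` would report only non-overlapping occurrences, which is why the explicit window scan is used here.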
Basic Thrust of our Approach
When searching for over-represented or under-represented patterns we must ensure that such a pattern is not generated by randomness itself (to avoid too many false positives).
Weak Signals and Artifacts
In a (biological) sequence, whenever a word is overrepresented, its subwords and proximate words are also likely to be overrepresented (the so-called artifacts). Example: if W1 = AATAAA is overrepresented, then W2 = ATAAAN is also overrepresented.
New Approach:
Once a dominating signal has been detected, we look for a weaker signal by comparing the number of observed occurrences of patterns to the conditional expectations, not the regular expectations.

In particular, using the methodology presented above, Denise and Régnier (2002) were able to prove that

E[On(W2) | On(W1) = k] ∼ αn

provided W1 is overrepresented, where α can be explicitly computed (often α = P(W2) if W1 and W2 do not overlap).
Polyadenylation Signals in Human Genes
Beaudoing et al. (2000) studied several variants of the well-known AAUAAA polyadenylation signal in mRNA of human genes. To avoid artifacts, Beaudoing et al. discarded all sequences in which the overrepresented hexamer was found.

Using our approach, Denise and Régnier (2002) detected and eliminated all artifacts and found new signals in a much simpler and more reliable way.
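For intuition about the Z-scores in the table that follows, here is a hedged sketch of how overrepresentation can be scored under a simple i.i.d. background model. This is not the exact method of Beaudoing et al. or Denise–Régnier: it ignores word-overlap autocorrelation, so it is only a first approximation, and `zscore_iid` is an illustrative name.

```python
import math

def zscore_iid(word: str, text: str, base_probs: dict) -> float:
    """Approximate z-score for the count of `word` in `text` under an
    i.i.d. background model (overlap autocorrelation is ignored)."""
    m, n = len(word), len(text)
    positions = n - m + 1
    # Probability of the word at a fixed position under independence.
    p_word = math.prod(base_probs[c] for c in word)
    observed = sum(1 for i in range(positions) if text[i:i + m] == word)
    expected = positions * p_word
    # Binomial variance approximation for the occurrence count.
    var = positions * p_word * (1 - p_word)
    return (observed - expected) / math.sqrt(var)
```

A large positive z-score flags a candidate overrepresented word; the conditional-expectation refinement on the previous slide is what separates genuine weak signals from artifacts of a dominant one.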
Hexamer  Obs.  Rk   Exp.    Z-sc.  Rk   Cd.Exp.  Cd.Z-sc.   Rk
AAUAAA   3456   1   363.16  167.03  1        —        —      1
AAAUAA   1721   2   363.16   71.25  2   1678.53     1.04   1300
AUAAAA   1530   3   363.16   61.23  3   1311.03     6.05    404
UUUUUU   1105   4   416.36   33.75  8    373.30    37.87      2
AUAAAU   1043   5   373.23   34.67  6   1529.15   -12.43   4078
AAAAUA   1019   6   363.16   34.41  7    848.76     5.84    420
UAAAAU   1017   7   373.23   33.32  9    780.18     8.48    211
AUUAAA   1013   8   373.23   33.12 10    385.85    31.93      3
AUAAAG    972   9   184.27   58.03  4    593.90    15.51     34
UAAUAA    922  10   373.23   28.41 13   1233.24    -8.86   4034
UAAAAA    922  11   363.16   29.32 12    922.67     9.79    155
UUAAAA    863  12   373.23   25.35 15    374.81    25.21      4
CAAUAA    847  13   185.59   48.55  5    613.24     9.44    167
AAAAAA    841  14   353.37   25.94 14    496.38    15.47     36
UAAAUA    805  15   373.23   22.35 21   1143.73   -10.02   4068
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
3. Information in Correlation: Finding Dependency
• Mutual Information
• Alternative Splicing
4. Information in Structures: Network Motifs
How to Measure Dependency: Theoretical Background

Question: How to measure dependency between two parts of DNA?

Suppose we have two strings of unequal lengths

X_1^n = X1, X2, . . . , Xn
Y_1^M = Y1, Y2, Y3, . . . , YM,

where M ≥ n, taking values in a common finite alphabet A.

We wish to differentiate between the following two scenarios:

(I) Independence: X_1^n and Y_1^M are independent.
(II) Dependence: Find 1 ≤ J ≤ M − n + 1 such that Y_J^{J+n−1} depends on X_1^n.
To distinguish (I) and (II), we compute the mutual information defined as

I(X; Y) = Σ_{x,y∈A} V(x, y) log [ V(x, y) / (P(x)Q(y)) ],

where V(x, y) is the joint distribution of X and Y. We conclude:
(I) Independence implies I(n) → 0 (since then V(x, y) = P(x)Q(y));
(II) Dependence implies I(n) > 0.
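The empirical version of this quantity is straightforward to compute; here V, P, and Q are estimated by frequencies from two aligned, equal-length sequences (a simplification of the slides' setup, where a window of length n slides along Y; the function name is illustrative).

```python
from collections import Counter
from math import log2

def mutual_information(x: str, y: str) -> float:
    """Empirical mutual information I(X;Y), in bits, between two aligned
    sequences of equal length, with V, P, Q estimated by frequencies."""
    n = len(x)
    assert n == len(y) and n > 0
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # I = sum over (a,b) of V(a,b) * log2( V(a,b) / (P(a) Q(b)) )
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

# Identical sequences give I(X;X) = H(X); here H = 2 bits for a
# uniform four-letter sequence.
print(mutual_information("ACGT" * 100, "ACGT" * 100))  # 2.0
```

For independent strings the estimate is close to, but not exactly, zero at finite n, which is precisely why the thresholding analysis on the next slides is needed.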
Scenario (I): Independence
The multivariate central limit theorem (applied to the empirical frequencies) shows that nI(n) converges in distribution to a (scaled) chi-squared distribution, that is,

(2 ln 2) n I(n) →_D Z ∼ χ²((|A| − 1)²),

where Z has a χ² distribution with k = (|A| − 1)² degrees of freedom.

Therefore,

Pe,1 = Pr{declare dependence | independent strings}
     = Pr{I(n) > θ | independent strings}
     ≈ Pr{Z > (2 ln 2)θn},

hence

Pe,1 ≈ 1 − γ(k, (θ ln 2)n)/Γ(k) ≈ exp(−(θ ln 2)n).
Scenario (II): Dependence
Here, I(n) → I with probability one at rate 1/√n, that is,

√n [I(n) − I] →_D V ∼ N(0, σ²).

Therefore:

Pe,2 = Pr{declare independence | dependent strings}
     = Pr{I(n) ≤ θ | dependent strings}
     ≈ Pr{V ≤ (θ − I)√n} ≈ exp(−n(I − θ)²/(2σ²)).
Practical Approach: In practice, declaring false positives is much more undesirable than overlooking potential dependence. In our experiments we decide on an acceptably small false-positive probability ε, and then select the threshold θ such that

Pe,1 = Pr{I(n) > θ | independent strings} ≈ ε.
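As an alternative to the closed-form χ² approximation, the threshold θ can also be estimated by simulation: generate many pairs of independent strings, compute their empirical mutual information, and take the (1 − ε) quantile. This is a sketch, not the method used in the experiments; function names are illustrative.

```python
import random
from collections import Counter
from math import log2

def empirical_mi(x, y):
    """Empirical mutual information (bits) of two equal-length sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mi_threshold(n, alphabet="ACGT", eps=0.01, trials=2000, seed=0):
    """Monte Carlo estimate of theta such that
    Pr{I(n) > theta | independent strings} is roughly eps."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        x = "".join(rng.choices(alphabet, k=n))
        y = "".join(rng.choices(alphabet, k=n))
        samples.append(empirical_mi(x, y))
    samples.sort()
    # (1 - eps) empirical quantile of the null distribution.
    return samples[min(trials - 1, int((1 - eps) * trials))]
```

Declaring dependence whenever the observed I(n) exceeds this θ then has a false-positive rate of roughly ε under the independence null.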
Alternative Splicing
Does dependency exist between alternative splicing locations? How can we measure and discover it?

Note: Only about 25,000 genes must produce 100,000 proteins.
Alternative Splicing in zmSRp32
Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number of exons (shaded boxes) and introns (lines) flanked by the 5’ and 3’ untranslated regions, UTR (white boxes). The gene zmSRp32 is coded by 4735 nucleotides and has four alternative splicing variants. Two of these four variants are due to different splicings of this gene, between positions 1–369 and 3243–4220.
Results: Finding Dependency
[Figure 2: four panels (a)–(d) plotting mutual information Ij against base position j on the zmSRp32 gene sequence.]

Figure 2: Ij versus j for various groupings of nucleotides. In (a) and (c), mutual information between the exon found between 1−78 and the intron located between 3243−4220, over: (a) the four-letter alphabet {A,C,G,T}; and (c) the Watson–Crick pairs {AT,CG}. In (b) and (d), mutual information between the exon located between 1−369 and the intron between 3243−4220, for (b) the four-letter alphabet and (d) the two-letter purine/pyrimidine grouping {AG,CT}.
Conclusions
These results strongly suggest that there is significant dependence between the bases in positions 1–369 and certain substrings of the bases in positions 3243–4220.

The 1–369 region contains the 5’ untranslated sequence (UTR), an intron, and the first protein-coding exon; the 3243–4220 sequence encodes an intron that undergoes alternative splicing.

We narrow down the mutual information calculations to the 5’ untranslated region (5’UTR) in positions 1–78 and the 5’UTR intron in positions 78–268.

A close inspection of the resulting mutual information graphs indicates that the dependency is restricted to the alternative exons embedded into the intron sequences in positions 3688–3800 and 3884–4254.
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
3. Information in Correlation
4. Information in Structures: Network Motifs
• Biological Networks
• Finding Biologically Significant Structures
• Information Embedded in Graphs
Modularity in Protein-Protein Interaction Networks
• A functionally modular group of proteins (e.g. a protein complex) islikely to induce a dense subgraph
• Algorithmic approaches target identification of dense subgraphs
• An important problem: How do we define dense?
– Statistical approach: What is significantly dense?
[Figure: the RNA Polymerase II complex and its corresponding induced subgraph in the PPI network.]
Significance of Dense Subgraphs
• A subnet of r proteins is said to be ρ-dense if the number of interactions (edges), F(r), between these r proteins satisfies

F(r) ≥ ρr²

• What is the expected size, Rρ, of the largest ρ-dense subgraph in a random graph?
  – Any ρ-dense subgraph of larger size is statistically significant!
  – Maximum clique is a special case of this problem (ρ = 1)
• G(n, p) model
– n proteins, each interaction occurs with probability p
– Simple enough to facilitate rigorous analysis
• Piecewise G(n, p) model
– Captures the basic characteristics of PPI networks
• Power-law model
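The ρ-density condition above can be checked directly. The sketch below takes one natural reading, treating ρ as the fraction of the r(r−1)/2 possible internal edges (so ρ = 1 corresponds to a clique, as stated on the slide); the slide's F(r) ≥ ρr² normalizes by r² instead, which differs only by a constant factor. The function name is illustrative.

```python
from itertools import combinations

def is_rho_dense(edges, nodes, rho):
    """Check whether the subgraph induced by `nodes` contains at least a
    fraction rho of its r*(r-1)/2 possible internal edges (rho = 1 is a
    clique). Note: the slide's definition uses r^2 as the normalizer."""
    s = set(nodes)
    r = len(s)
    if r < 2:
        return False
    edge_set = {frozenset(e) for e in edges}
    f = sum(1 for u, v in combinations(s, 2) if frozenset((u, v)) in edge_set)
    return f >= rho * r * (r - 1) / 2

# A triangle is 1-dense (a clique):
print(is_rho_dense([(1, 2), (2, 3), (1, 3)], [1, 2, 3], 1.0))  # True
```

Whether a discovered ρ-dense subgraph is *significant* is then settled by comparing its size against the random-graph baselines derived next.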
Largest Dense Subgraph on G(n, p)
Theorem 1 (Koyuturk, Grama, W.S., 2006). If G is a random graph with n nodes, where every edge exists with probability p, then

lim_{n→∞} Rρ / log n = 1 / Hp(ρ),

where

Hp(ρ) = ρ log(ρ/p) + (1 − ρ) log((1 − ρ)/(1 − p))

denotes divergence. More precisely,

P(Rρ ≥ r0) ≤ O( log n / n^{1/Hp(ρ)} ),

where

r0 = (log n − log log n + log Hp(ρ)) / Hp(ρ)

for large n.
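Theorem 1's leading term is easy to evaluate numerically. The sketch below uses base-2 logarithms throughout (the base cancels in the ratio Rρ/log n as long as it is consistent); function names are illustrative.

```python
from math import log2

def divergence(rho: float, p: float) -> float:
    """H_p(rho) = rho*log(rho/p) + (1-rho)*log((1-rho)/(1-p)), in bits."""
    return rho * log2(rho / p) + (1 - rho) * log2((1 - rho) / (1 - p))

def largest_dense_subgraph_size(n: int, p: float, rho: float) -> float:
    """Theorem 1's asymptotic prediction: R_rho ~ log(n) / H_p(rho)."""
    return log2(n) / divergence(rho, p)

# Example: a sparse G(n, p) network, looking for half-dense subgraphs.
# Any half-dense subgraph much larger than this handful of nodes is
# statistically significant.
print(round(largest_dense_subgraph_size(5000, 0.01, 0.5), 2))
```

Note that H_p(p) = 0, so the prediction diverges as ρ approaches the background density p: subgraphs no denser than the background can be arbitrarily large by chance.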
Piecewise G(n, p) Model
• Few proteins with many interacting partners, many proteins with few interacting partners. This captures the basic characteristics of PPI networks, where pl < pb < ph.

• G(V,E), V = Vh ∪ Vl, such that nh = |Vh| ≪ |Vl| = nl, and

P(uv ∈ E(G)) =
  ph if u, v ∈ Vh,
  pl if u, v ∈ Vl,
  pb if u ∈ Vh, v ∈ Vl or u ∈ Vl, v ∈ Vh.

• If nh = O(1), then

P(Rρ ≥ r1) ≤ O( log n / n^{1/H_{pl}(ρ)} ),

where

r1 = (log n − log log n + 2 nh log B + log H_{pl}(ρ) − log e + 1) / H_{pl}(ρ)

and B = pb(1 − pl)/pl + (1 − pb).
SIDES
• An algorithm for identification of Significantly Dense Subgraphs (SIDES)
– Based on the Highly Connected Subgraphs algorithm (Hartuv & Shamir, 2000)
– Recursive min-cut partitioning heuristic
– We use statistical significance as a stopping criterion
Behavior of Largest Dense Subgraph Across Species
[Figure: two scatter plots (ρ = 0.5 and ρ = 1.0) of the size of the largest dense subgraph versus the number of proteins (log-scale) for PPI networks labeled BT, AT, OS, RN, HP, EC, MM, CE, HS, SC, DM, with observed values compared against the G(n, p) and piecewise model predictions.]

Number of nodes vs. size of largest dense subgraph for PPI networks belonging to 9 Eukaryotic species.
Outline Update
1. What is Information?
2. Information Discovery in Massive Data: Pattern Matching
3. Information in Correlation
4. Information in Structures: Network Motifs
• Biological Networks
• Finding Biologically Significant Structures
• Information Embedded in Graphs
Graph Information Content
The entropy of a random (labeled) graph process G is defined as

HG = E[− log P(G)] = − Σ_{G∈G} P(G) log P(G).

A random structure model is defined for an unlabeled version: some labeled graphs have the same structure.

[Figure: eight labeled graphs G1–G8 on vertex set {1, 2, 3} (random graph model) collapsing to four structures S1–S4 (random structure model).]

The probability of a structure S is P(S) = N(S) · P(G), where N(S) is the number of different labeled graphs having the same structure.

The entropy of a random structure S can be defined as

HS = E[− log P(S)] = − Σ_{S∈S} P(S) log P(S),

where the summation is over all distinct structures.
Automorphism and Erdos-Renyi Graph Model
Graph Automorphism: For a graph G, an automorphism is an adjacency-preserving permutation of the vertices of G.

[Figure: a small example graph on vertices a, b, c, d, e.]

The Erdos and Renyi model G(n, p) generates graphs with n vertices, where edges are chosen independently with probability p.

If G in G(n, p) has k edges, then P(G) = p^k q^{C(n,2)−k}, where q = 1 − p.
Theorem 2 (Y. Choi and W.S., 2008). For large n and all p satisfying (ln n)/n ≪ p and 1 − p ≫ (ln n)/n (i.e., the graph is connected w.h.p.),

HS = C(n,2) h(p) − log n! + o(1) = C(n,2) h(p) − n log n + n log e − (1/2) log n + O(1),

where h(p) = −p log p − (1 − p) log(1 − p) is the entropy rate.

AEP for structures: 2^{−C(n,2)(h(p)+ε)+log n!} ≤ P(S) ≤ 2^{−C(n,2)(h(p)−ε)+log n!}.
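Theorem 2's leading terms can be evaluated numerically: the structural (unlabeled) entropy sits below the labeled-graph entropy C(n,2)h(p) by roughly log2(n!) bits. The sketch below drops the o(1)/O(1) corrections, so it captures only the dominant behavior; function names are illustrative.

```python
from math import lgamma, log, log2

def h(p: float) -> float:
    """Binary entropy h(p) in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def structural_entropy(n: int, p: float) -> float:
    """Theorem 2's leading terms: H_S ~ C(n,2) h(p) - log2(n!)."""
    log2_factorial = lgamma(n + 1) / log(2)   # log2(n!) via the gamma function
    return n * (n - 1) / 2 * h(p) - log2_factorial

# For n = 100, p = 0.5: labeled entropy is C(100,2) = 4950 bits, while
# log2(100!) is about 525 bits, so H_S is roughly 4425 bits.
print(structural_entropy(100, 0.5))
```

This log2(n!) ≈ n log2 n saving is exactly what a structure-aware compressor such as SZIP can exploit over a naive adjacency-matrix encoding.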
Sketch of Proof:
1. HS = HG − log n! + Σ_{S∈S} P(S) log |Aut(S)|.
2. Σ_{S∈S} P(S) log |Aut(S)| = o(1), by the asymmetry of G(n, p).
SZIP Algorithm
Asymptotic Optimality of SZIP
Theorem 3 (Choi, W.S., 2008). Let L(S) be the length of the code generated by our algorithm for all graphs G from G(n, p) that are isomorphic to a structure S.

(i) For large n,

E[L(S)] ≤ C(n,2) h(p) − n log n + n(c + Φ(log n)) + o(n),

where h(p) = −p log p − (1 − p) log(1 − p), c is an explicitly computable constant, and Φ(x) is a fluctuating function with a small amplitude or zero.

(ii) Furthermore, for any ε > 0,

P(L(S) − E[L(S)] ≤ ε n log n) ≥ 1 − o(1).

(iii) Finally, our algorithm runs in O(n + e) time on average, where e is the number of edges.

Our algorithm is asymptotically optimal up to the second-largest term, and works quite well in practice.
THANK YOU