the information processing mechanism of dna and efficient dna storage olgica milenkovic university...
Post on 22-Dec-2015
![Page 1: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/1.jpg)
The Information Processing
Mechanism of DNA and Efficient DNA Storage
Olgica Milenkovic University of Colorado, Boulder
Joint work with B. Vasic
![Page 2: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/2.jpg)
Outline
PART I: HOW DOES DNA ENSURE ITS DATA INTEGRITY?
• Information Theory of Genetics: an emerging discipline
• Error-Correction and Proofreading in genetic processes
• What type of codes “operate” at the level of bio-chemical processes of the Central Dogma?
• Spin Glasses, Kauffman’s “NK” Model, Regulatory Network of Gene Interactions and Low-Density Parity-Check (LDPC) Codes
• Cancer, dysfunctional proofreading and chaos theory
PART II: HOW DOES ONE STORE DNA? (DNA COMPRESSION)
• Structure of DNA: Statistics and Modeling
• DNA Compression
• Genome Compression
• New Distance Measures and One-Way Communication
PART III: NEW CODING PROBLEMS IN GENETICS
![Page 3: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/3.jpg)
Information theory of genetics
2003: 50th Anniversary of discovery: DNA has a double-helix structure!
(Crick, Watson, Franklin, Wilkins 1953)
2003: Completion of the Human Genome Project (98% HDNA sequenced)
Every day an average of 15 new sequences added to SwissProt+GenBank
Vast amount of genetic data just starting to be analyzed!
DNA is a CODE, but very little is known about its
• exact information content
• nature of redundancy
• statistical properties
• secondary structure
• influence on disease development and control
• underlying error-correcting mechanism
![Page 4: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/4.jpg)
Information Theory of DNA
Helps in understanding the
EVOLUTION of DNA
FUNCTIONALITY of DNA
DISEASE DEVELOPMENT
IT community still not involved in this area!
Signal Processing Community is just getting involved:
Special Issue of Signal Processing Journal devoted to Genetics, 2003.
![Page 5: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/5.jpg)
The League of Extraordinary Gentlemen…
![Page 6: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/6.jpg)
I
How is information stored in a genetic sequence?
What are the atoms of information?
![Page 7: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/7.jpg)
The DNA Polymer…
[Figure: sugar-phosphate backbone of alternating deoxyribose (sugar) and PO4 groups; the deoxyribose ring is drawn with its carbons numbered 1’ to 5’]
![Page 8: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/8.jpg)
The Bases…
Purine Bases: Adenine (A); Guanine (G)
Pyrimidine Bases: Thymine (T); Cytosine (C)
[Figure: the double helix]
![Page 9: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/9.jpg)
The Pairing Rule…
A and T paired through TWO hydrogen bonds
G and C paired through THREE hydrogen bonds
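The pairing rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the complementary strand is returned reversed on the assumption that both strands are read 5’ to 3’.

```python
# Watson-Crick pairing rule from the slide: A-T (two hydrogen bonds), G-C (three).
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def complement_strand(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return "".join(PAIR[base] for base in reversed(strand))


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding this strand to its complement."""
    return sum(H_BONDS[base] for base in strand)
```

For example, `complement_strand("ATGC")` gives `"GCAT"`, and the four pairs are held together by 2 + 2 + 3 + 3 = 10 hydrogen bonds.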
![Page 10: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/10.jpg)
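The Genetic Code…

| First Letter | Second Letter: U | Second Letter: C | Second Letter: A | Second Letter: G | Third Letter |
|---|---|---|---|---|---|
| U | UUU phe | UCU ser | UAU tyr | UGU cys | U |
| U | UUC phe | UCC ser | UAC tyr | UGC cys | C |
| U | UUA leu | UCA ser | UAA stop | UGA stop | A |
| U | UUG leu | UCG ser | UAG stop | UGG trp | G |
| C | CUU leu | CCU pro | CAU his | CGU arg | U |
| C | CUC leu | CCC pro | CAC his | CGC arg | C |
| C | CUA leu | CCA pro | CAA gln | CGA arg | A |
| C | CUG leu | CCG pro | CAG gln | CGG arg | G |
| A | AUU ile | ACU thr | AAU asn | AGU ser | U |
| A | AUC ile | ACC thr | AAC asn | AGC ser | C |
| A | AUA ile | ACA thr | AAA lys | AGA arg | A |
| A | AUG met | ACG thr | AAG lys | AGG arg | G |
| G | GUU val | GCU ala | GAU asp | GGU gly | U |
| G | GUC val | GCC ala | GAC asp | GGC gly | C |
| G | GUA val | GCA ala | GAA glu | GGA gly | A |
| G | GUG val | GCG ala | GAG glu | GGG gly | G |

Abbreviations
ala = alanine; arg = arginine; asn = asparagine; asp = aspartic acid; cys = cysteine
gln = glutamine; glu = glutamic acid; gly = glycine; his = histidine; ile = isoleucine
leu = leucine; lys = lysine; met = methionine; phe = phenylalanine; pro = proline
ser = serine; thr = threonine; trp = tryptophan; tyr = tyrosine; val = valine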
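As a small illustration of how the genetic code acts as a lookup table, the sketch below translates an mRNA string codon by codon. It uses the standard genetic code with one-letter amino-acid symbols (the slide uses three-letter abbreviations); the names `CODON_TABLE` and `translate` are hypothetical.

```python
BASES = "UCAG"
# Standard genetic code in one-letter symbols ('*' = stop), ordered by
# first letter, then second letter, then third letter, each running over U, C, A, G.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string in frame 0, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[pos:pos + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

With this table, `translate("AUGUUUGGCUAA")` reads met, phe, gly and then stops, giving `"MFG"`.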
![Page 11: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/11.jpg)
Genes, Exons, Introns (Junk DNA)…
Genes: Sequences of base pairs coding for chains of amino acids
Consist of exon (coding) and intron (non-coding) regions
Length: anything from several tens up to several million base pairs
EXAMPLE: Among the most complex identified genes is DYSTROPHIN (2 million bps, more than 60 exons, codes for 4000 amino acids)
Escherichia coli: around 4000 genes; Humans: 35000-40000 genes
Junk DNA: “Disrespectful” name for introns
Significant fraction of DNA
Shown (last year) to be “somewhat” responsible for RNA coding (Far from being “junk”, but function still not well understood…)
![Page 12: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/12.jpg)
The Central Dogma…
DNA → (Replication) → DNA
DNA → (Transcription) → mRNA → (Translation) → Proteins
A Communication Theory Perspective:
DNA sequence → Genetic Channel → DNA sequence, mRNA, Proteins
What kind of errors are introduced by the Genetic Channel?
![Page 13: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/13.jpg)
DNA within Chromosomes (tight packing):
• DNA wrapped around HISTONES (proteins)
• HISTONES are organized in NUCLEOSOMES
• NUCLEOSOMES form CHROMATIN, which is folded into CHROMOSOMES
Processing in the Genetic Channel: DNA REPLICATION
Untying the knots: Topoisomerases
Unwinding the helix: Helicases
Getting it all started: Primers
Doing the hard work: Polymerases
Sealing the segments: Ligases
Helping to keep two sides apart: SSB
![Page 14: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/14.jpg)
Replication: more details
Rules: Replication always proceeds in 5’ to 3’ direction;
Replication is semi-conservative;
Replication is a parallel process for eukaryotes;
Facts: Polymerases can stitch together any combination
of bases (“Ps are a little bit sloppy’’)
Timing for replication:
E. Coli: 40 min
Humans (parallel): < 2 hours
Can be prolonged for proofreading purposes
![Page 15: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/15.jpg)
Errors…
Combination of substitution, deletion, insertion (replication fork), shift, reversal, etc. errors
(Complete exon or intron deleted, or simple base pair deletions)
1. Tautomeric shifts (transition/transversion): *T-G, *G-T, *C-A, *A-C
2. Recombination between non-identical molecules (“HETERODUPLEX mismatches”)
3. Spontaneous DEAMINATION (C to U, C to T, C-G to T-A), METHYLATION (CpG), rare
4. APURINIC/APYRIMIDINIC SITES (due to HYDROLYSIS)
5. CROSS-LINKS
6. STRAND-BREAKAGE, OXIDATIVE DAMAGE ERRORS
7. LOSS OF 5000-10000 PURINE and 200-500 PYRIMIDINE bases (20 hours) due to radiation
Replication Errors: Polymerase misinsertion probability between $10^{-3}$ and $10^{-5}$
[Figure: four replication-error mechanisms drawn on short strands: Miscoding, Slippage, Miscoding-Realignment, Slippage-Dislocation]
![Page 16: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/16.jpg)
Bio-chemical mechanism responsible for error correction?
Proofreading (Maroni, Molecular and Genetic Analysis of Human Traits):
Replication polymerases error rate $10^{-3}$; human DNA with $3 \cdot 10^{9}$ bps, total of $10^{6}$ errors
Example: C to U conversion causes presence of deoxyuridine, detected by uracil-DNA GLYCOSYLASE
Glycosylase process acts like erasure channel
1. Proofreading based on semi-conservative nature of replication
2. Excision Repair Mechanisms: Arrays of Exonucleases
Show large degree of pre-correction binding activity; correction performed by EXCISION
“Jumping” occurs between different genes !!! (Lin, Lloyd, Roberts, Nucleases)
Reduce error levels by an additional several orders of magnitude
Mismatch-specific post-replication enzymes
Total number of errors per human DNA replication: on average JUST ONE
Replication and Repair have been optimized for balancing spontaneous mutational load:
Permitting evolution without threatening fitness or survival
![Page 17: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/17.jpg)
Characteristics of DNA ECC:
• Error correction performed on different levels
• Error correction performed in very short time
• Extremely large number of very diverse errors corrected
• Error correction tied to global structure of DNA (not to consecutive base pairs)
• Error correction also depends on DNA topology
![Page 18: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/18.jpg)
Identify ECCs of DNA…
Error-Correcting Codes in DNA: Forsdyke (1981), Wolny (1983), Eigen (1993), Liebovitch et al. (1996), Battail (1997), Rosen and Moore (2003), Mac Dónaill (2003)
Theories:
• Non-coding regions are in-series error detecting sequences!
• Ordering of coding/non-coding regions responsible for error-correction!
• Complementary base pairing corresponds to error-detecting code!
Donor/Acceptor: hydrogen atom/lone electron pair
1 represents donor, 0 acceptor
Additionally, add 0 or 1 for purine and pyrimidine
Code: A 1010
G 0110
T 0101
C 1001
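The four codewords above can be checked mechanically. The sketch below assumes that the first three bits are the donor/acceptor pattern and the fourth bit is the purine/pyrimidine flag (this split is inferred from the slide's description); all helper names are hypothetical.

```python
from itertools import combinations

# Codewords as listed on the slide.
CODE = {"A": "1010", "G": "0110", "T": "0101", "C": "1001"}


def hamming(a: str, b: str) -> int:
    """Number of positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def has_even_parity(word: str) -> bool:
    return word.count("1") % 2 == 0


def is_complementary(x: str, y: str) -> bool:
    """True if every donor bit of x faces an acceptor bit of y and vice versa."""
    return all(a != b for a, b in zip(CODE[x][:3], CODE[y][:3]))


MIN_DISTANCE = min(hamming(CODE[a], CODE[b]) for a, b in combinations(CODE, 2))
```

Every codeword has even parity and the minimum Hamming distance is 2, so any single bit flip is detectable (but not correctable). Under the assumed bit layout, only A-T and G-C have fully complementary donor/acceptor patterns.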
![Page 19: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/19.jpg)
BEST ERROR CORRECTING MECHANISM: Deinococcus radiodurans
• Microbe with extreme radiation resistance
• Able to survive radiation doses thousands of times higher than would kill most organisms, including humans
• Surpasses the cockroach by orders of magnitude!
Why? Because of its remarkable DNA-repair mechanism!!! D. radiodurans flawlessly regenerates its radiation-shattered genome in about 24 hours.
“Conan the Bacterium”
(to conquer the Red Planet!)
![Page 20: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/20.jpg)
Something seemingly unrelated…
Spin Glasses, the Ising Model, Hopfield Networks or “Boltzmann Machines”:
State x of a spin glass with N spins that may take values in {-1,+1}
Energy of the state x: E, external field h
The Hamiltonian:
$$E(x; \{J_{ij}\}, \{h_i\}) = -\frac{1}{2} \sum_{m,n} J_{mn} x_m x_n - \sum_{n} h_n x_n$$
Hamiltonian for Ising model:
$$E(x; J, h) = -\frac{1}{2} \sum_{m,n} J x_m x_n - \sum_{n} h x_n$$
[Figure: arrangement of + and - signs illustrating “frustration”]
Example:
Water exists as a gas, liquid or solid, but all microscopic elements are H2O molecules
This is due to intermolecular interactions depending on temperature, pressure etc.
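A minimal sketch of the energy function, assuming the usual sign convention $E = -\frac{1}{2}\sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n$ with a symmetric coupling matrix and zero diagonal. The three-spin antiferromagnetic triangle used below is an illustrative example of frustration chosen here, not taken from the slide.

```python
from itertools import product


def energy(x, J, h):
    """Energy of spin state x (entries in {-1, +1}) for couplings J and fields h."""
    n = len(x)
    pair_term = sum(J[m][k] * x[m] * x[k] for m in range(n) for k in range(n))
    field_term = sum(h[k] * x[k] for k in range(n))
    return -0.5 * pair_term - field_term


# Hypothetical example: three spins, every pair wants to be anti-aligned.
J_TRIANGLE = [[0, -1, -1], [-1, 0, -1], [-1, -1, 0]]
H_ZERO = [0, 0, 0]
GROUND_ENERGY = min(
    energy(x, J_TRIANGLE, H_ZERO) for x in product((-1, 1), repeat=3)
)
```

No assignment can anti-align all three pairs at once, so the lowest energy is -1 rather than the -3 that three satisfied bonds would give. This is what “frustration” refers to.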
![Page 21: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/21.jpg)
Something seemingly unrelated…
Codes on graphs: the most powerful class of error correcting codes in information theory, including Turbo, Low-Density Parity-Check (LDPC), Repeat-Accumulate (RA) Codes
H =
1 1 1 0 1 0 0
0 1 1 1 0 1 0
1 0 1 1 0 0 1
[Figure: bipartite graph of H with Variables on one side and Checks on the other]
Most important consequence of graphical description: efficient iterative decoding
• Variable nodes communicate to check nodes their reliability
• Check nodes decide which variables are unreliable and “suppress” their inputs
• Number of edges in graph = density of H
• Sparse = small complexity
• Detrimental for convergence of decoder: presence of short cycles in code graph
Applications of LDPC codes: cryptography, compression, distributed source coding for sensor networks, error control coding in optical and wireless communication and in magnetic and optical storage…
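To make the role of H concrete, the sketch below computes the syndrome of a received word for the 3 × 7 matrix on this slide and corrects a single flipped bit by matching the syndrome to a column of H. This is plain syndrome decoding, not the iterative decoder discussed in the talk; it works here because all seven columns of this particular H are distinct and nonzero. Function names are hypothetical.

```python
# Parity-check matrix from the slide (rows = checks, columns = variables).
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def syndrome(word):
    """Return H * word over GF(2); all zeros means every check is satisfied."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)


def correct_single_error(word):
    """Flip the one bit whose column of H equals the syndrome, if any."""
    s = syndrome(word)
    if not any(s):
        return list(word)
    columns = [tuple(row[j] for row in H) for j in range(len(H[0]))]
    fixed = list(word)
    fixed[columns.index(s)] ^= 1
    return fixed
```

For instance, 1000101 satisfies all three checks. Flipping its third bit gives the syndrome (1, 1, 1), which is the third column of H, so the error position is recovered.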
![Page 22: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/22.jpg)
Gallager’s Decoding Algorithm A
Works for (Binary Symmetric Channel) BSC:
Each variable sends its channel reliability unless all incoming messages from checks say “change”
Each check sends estimate of the bit based on modulo two sum of other bits participating in the check
Alternative view: Variables=Atoms; Binary Values=Spins;
Variables “align” or “misalign” according to interaction patterns
$$H = -\sum J_{i_1 i_2 \ldots i_r} S_{i_1} S_{i_2} \cdots S_{i_r} - h \sum_i S_i$$
LDPC equivalent to diluted spin glasses
Ground state search for above Hamiltonian = maximum a posteriori decoding of codeword
Average magnetization at a site = MAP decision for individual variable
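The correspondence between ground-state search and decoding can be illustrated by brute force on a tiny code. The sketch assumes a binary symmetric channel with crossover probability below 1/2 and equally likely codewords, so that MAP decoding reduces to picking the nearest codeword. Bits are mapped to spins by $S = 1 - 2b$, the parity checks are imposed as hard constraints, and the received word plays the role of the external field. The 3 × 7 matrix is the small example H from the earlier slide; all names are hypothetical.

```python
from itertools import product

H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def is_codeword(bits):
    return all(sum(h * b for h, b in zip(row, bits)) % 2 == 0 for row in H)


CODEWORDS = [c for c in product((0, 1), repeat=7) if is_codeword(c)]


def energy(candidate, received):
    """Field energy -sum_i S_i(candidate) * S_i(received), with S = 1 - 2 * bit."""
    return -sum((1 - 2 * c) * (1 - 2 * r) for c, r in zip(candidate, received))


def ground_state(received):
    """Lowest-energy codeword, i.e. the codeword closest to the received word."""
    return min(CODEWORDS, key=lambda c: energy(c, received))
```

The energy equals $-(n - 2d)$, where $d$ is the Hamming distance to the received word, so the ground state is the minimum-distance codeword. Exhaustive search is only feasible for toy codes; iterative decoders approximate this minimization.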
![Page 23: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/23.jpg)
Something seemingly unrelated…
The regulatory Network of Gene Interactions (RNGI)
Kauffman (1960’s): “NK” Evolution through Changing Interactions between Genes
Life exists at the Edge of Chaos! BASED ON SPIN GLASSES!
RANDOM BOOLEAN FUNCTION MODEL:
Evolution carried by genes, not base pairs, and the way genes interact!
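State transitions from time T to time T+1:

| G1 (T) | G2 (T) | G3 (T) | G1 (T+1) | G2 (T+1) | G3 (T+1) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 1 |

[Figure: wiring diagram of the genes G1, G2, G3]

Update rule for G1 (inputs G1, G2):

| G1 | G2 | new G1 |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |

Update rule for G2 (inputs G1, G3):

| G1 | G3 | new G2 |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Update rule for G3 (inputs G1, G2):

| G1 | G2 | new G3 |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |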
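The random Boolean function model is easy to simulate. The sketch below encodes the three single-gene update tables from this slide (G1 from G1, G2; G2 from G1, G3; G3 from G1, G2), applies them synchronously, and enumerates the attractors. The names `step` and `attractor_of` are hypothetical.

```python
from itertools import product

# Single-gene update tables as read from the slide.
G1_RULE = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 0}  # inputs (G1, G2)
G2_RULE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}  # inputs (G1, G3)
G3_RULE = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 1}  # inputs (G1, G2)


def step(state):
    """Synchronous update of all three genes."""
    g1, g2, g3 = state
    return (G1_RULE[(g1, g2)], G2_RULE[(g1, g3)], G3_RULE[(g1, g2)])


def attractor_of(state):
    """Follow the trajectory until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])


ATTRACTORS = {attractor_of(s) for s in product((0, 1), repeat=3)}
```

With these rules the eight states fall into two basins: a point attractor at 001 and a periodic attractor of length two, 101 ↔ 010.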
![Page 24: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/24.jpg)
Chaos, Attractors, Connectivity
Boolean networks: dynamical systems
Attractors: point and periodic
Number and period lengths characterized by network topology + choices of Boolean node functions
[Figure: Kimatograph of the network, with the eight states drawn in two groups: 100, 000, 001, 110 and 111, 011, 101, 010]
MOST IMPORTANT topological factor:
CONNECTIVITY
KEY: Sparse connectivity allows enough variability for evolutionary processes, produces self-organizing structures, but doesn’t allow the system to “get trapped in” chaotic behavior
MOST IMPORTANT Boolean function factors:
BIAS (number of 1 outputs)
CANALIZATION (depends on number of inputs determining output)
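Both function-level factors can be computed directly from a truth table. The sketch below uses the common definition of a canalizing function (some single input, fixed to some value, determines the output regardless of the other inputs); this definition is assumed here, since the slide describes it only informally. Note that constant functions count as canalizing under it.

```python
from itertools import product


def bias(f, k):
    """Number of input combinations (out of 2**k) on which f outputs 1."""
    return sum(f(*x) for x in product((0, 1), repeat=k))


def is_canalizing(f, k):
    """True if fixing one input to one value forces the output of f."""
    inputs = list(product((0, 1), repeat=k))
    for i in range(k):
        for value in (0, 1):
            outputs = {f(*x) for x in inputs if x[i] == value}
            if len(outputs) == 1:
                return True
    return False
```

For example, AND has bias 1 and is canalizing (either input at 0 forces output 0), while XOR has bias 2 and is not canalizing.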
![Page 25: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/25.jpg)
The NK model and RNGI
N= number of genes; K=number of genes co-interacting with one given gene
K=2 critical value (mainly frozen states with islands of changing interaction)
Interaction between genes in regulatory network: very limited in scope
K ranges anywhere from 2-3 to 10-15: if we check carefully, logarithmic in N, i.e. the number of genes
Between 2 and 3 for Escherichia coli (around a thousand genes)
4 and 8 for higher metazoa (several thousand genes)
Can explain the process of cell differentiation: genetic material of each cell the same, yet cells functionally and morphologically very different
Each cell type CORRESPONDS TO ONE GIVEN ATTRACTOR of the RNGI
Counting attractors for networks with N=40000 genes, K=2 gives 260 cell types (correct number 258).
![Page 26: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/26.jpg)
KEY IDEA: LDPC Code with Given Decoding Algorithm is a BOOLEAN
NETWORK, SPIN GLASS,…
Example: LDPC Code under Gallager’s A Algorithm
LDPC Code:
Variables and Checks
LDPC Code:
The Control Graph
In the Control Graph, edge (i,j) exists if i-th bit controls j-th bit (i.e. if i and j are at distance exactly two)
Boolean function determined by decoding algorithm: for Gallager’s A algorithm, takes form of truncated/periodically repeated MORSE-THUE sequence

| n | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | … |
|---|---|---|---|---|---|---|---|---|---|
| n in binary | 0 | 1 | 10 | 11 | 100 | 101 | 110 | 111 | … |
| Morse-Thue | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | … |
Properties:
Self-Similar (fractal)
Results in unbiased Boolean functions
[Figure: control graph on the nodes G1, G2, G3, G4]
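The three rows shown for the Morse-Thue sequence (index, binary expansion, sequence value) match the usual definition: the n-th term is the parity of the number of 1s in the binary expansion of n. A minimal sketch, with a hypothetical function name:

```python
def thue_morse(n: int) -> int:
    """n-th term of the Morse-Thue sequence: parity of the binary digit sum of n."""
    return bin(n).count("1") % 2


FIRST_EIGHT = [thue_morse(n) for n in range(8)]
```

The first eight terms are 0 1 1 0 1 0 0 1. Every prefix of length $2^k$ (k ≥ 1) contains equally many 0s and 1s, which is consistent with the “unbiased” property noted on the slide.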
![Page 27: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/27.jpg)
Use Boolean Network Analysis for LDPC Codes
• No cycles of length four, code regular: uniform choice for Boolean function
• Cycles of length four: Boolean functions vary, many more attractors
• In no case are the functions canalizing
$$F_i(t+1) = \Big[ (1 - G_i)\, N_{i_1} N_{i_2} \cdots N_{i_s} + G_i \big( 1 - (1 - N_{i_1}) \cdots (1 - N_{i_s}) \big) \Big] \bmod 2$$
$N_{i_1}, N_{i_2}, \ldots, N_{i_s}$: modulo two sums of variable nodes connected to controls $C_{i_1}, C_{i_2}, \ldots, C_{i_s}$
Boolean derivative:
$$\frac{\partial f(x)}{\partial x_j} = f^{(j,0)}(x) \oplus f^{(j,1)}(x), \qquad f^{(j,l)}(x) = f(x_1, \ldots, x_{j-1}, l, x_{j+1}, \ldots, x_N)$$
Can use mean-field theorems to see when initial perturbations in the codewords disappear in the limit: use the Boolean derivative, sensitivity analysis, iterative Jacobian and Lyapunov exponent (as in Shmulevich et al.):
Jacobian F is an N × N matrix with $f_{ij}(x) = \partial f_i(x) / \partial x_j$ in entry (i,j).
$$d(t+1) = F(t) \odot d(t) = H\big( F(t)\, d(t) \big), \qquad H\big( (g_1, \ldots, g_N) \big) = \big( H(g_1), \ldots, H(g_N) \big),$$
$$H(g_i) = \begin{cases} 0, & g_i = 0 \\ 1, & g_i > 0 \end{cases}$$
g
![Page 28: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/28.jpg)
Use Boolean Network Analysis for LDPC Codes
Iterated Jacobian:
$$IJ(t) = F(0) \cdot F(1) \cdots F(t)$$
Lyapunov exponent:
$$\lambda(T) = \frac{1}{T} \sum_{t} \log \gamma(t), \qquad \gamma(t) = |IJ(t+1)| / |IJ(t)|, \qquad |M| = (1/N^2) \sum_{i,j} M_{i,j}$$
The influence of variable $x_i$ on the Boolean function f is defined as the expectation of the partial derivative with respect to the distribution of the variables $x_1, x_2, \ldots, x_N$:
$$I_i(f) = E[\partial f / \partial x_i] = P\{\partial f / \partial x_i = 1\}$$
Influence carries important information about frozen states, error susceptibility etc.
Control of the chaotic phase in a Boolean network by means of periodic pulses (with period T) that “freeze” a fraction of nodes
Iterative change of size of “stable core”:
$$s(t+1) = \sum_{i=0}^{K} c_i \binom{K}{i} s(t)^{K-i} \big( 1 - s(t) \big)^{i}$$
![Page 29: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/29.jpg)
LDPC Codes and Gallager’s A Decoding Algorithm
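$$f(z_1, z_2, z_3) = (1 - z_1)\, z_2 z_3 + z_1 (1 - z_2)(1 - z_3)$$
$$\partial f(z_1, z_2, z_3) / \partial z_1 = 1 \oplus z_2 \oplus z_3, \qquad \partial f(z_1, z_2, z_3) / \partial z_2 = z_1 \oplus z_3, \qquad \partial f(z_1, z_2, z_3) / \partial z_3 = z_1 \oplus z_2$$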
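The Boolean derivative and the influence can be checked by brute force. The sketch below reads the slide's function as $f(z_1, z_2, z_3) = (1 - z_1) z_2 z_3 + z_1 (1 - z_2)(1 - z_3)$ (a reconstruction from the transcript, so treat it as an assumption) and takes the influence with respect to the uniform distribution on inputs. Helper names are hypothetical.

```python
from itertools import product


def f(z1, z2, z3):
    """Assumed reading of the slide's f: 1 when z2 and z3 both disagree with z1."""
    return ((1 - z1) * z2 * z3 + z1 * (1 - z2) * (1 - z3)) % 2


def boolean_derivative(func, j, x):
    """f with x_j set to 0, XOR f with x_j set to 1 (other inputs unchanged)."""
    low, high = list(x), list(x)
    low[j], high[j] = 0, 1
    return func(*low) ^ func(*high)


def influence(func, j, n):
    """Probability, under uniform inputs, that flipping x_j changes func."""
    points = list(product((0, 1), repeat=n))
    return sum(boolean_derivative(func, j, x) for x in points) / len(points)
```

Enumerating all eight inputs reproduces the three derivatives $1 \oplus z_2 \oplus z_3$, $z_1 \oplus z_3$ and $z_1 \oplus z_2$, and each variable has influence 1/2 under the uniform distribution.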
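Three cases for the update of bit A from neighboring bits B, C, D (three checks, two checks, one check):

| A | C1=(B) | C2=(C) | C3=(D) | F3(A) | A | B | C | D | C1=(B⊕C) | C2=(D) | F1(A) | A | B | C | D | C1=(B⊕C⊕D) | F2(A) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |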
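The truth table above follows one simple rule: bit A keeps its value unless every check estimate disagrees with it. The sketch below implements that local rule and the three cases in the table. It is a sketch of the per-bit Boolean function only, not a full message-passing implementation of Gallager’s A algorithm, and the function names are hypothetical.

```python
from itertools import product


def gallager_a_update(a, checks):
    """Keep bit a unless all check estimates say 'change'."""
    if checks and all(c != a for c in checks):
        return 1 - a
    return a


def f3(a, b, c, d):
    """Three checks, each seeing one neighbor: C1 = B, C2 = C, C3 = D."""
    return gallager_a_update(a, [b, c, d])


def f1(a, b, c, d):
    """Two checks: C1 = B xor C, C2 = D."""
    return gallager_a_update(a, [b ^ c, d])


def f2(a, b, c, d):
    """One check: C1 = B xor C xor D."""
    return gallager_a_update(a, [b ^ c ^ d])


ONES_IN_F3 = sum(f3(*x) for x in product((0, 1), repeat=4))
```

Each of the three functions outputs 1 on exactly 8 of the 16 inputs, matching the slide's remark that the resulting Boolean functions are unbiased.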
![Page 30: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/30.jpg)
New decoding methods for LDPC and other Block Codes…
Work in Progress:
• Decoders that don’t operate on the frozen core
• Decoders that periodically freeze some variables to avoid chaotic behavior
• Iterative decoders that work for asymmetric channels and channels with insertion/deletion errors
![Page 31: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/31.jpg)
![Page 32: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/32.jpg)
Bold Conjecture:
The ECC of DNA Replication operates on multiple levels
Carrier of information is gene, not base pair
The Global level involves Genes; Local levels may involve exons or base pairs in general;
The Global Code is an LDPC Code!
Wigner observed that the same mathematical concepts turn up in entirely unexpected connections in the whole of science… (no explanation as of yet)
LDPC related to statistical physics (spin glasses) to neural networks to self-organizing systems to …
R. Sole and B. Goodwin, Signs of Life: How Complexity Pervades Biology
![Page 33: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/33.jpg)
Table 1: Example of 15-node regulatory network in terms of gene controls

| Gene | Controls | Control (after addition) |
|---|---|---|
| 1 | 2,3,13 | 2,3,13 |
| 2 | 1,3 | 1,3 |
| 3 | 4,5,6 | 4,5,6,1,2 |
| 4 | 5,6 | 5,6,3 |
| 5 | 4,9,6 | 4,9,6 |
| 6 | 5,9,3,4,7,8 | 3,4,5,7,8,9 |
| 7 | 15,8,6 | 15,8,6 |
| 8 | 7,9 | 7,9 |
| 9 | 4,5,6 | 4,5,6 |
| 10 | 9,13,15 | 9,13,15 |
| 11 | 8,12,13 | 8,12,13 |
| 12 | 11,13,15 | 8,11,13,15 |
| 13 | 8,14,15 | 8,14,15,11,12 |
| 14 | X | X |
| 15 | 11,12 | 11,12,13,14,10,7 |
The Corresponding LDPC Code
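| Row | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |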
15-gene interaction example by Hashimoto (Shmulevich, Anderson
Cancer Center)
Need q-ary LDPC code corresponding to different levels of interaction
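Since short cycles matter both for decoder convergence and for the choice of Boolean functions, it is natural to check the example matrix for cycles of length four. A length-four cycle exists in the code graph exactly when two rows of the parity-check matrix share 1s in at least two columns. The sketch below lists each row of the 12 × 15 matrix above by the columns holding a 1 (as read from the transcript); the names are hypothetical.

```python
from itertools import combinations

# Columns (1-based) containing a 1 in each of the 12 rows of the example matrix.
ROWS = [
    {1, 2, 3}, {1, 13}, {3, 4, 5, 6}, {5, 9},
    {6, 7, 8, 9}, {7, 15}, {4, 9}, {9, 10},
    {10, 13, 15}, {8, 11, 12, 13}, {13, 14, 15}, {11, 12, 15},
]


def four_cycle_row_pairs(rows):
    """Pairs of rows (1-based) that overlap in two or more columns."""
    return [
        (i + 1, j + 1)
        for (i, r1), (j, r2) in combinations(enumerate(rows), 2)
        if len(r1 & r2) >= 2
    ]


FOUR_CYCLES = four_cycle_row_pairs(ROWS)
```

For this matrix the check reports two such pairs: rows 9 and 11 (sharing columns 13 and 15) and rows 10 and 12 (sharing columns 11 and 12).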
![Page 34: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/34.jpg)
Cancer: genetic disorder of somatic cells
Human cancer: INDUCED and SPONTANEOUS
• Accumulation of mutant (erroneous) genes that control cell cycle, maintain genomic stability, and mediate apoptosis
• Causes of mutation: depurination and depyrimidination of DNA; proofreading and mismatch errors during DNA replication
• Deamination of 5-methylcytosine to produce C to T base pair substitutions; and damage to DNA and its replication imposed by products of metabolism (notably oxidative damage caused by oxygen free radicals)
• Defective DNA excision-repair; low levels of antioxidants, antioxidant enzymes, and nucleophiles that trap DNA-reactive electrophiles; and enzymes that conjugate nucleophiles with DNA-damaging electrophiles
![Page 35: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/35.jpg)
Cancer Research
To summarize: Various forms of cancer tightly linked to malfunctioning of proofreading (ECC) mechanism
Cancer cells: correspond to a special type of attractor of the RNGI
(A cancer cell is “just another configuration” of RNGI)
(Shmulevich et al., Anderson Cancer Research Center)
This attractor has genes interacting in a way that results in uncontrolled cell division
Key observation: C-Change in RNGI results in further weakening of the proofreading system, and vice versa
![Page 36: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/36.jpg)
Example 1: Cancer cells cheat the proofreading mechanism regulating reduction in length of telomeres
Aging: during each cell division, telomeres get shorter and shorter…
When they become too short, errors in replication happen, leading to cancer
(a time bomb in our body)
Cancer cells “cheat” proofreading mechanism and allow telomeres to maintain constant length
Finding the error-control mechanism: classifying diseases accurately, curing diseases (including cancer) by gene therapy, making telomere lengths constant over long time…
![Page 37: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/37.jpg)
Example 2: Breast cancer gene BRCA1 (a tumor suppressor) tightly linked to error-control of DNA and cell division regulation
![Page 38: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/38.jpg)
How to obtain results practically? DNA Microarrays!
Figure taken from Shmulevich et al.
![Page 39: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/39.jpg)
II
How can one efficiently store DNA sequences?
![Page 40: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/40.jpg)
DNA Storage: Compression
GenBank/Swiss-Prot: storage of large number of DNA and protein sequences (17471 million sequences in GenBank, 2002)
Every day, an average of 15 new sequences added to database
• DNA compression absolutely necessary to maintain banks
• Fractal DNA structure to be exploited
• Possible use of Tsallis entropy
• Need novel compression algorithms
• DNA sequences of related species differ in very small percent of base pairs: need cross-reference compression
• Need meaningful definition of DNA distance
-- major paradigm shift from base-pair distance to chromosomal distance --
![Page 41: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/41.jpg)
Statistical properties of DNA sequences
Bases within the human mitochondrion (length approximately 17000) appear with the following frequencies:

| A | T | G | C |
|---|---|---|---|
| 0.31 | 0.13 | 0.25 | 0.31 |

while within different regions of human fetal globin gene:

| Region | A | T | G | C |
|---|---|---|---|---|
| Introns | 0.27 | 0.29 | 0.27 | 0.17 |
| Exons | 0.24 | 0.22 | 0.28 | 0.25 |

Parts of genetic sequences can be modeled by Markov chains of given order and transition probabilities; order 2-7
BPs, like CG, have very small probability: most notorious triplet repeats, related to Huntington’s disease and Fragile-X mental retardation, consist of these very unlikely “CG” pairs: (CGG)m, (CCG)m, m = number of repetitions
Regions of uniform distribution: isochores; can stretch in length up to hundreds of Kbps
Repetitive patterns: tandem repeats (TR), random repeats (RR), short interspersed repeat sequences (SINEs, 9% of DNA), long interspersed repeat sequences (LINEs).
Junk-DNA seems to have long-range (fractal) characteristics.
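The base frequencies above already bound what a memoryless (order-0) compressor can achieve. The sketch below computes the Shannon entropy in bits per base for the three compositions quoted on this slide. Frequencies are renormalized because the exon values sum to 0.99 after rounding; the names are hypothetical.

```python
from math import log2

# Base frequencies (A, T, G, C) as quoted on the slide.
FREQS = {
    "mitochondrion": [0.31, 0.13, 0.25, 0.31],
    "introns": [0.27, 0.29, 0.27, 0.17],
    "exons": [0.24, 0.22, 0.28, 0.25],
}


def shannon_bits(freqs):
    """Order-0 Shannon entropy in bits per base, after renormalizing."""
    total = sum(freqs)
    return -sum((f / total) * log2(f / total) for f in freqs if f > 0)


ENTROPIES = {name: shannon_bits(f) for name, f in FREQS.items()}
```

All three values lie just below the trivial 2 bits per base (the mitochondrial composition gives roughly 1.93), so single-base statistics alone leave little room for compression. Gains have to come from higher-order structure such as Markov dependence and repeats.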
![Page 42: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/42.jpg)
A fractal pattern arises from the so-called DNA walk: a graphical representation of the DNA sequence in which one moves up for C or T and down for A or G.
Can have two- or three-dimensional random walk: further differentiation of A, G, C, T
[Figure: DNA walk with directions labeled C, A, T, G]
Fractal dimension of the DNA molecule: 0.85 for higher species, 1 for lower
Use lingual analysis of human languages for exploring DNA "language" (Zipf method)
http://library.thinkquest.org/26242/full/ap/ap13.html
DNAWalker: http://athena.bioc.uvic.ca/pbr/walk/
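The one-dimensional DNA walk described on this slide (up for C or T, down for A or G) takes only a few lines. The function name is hypothetical.

```python
# One step up for pyrimidines (C, T), one step down for purines (A, G).
STEP = {"C": 1, "T": 1, "A": -1, "G": -1}


def dna_walk(sequence: str) -> list:
    """Return the height of the walk after each base of the sequence."""
    height, path = 0, []
    for base in sequence:
        height += STEP[base]
        path.append(height)
    return path
```

For example, `dna_walk("CTAG")` returns `[1, 2, 1, 0]`. Plotting the path of a long sequence gives the kind of picture from which fractal properties are estimated.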
![Page 43: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/43.jpg)
DNA and Cantor Sets
Provata and Almirantis, 2003: Fractal Cantor pattern in DNA
Exons - filled regions
Introns - empty regions
Random, fractal, Cantor-like set
Implication: the atom (carrier of information) is an exon/intron pair
History-based random walk and DNA description in terms of urn models
Only introns in higher species have higher complexity than in lower species
Both coding and non-coding regions exhibit long-range correlation, with spectral density of introns $1/f^{\beta}$
![Page 44: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/44.jpg)
Known algorithms
| FILE | COMPRESSION RATE (ACHIEVABLE) | GZIP | ARITHM. | VPS2A | UNIX COMPRESS | BIO-COMPRESS | BWT | GTAC |
| Human Growth Hormone (HUMGHCSA) | 2.00 | 2.065 | 2.052 | 1.607 | 2.19 | 1.31 | 1.608 | 1.1 |
GenCompress (Chen, ’97)
Biocompress (Grumbach/Tahi, ’94)
Fact (Rivals, ’00)
GenomeSequenceCompress (Sato et al., ’00)
Use characteristics of DNA like repeats, reverse complements…
Compression rate is about 1.74 bits per base (78% in compression ratio)
Two classes: statistical and grammar-based compression algorithms
Huffman, Lempel-Ziv, Arithmetic Coding, Burrows-Wheeler, Kieffer’s Grammar-Based Schemes (with DNA-specific modifications)
No known algorithm is specially suited to the fractal nature of DNA, although DNA is 90% fractal!
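To make the "bits per base" figures above concrete, here is a minimal sketch that measures the rate of a general-purpose compressor against the trivial 2-bit packing of the four bases. It uses Python's standard `zlib` module as a stand-in for gzip-style compression; it is not an implementation of any of the DNA-specific algorithms named above, and the helper names and toy sequence are assumptions.

```python
import zlib


def bits_per_base(seq, compressed_size_bytes):
    """Compression rate in bits per base."""
    return 8 * compressed_size_bytes / len(seq)


def two_bit_pack(seq):
    """Pack each base into 2 bits (the trivial 2.00 bits/base baseline)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | code[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a final partial byte
        out.append(byte)
    return bytes(out)


def reverse_complement(seq):
    """Reverse the sequence and swap A<->T, C<->G."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


seq = "ACGT" * 256  # hypothetical, highly repetitive toy sequence
packed_rate = bits_per_base(seq, len(two_bit_pack(seq)))
zlib_rate = bits_per_base(seq, len(zlib.compress(seq.encode("ascii"), 9)))
```

The toy sequence is far more repetitive than real DNA, so `zlib_rate` here is much lower than the gzip figure reported in the table, where general-purpose tools do not beat 2 bits per base.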
![Page 45: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/45.jpg)
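Different Entropy Measures:
Shannon Entropy: $H_S(p) = -\sum_i p_i \log p_i$
Rényi Entropy: $H_R(p) = \frac{1}{1-q} \log \sum_i p_i^q$
Tsallis Entropy: $H_T(p) = \frac{1 - \sum_i p_i^q}{q - 1}$
TE is non-additive in the sense that for two independent PS A, B:
$H_T(A,B) = H_T(A) + H_T(B) + (1-q)\,H_T(A)\,H_T(B)$
$H_S(p)$ is derived from $H_T(p)$ in the limit $q \to 1$, using $\log z = \lim_{n \to 0} \frac{z^n - 1}{n}$
Hausdorff Dimension: $D = \lim_{r \to 0} \frac{\log N(r)}{\log(1/r)}$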
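The entropy measures on this slide can be checked numerically. The sketch below assumes the standard definitions (Shannon: minus the sum of p log p; Rényi: log of the sum of p^q, divided by 1 - q; Tsallis: one minus the sum of p^q, divided by q - 1) and verifies the non-additivity rule for Tsallis entropy on two independent distributions. The function names are illustrative, and the example distributions reuse the base frequencies quoted earlier in the talk.

```python
import math


def shannon_entropy(p):
    """H_S(p) = -sum_i p_i log p_i (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi_entropy(p, q):
    """H_R(p) = log(sum_i p_i^q) / (1 - q), for q != 1."""
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1 - q)


def tsallis_entropy(p, q):
    """H_T(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    return (1 - sum(pi ** q for pi in p if pi > 0)) / (q - 1)


def product_distribution(pa, pb):
    """Joint distribution of two independent sources."""
    return [a * b for a in pa for b in pb]


pa = [0.31, 0.13, 0.25, 0.31]  # mitochondrial base frequencies from the talk
pb = [0.27, 0.29, 0.27, 0.17]  # intron base frequencies from the talk
q = 2.0

joint = tsallis_entropy(product_distribution(pa, pb), q)
pseudo_additive = (
    tsallis_entropy(pa, q)
    + tsallis_entropy(pb, q)
    + (1 - q) * tsallis_entropy(pa, q) * tsallis_entropy(pb, q)
)
near_shannon = tsallis_entropy(pa, 1 + 1e-6)  # q close to 1
```

Here `joint` and `pseudo_additive` agree up to rounding, and `near_shannon` approaches `shannon_entropy(pa)` as q tends to 1.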
![Page 46: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/46.jpg)
Approach: Use “Fractal Grammars”
Barthel, Brandau, Hermesmeier, Heising: Fractal Prediction, 1997
Zerotree wavelet coding using fractal prediction
Inference of context-free grammars from fractal data sets
Syntactic generation of fractals
Theory of formal languages can be used to state the problem of "syntactic fractal pattern recognition"
Explore connections with wavelets (ideas by Jacques Blanc-Talon)
Example: Heighway dragon and Koch curve
[Equation: a grammar G over the symbols a, b, c, d, e, f, g, h with rewriting maps h_1 and h_2; the individual production rules are not legible in this transcript.]
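Syntactic generation of the two example fractals can be illustrated with a string-rewriting system. The sketch below uses the standard textbook L-system rules for the Koch curve and the Heighway dragon; these are an assumption chosen for illustration and are not the grammar shown on the slide.

```python
def expand(axiom, rules, iterations):
    """Rewrite every symbol in parallel, `iterations` times.

    Symbols without a rule are copied unchanged.
    """
    word = axiom
    for _ in range(iterations):
        word = "".join(rules.get(symbol, symbol) for symbol in word)
    return word


# F = draw forward, + / - = turn left / right.
KOCH = {"F": "F+F--F+F"}                 # turn angle 60 degrees
DRAGON = {"X": "X+YF+", "Y": "-FX-Y"}    # turn angle 90 degrees

koch_1 = expand("F", KOCH, 1)
dragon_2 = expand("FX", DRAGON, 2)
```

Feeding the resulting strings to a turtle-graphics interpreter draws the curves. The number of `F` segments grows as 4^n for the Koch curve and 2^n for the dragon, which is the self-similarity a grammar-based compressor would try to exploit.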
![Page 47: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/47.jpg)
How does one compress sets of related DNA Sequences?
Distributed Source Coding Problem: Peculiar Correlation Patterns
Could explore Wavelet Based Compression
Distributed Source Coding with LDPC Codes…
![Page 48: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/48.jpg)
Genomic Distance and One-Way Communication
Major paradigm shift in genetic distance measure: from base-pair distance (involving deletion, insertion and substitution; see Sankoff, Kruskal, Time Warps, String Edits, and Macromolecules) to chromosomal distance based on global arrangements of genes
Inversions are the primary mechanism of genome rearrangement!
REVERSAL DISTANCE
The smallest number of inversions necessary to transform one genome into another
Finding the minimum number of reversals needed to “sort” a permutation
Permutations are signed, indicating direction of transcription
Example: (+1 +3 +2) → (+1 -2 -3) → (+1 +2 -3) → (+1 +2 +3)
How does one perform one-way communication (SENDING INFORMATION TO A RECEIVER WHO POSSESSES CORRELATED INFORMATION) under the reversal distance measure?
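The signed-permutation example above can be reproduced with a small brute-force search. This is only a sketch for tiny genomes: it explores all reversals breadth-first, which grows factorially, and it is not one of the efficient algorithms for sorting signed permutations by reversals. Function names are illustrative assumptions.

```python
from collections import deque


def reverse_segment(perm, i, j):
    """Invert perm[i..j] (inclusive): reverse the order and flip the signs."""
    return perm[:i] + tuple(-g for g in reversed(perm[i:j + 1])) + perm[j + 1:]


def reversal_distance(source, target):
    """Minimum number of signed reversals turning source into target (BFS)."""
    source, target = tuple(source), tuple(target)
    if sorted(map(abs, source)) != sorted(map(abs, target)):
        raise ValueError("genomes must contain the same genes")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i, n):
                nxt = reverse_segment(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached when both genomes contain the same genes


first_step = reverse_segment((1, 3, 2), 1, 2)
distance = reversal_distance((1, 3, 2), (1, 2, 3))
```

For the slide's example the search confirms that three reversals are needed, matching the three steps shown.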
![Page 49: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/49.jpg)
The other way around: DNA compression methods increase network efficiency by up to 10 times
Peribit's SR-50 compressor
Uses molecular sequence reduction (MSR) algorithms similar to those used to match patterns in the study of DNA.
The algorithms identify and eliminate previously undetected repetitions in network traffic in wide area networks (WANs), giving compression ratios of between 1.2:1 for voice and video and 5:1 for SQL traffic.
![Page 50: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/50.jpg)
III
Additional Coding Problems in Genetics
![Page 51: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/51.jpg)
DNA Computing
Codes with Constant GC Content and invariant under Watson-Crick Inversion
Microarray Error Control Coding
Using design theory to reduce error rate of DNA array data
Use novel clustering algorithms for DNA Array Data
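As a small illustration of the DNA-computing code constraints listed above, the sketch below checks whether a set of codewords has constant GC content and whether it is closed under reverse complementation. Reading "invariant under Watson-Crick inversion" as closure under reverse complement is an assumption, and the example code set and function names are hypothetical.

```python
def gc_content(word):
    """Number of G and C bases in a codeword."""
    return sum(base in "GC" for base in word)


def reverse_complement(word):
    """Reverse the word and swap A<->T, C<->G."""
    return word[::-1].translate(str.maketrans("ACGT", "TGCA"))


def has_constant_gc(code, weight):
    """True if every codeword contains exactly `weight` G/C bases."""
    return all(gc_content(word) == weight for word in code)


def is_closed_under_reverse_complement(code):
    """True if the reverse complement of every codeword is also a codeword."""
    return all(reverse_complement(word) in code for word in code)


code = {"AACG", "CGTT", "ACCT", "AGGT"}  # hypothetical toy code
constant_gc = has_constant_gc(code, 2)
closed = is_closed_under_reverse_complement(code)
```

A practical design would also impose a minimum distance between codewords; that part is omitted here.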
![Page 52: The Information Processing Mechanism of DNA and Efficient DNA Storage Olgica Milenkovic University of Colorado, Boulder Joint work with B. Vasic](https://reader036.vdocuments.mx/reader036/viewer/2022062516/56649d795503460f94a5bdfa/html5/thumbnails/52.jpg)
Conclusion
Genetics is the most exciting source of new ideas for coding theory
The atom of information is a gene, not a base pair or a triple of base pairs
The error control code of the genome is to be found operating on the level of genes
Compression, phylogenetic tree construction: comparison of species has to be performed on the level of genes first
Once the genes are compared, one can move to local base pair comparisons