Post on 05-Dec-2014
Algorithmic Information Theory and Computational Biology
Hector Zenil
Unit of Computational Medicine, Karolinska Institutet
Sweden
Hector Zenil AIT Tools for Biology and Medicine
Complex Adaptive Systems (CAS)
Complexity is hard to quantify in biology
Mapping quantitative stimuli to qualitative behaviour
Information Theory in Biology
Sequence alignment
Pattern recognition
Sequence logos
Binding site detection
Motif detection
Consensus sequences
Biological significance
[based on Claude Shannon’s Information Theory, 1948]
Algorithmic Information Theory
Which sequence looks more random?
(a) AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
(b) AGGTCGTGAAGTGCGATGGCCTTACGTAGC
(c) GCGCGCGCGCGCGCGCGCGCGCGCGCGC
Classical probability theory vs. Kolmogorov Complexity
Definition
$K_U(s) = \min\{|p| : U(p) = s\}$ (1)
Compressibility
A sequence s is c-compressible if |p| + c = |s|, where p is the shortest program that outputs s; such a sequence has low Kolmogorov complexity. A sequence is random if K(s) ≈ |s|.
[Kolmogorov (1965); Chaitin (1966)]
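The compressibility idea can be tried out with an off-the-shelf compressor. The following is a minimal sketch, assuming that the length of a zlib-compressed string is an acceptable stand-in for |p| (it only gives an upper bound on K, up to a constant). The sequence length, the seed and the helper name `compressed_length` are illustrative choices, not part of the talk.

```python
import random
import zlib


def compressed_length(sequence: str) -> int:
    """Bytes needed by zlib for `sequence`: a computable upper bound on K (up to a constant)."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


rng = random.Random(0)  # fixed seed so the illustration is reproducible
n = 3000                # long enough for zlib's fixed overhead not to dominate

repetitive = "A" * n                                          # like sequence (a)
periodic = "GC" * (n // 2)                                    # like sequence (c)
random_like = "".join(rng.choice("ACGT") for _ in range(n))  # like sequence (b)

for name, seq in [("repetitive", repetitive),
                  ("periodic", periodic),
                  ("random-like", random_like)]:
    print(f"{name:12s} |s| = {len(seq)} characters, compressed = {compressed_length(seq)} bytes")
```

A short compressed size proves non-randomness; a long one proves nothing, since a better compressor might still find a regularity.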
Examples
Example 1
Sequences like (a) have low algorithmic complexity because they allow a short description. For example, “20 times A”. No matter how long (a) grows, the length of the description grows only as about log2(k) (k times A).
Example 2
The sequence (b) is algorithmically random because it doesn’t seem to allow a description (much) shorter than (b) itself.
For sequence (a), for example, a proof of non-randomness amounts to exhibiting a short program. Compressibility is therefore a sufficient test of non-randomness.
Example of an evaluation of K
The sequence (c) GCGCGC...GC is not algorithmically random (it has low K complexity) because it can be produced by the following program (take G=0 and C=1):
Program A(i):
  1: n := 0
  2: Print n mod 2
  3: n := n + 1
  4: If n = i Goto 6
  5: Goto 2
  6: End
The length of A (in bits) is an upper bound on K(GCGCGC...GC).
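Program A translates almost line for line into Python. This is a sketch, assuming i ≥ 1 and writing the symbols G and C directly instead of 0 and 1; the function name `program_a` is made up for the illustration.

```python
def program_a(i: int) -> str:
    """Emit the first i symbols of GCGC..., i.e. n mod 2 for n = 0 .. i-1 with G=0 and C=1."""
    symbols = "GC"
    return "".join(symbols[n % 2] for n in range(i))


if __name__ == "__main__":
    for i in (6, 28, 1000):
        output = program_a(i)
        # The program text is fixed; only the parameter i changes, and writing
        # i down takes about log2(i) bits.
        print(f"i = {i}: output length {len(output)}, bits to write i: {i.bit_length()}")
```

The output grows linearly with i while the description (fixed program plus i) grows only logarithmically, which is why the sequence has low K.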
The ultimate measure of pattern detection and optimal prediction
Kolmogorov and Chaitin, Schnorr, and Martin-Löf independently provided three different approaches to randomness (compression, predictability and typicality).
They proved (for infinite sequences):
incompressibility ⇐⇒ unpredictability ⇐⇒ typicality
When this happens in mathematics, a concept (here, randomness) has been objectively captured.
This is why prediction in biology is hard. AIT tells us that no effective statistical test can recognise all patterns and no computable technique can fully predict all outcomes. The problem is deeply connected to computability and algorithmic information theory.
[Solomonoff (1964); Kolmogorov (1965); Chaitin (1969)]
Information distances and similarity metrics
Measures waiting to be introduced in bioinformatics
Information Distance: $ID(x, y) = \max\{K(x|y), K(y|x)\}$
Universal Similarity Metric: $USM(x, y) = \max\{K(x|y), K(y|x)\} / \max\{K(x), K(y)\}$
Normalised Information Distance and NCD: $NCD(x, y) = (K(xy) - \min\{K(x), K(y)\}) / \max\{K(x), K(y)\}$
Normalized Compression Measure (NCM): $NC(s) = K(s)/|s|$ (asymptotic behaviour)
Bennett’s Logical Depth: $LD_d(s) = \min\{t(p) : (|p| - |p^*| < d) \text{ and } (U(p) = s)\}$
(for an example application, see Zenil, Complexity 2011)
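In practice K is replaced by the output length of a real compressor. Below is a minimal sketch of the NCD formula, assuming zlib as the compressor; the helper names `c` and `ncd`, the seed and the sequence lengths are illustrative only.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes; plays the role of K in the NCD formula."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy, cxy = c(bx), c(by), c(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(1)  # fixed seed, illustration only
seq_a = "".join(rng.choice("ACGT") for _ in range(2000))
seq_b = "".join(rng.choice("ACGT") for _ in range(2000))

if __name__ == "__main__":
    print("NCD(a, a) =", round(ncd(seq_a, seq_a), 3))  # near 0: nothing new in the second copy
    print("NCD(a, b) =", round(ncd(seq_a, seq_b), 3))  # near 1: unrelated sequences
```

Because real compressors are imperfect, NCD(x, x) is small but not exactly 0, and the value depends on the compressor chosen.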
Non-systematic but successful attempts in biology
GenCompress is a compression algorithm for DNA sequences: $d(x, y) = 1 - (K(x) - K(x|y))/K(xy)$
NCD applied to genetic similarity:
AIT looks at the genome as information, not as data (letters). Counting: traditional Shannon-entropy style sequencing. Interpreting: AIT. The full power of the theory hasn’t yet been unleashed.
To be or not to be...
Borel’s “Infinite Monkey” theorem
[Figure: diagram labelled “Input” whose items include 0, 1, 1024, ∞∞, √2, π, CH3, “Syntax error” and “To be or not to be, that is the question.”]
Algorithmic probability
Producing π
This C-language code produces the first 1000 digits of π (Gjerrit Meinsma):
    long k=4e3,p,a[337],q,t=1e3;
    main(j){for(;a[j=q=0]+=2,--k;)
    for(p=1+2*k;j<337;q=a[j]*k+q%p*t,a[j++]=q/p)
    k!=j>2?:printf("%.3d",a[j-2]%t+q/p/t);}
Producing non-random sequences:
If an object has low Kolmogorov complexity then it has a short description and a greater probability of being produced by a random program. The less random a string, the more likely it is to be produced by a short program.
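The link between short descriptions and high probability of production is made precise by algorithmic probability and the Coding Theorem. The statement below is the standard form for a prefix-free universal machine U; the symbol m(s) follows common textbook usage and does not appear on the slide.

```latex
\[
  m(s) \;=\; \sum_{p \,:\, U(p) = s} 2^{-|p|},
  \qquad
  K(s) \;=\; -\log_2 m(s) + O(1).
\]
```

Each program p contributes $2^{-|p|}$, the chance of producing it by fair coin tosses, so the shortest programs dominate the sum.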
Biological Big Data Analysis
The information bottleneck:
Small Data matters: local measurements of information content are a good indication of the global information content of an object. Evidence: BDM image classification. Compression works at large scales, looking for long regularities, while BDM is very local. Yet both yield astonishingly similar results for these object sizes.
Complementary methods for different sequence lengths
The methods to approximate K coexist and complement each other for different sequence lengths.
Method                        short strings   long strings   scalability
                              (< 100 bits)    (> 100 bits)
Lossless compression method   ×               √              √
Coding Theorem method         √               ×              ×
Block Decomposition method    √               √              √
[Zenil, Soler, Delahaye, Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility (2012)]
Coding Theorem method and lossless compression
The transition between one method and the other. What is complex for the Coding Theorem method is less compressible.
[Soler, Zenil, Delahaye, Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures (2012)]
Online Algorithmic Complexity Calculator
Provides: Shannon’s entropy, lossless compression (Deflate) values, Kolmogorov complexity approximations and relative frequency order (algorithmic probability).
A Mathematica API and an R module.
Datasets available online at the Dataverse Network.
Basic data analysis tool for short sequence comparison.
[http://www.complexitycalculator.com]
Online Algorithmic Complexity Calculator 2
[http://www.complexitycalculator.com]
Simulation of natural systems w/complex symbolic systems
An elementary cellular automaton (ECA) is defined by a local function $f : \{0, 1\}^3 \to \{0, 1\}$.
f maps the state of a cell and its two immediate neighbours (range = 1) to a new cell state: $f_t : (r_{-1}, r_0, r_{+1}) \to r_0$. Cells are updated synchronously according to f over all cells in a row.
[Wolfram, (1994)]
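The update rule is compact enough to write out. This is a minimal sketch using Wolfram's rule numbering (bit k of the rule number is the output for the neighbourhood whose binary value is k) and assuming periodic boundary conditions, which the slide does not specify; `eca_step` and `eca_run` are illustrative names.

```python
def eca_step(row, rule):
    """One synchronous update of ECA `rule` (0-255) on `row`, a list of 0/1 cells."""
    n = len(row)
    new_row = []
    for i in range(n):
        # row[i - 1] wraps to the last cell when i == 0 (periodic boundary)
        left, centre, right = row[i - 1], row[i], row[(i + 1) % n]
        neighbourhood = 4 * left + 2 * centre + right  # value 0..7
        new_row.append((rule >> neighbourhood) & 1)
    return new_row


def eca_run(row, rule, steps):
    """Return the initial row followed by `steps` successive updates."""
    history = [list(row)]
    for _ in range(steps):
        history.append(eca_step(history[-1], rule))
    return history


if __name__ == "__main__":
    start = [0] * 15 + [1]
    for r in eca_run(start, 110, 8):
        print("".join(".#"[cell] for cell in r))
```

All cells read the old row before any cell is written, which is what "updated synchronously" means here.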
Behavioural classes of CA
Wolfram’s classes of behaviour:
Class I: Systems evolve into a stable state.
Class II: Systems evolve into a periodic (e.g. fractal) state.
Class III: Systems evolve into random-looking states.
Class IV: Systems evolve into localised complex structures, e.g. Rule 110 or the Game of Life.
[Wolfram, (1994)]
Block Decomposition method (BDM)
The Block Decomposition method uses the Coding Theorem method. Formally, we will say that an object c has complexity:
$K^{\log}_{m,2D_{d \times d}}(c) = \sum_{(r_u, n_u) \in c_{d \times d}} (n_u - 1) \log_2(K_{m,2D}(r_u)) + K_{m,2D}(r_u)$ (2)
where $c_{d \times d}$ represents the set with elements $(r_u, n_u)$, obtained from decomposing the object into blocks of d × d with boundary conditions. In each $(r_u, n_u)$ pair, $r_u$ is one such square and $n_u$ its multiplicity.
[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
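The decomposition step can be sketched for the one-dimensional case. The code below follows equation (2) as printed on the slide, reading the sum as covering both terms, and makes several assumptions: blocks are non-overlapping, a trailing remainder shorter than one block is dropped, and the Coding Theorem values are supplied by the caller. The function name `bdm_1d` and the numbers in `toy_ctm` are made up for illustration and are not real CTM estimates.

```python
from collections import Counter
from math import log2


def bdm_1d(sequence, block_size, ctm_values):
    """Sum (n_u - 1) * log2(K(r_u)) + K(r_u) over the distinct blocks r_u of `sequence`.

    `ctm_values` maps each block to its Coding Theorem complexity estimate.
    """
    usable = len(sequence) - len(sequence) % block_size  # drop an incomplete last block
    blocks = [sequence[i:i + block_size] for i in range(0, usable, block_size)]
    multiplicity = Counter(blocks)  # r_u -> n_u
    return sum((n_u - 1) * log2(ctm_values[r_u]) + ctm_values[r_u]
               for r_u, n_u in multiplicity.items())


# Hypothetical lookup table: placeholder numbers, NOT real CTM values.
toy_ctm = {"00": 2.0, "01": 4.0}

if __name__ == "__main__":
    # Blocks: "00", "00", "01"
    print(bdm_1d("000001", 2, toy_ctm))
```

Each block is looked up once and only its multiplicity is counted afterwards, so the cost is one pass over the object plus table lookups.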
Classification of ECA by BDM versus lossless compression
Compressors have limitations (small sequences, time complexity)
Applications to machine learning
Problems of classification and clustering
BDM is computationally efficient (runs in $O(n^d)$ time, hence linear (d = 1) time for sequences)
[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
Asymptotic behaviour of complex systems
[Zenil, Complex Systems (2010)]
Rule space of 3-symbol 1D CA
[Zenil, Complex Systems (2011)]
Phase transition detection
Definition
$c^n_t = \frac{|C(M_t(i_1)) - C(M_t(i_2))| + \cdots + |C(M_t(i_{n-1})) - C(M_t(i_n))|}{t(n-1)}$
[Zenil, Complex Systems (2011)]
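The coefficient can be computed for any system whose output can be written as a string. This is a sketch under stated assumptions: C is taken to be the zlib-compressed length, $M_t(i_k)$ is the output of the system after t steps from initial condition $i_k$, and the system is passed in as a function `evolve(initial, t)`. The two toy systems and all function names are hypothetical, chosen only to exercise the formula.

```python
import zlib
from typing import Callable, Sequence


def transition_coefficient(evolve: Callable[[str, int], str],
                           initial_conditions: Sequence[str],
                           t: int) -> float:
    """Sum of |C(M_t(i_k)) - C(M_t(i_{k+1}))| over consecutive inputs, divided by t(n-1)."""
    n = len(initial_conditions)
    if n < 2 or t < 1:
        raise ValueError("need at least two initial conditions and t >= 1")
    sizes = [len(zlib.compress(evolve(init, t).encode("ascii"), 9))
             for init in initial_conditions]
    total = sum(abs(a - b) for a, b in zip(sizes, sizes[1:]))
    return total / (t * (n - 1))


def constant_system(initial: str, t: int) -> str:
    """Toy system that ignores its input: t rows of zeros."""
    return "0" * (t * len(initial))


def echo_system(initial: str, t: int) -> str:
    """Toy system that repeats its input unchanged for t rows."""
    return initial * t


if __name__ == "__main__":
    inputs = ["0001", "0010", "0011", "0100"]
    print("constant:", transition_coefficient(constant_system, inputs, 50))
    print("echo:    ", transition_coefficient(echo_system, inputs, 50))
```

A system that ignores its input gives exactly 0, since every output compresses to the same size; how the initial conditions should be ordered is left open here.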
A measure of programmability
$C^n_t(M) = \frac{\partial f(c^n_t)}{\partial t}$ (3)
[Zenil, Complex Systems (2011)]
Examples
Figure: ECA Rule 4 has a low $C^n_t$ for randomly chosen n and t (it doesn’t react much to external stimuli). $\lim_{n,t \to \infty} C^n_t(R4) = 0$
[H. Zenil, Philosophy & Technology, (2013)]
Examples (cont.)
Figure: ECA R110 has a large coefficient $C^n_t$ value for sensible choices of t and n, which is compatible with the fact that it has been proven to be capable of universal computation (for particular semi-periodic initial configurations). $\lim_{n,t \to \infty} C^n_t(R110) = 1$
Classification of graphs
[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]
Characterisation of complex networks
Complex networks built with preferential attachment algorithms preserve properties that are invariant under network size (connectedness, robustness) at a low cost (unlike random networks, which are costly in the number of links).
[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]
Biological case study: Programmable Porphyrin molecules
Much about the dynamics of these molecules is known, so one can perform Monte Carlo simulations based on these mathematical models and establish a correspondence between Wang tiles and simple molecules.
[joint work with ICOS, U. of Nottingham] [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
Quantitative dynamics of living systems
Aggregations with similar Kolmogorov complexity cluster in similar configurations.
[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
Mapping output behaviour to external stimuli: Parameter discovery
Parameter Space P → Target Space T
Target space T: set a configuration from P that triggers the desired behaviour in T.
To investigate:
Reduction of the parameter space
Characterisation of the target space
[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
Robustness and pervasiveness
Concentration changes preserving behaviour:
Output parameters that have the highest impact can be tested in silico before experiments in materio.
[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
Orthogonality
Specific concentrations that produce a certain behaviour according to the mathematical model, to be tested against empirical data.
Highlights and goals
Ultimate goal (a few years’ time): An information-theoretical toolbox for systems and synthetic biology
[Complex3D Proteins Database (graph representation) & Z Chen et al. Lung cancer pathways in response to treatments.]
Pushing boundaries.
A cutting-edge mathematical approach
Tools from Complexity theory.
New Generation Sequence data analysis
Heavily driven by:
Explosion of experimental data
Difficulties in data interpretation
New paradigms for knowledge extraction
Data mining the behaviour of natural systems
Towards an AIT tool-kit for systems biology, a functional library of programmable biological modules with an SBML interface.
J.-P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in Cristian Calude (ed.), Randomness and Complexity: From Leibniz to Chaitin, World Scientific, 2007.
J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity of Short Strings, Applied Mathematics and Computation, 2011.
H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, arXiv:1212.6745 [cs.CC]
F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Numerical Algorithms (in 2nd revision)
F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Calculating Kolmogorov Complexity from the Frequency Output Distributions of Small Turing Machines, arXiv:1211.1302 [cs.IT]
H. Zenil, Compression-based Investigation of the Dynamical Properties of Cellular Automata and Other Systems, Complex Systems, Vol. 19, No. 1, pages 1-28, 2010.
H. Zenil and J.A.R. Marshall, Some Aspects of Computation Essential to Evolution and Life, Ubiquity, 2012.
H. Zenil, What is Nature-like Computation? A Behavioural Approach and a Notion of Programmability, Philosophy & Technology (special issue on History and Philosophy of Computing), 2013.
H. Zenil, On the Dynamic Qualitative Behavior of Universal Computation, Complex Systems, vol. 20, No. 3, pp. 265-278, 2012.
H. Zenil, A Turing Test-Inspired Approach to Natural Computation, in G. Primiero and L. De Mol (eds.), Turing in Context II (Brussels, 10-12 October 2012), Historical and Contemporary Research in Logic, Computing Machinery and Artificial Intelligence, Proceedings published by the Royal Flemish Academy of Belgium for Science and Arts, 2013.
G.J. Chaitin, A Theory of Program Size Formally Identical to Information Theory, J. Assoc. Comput. Mach. 22, 329-340, 1975.
A. N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission, 1(1):1–7, 1965.
L. Levin, Laws of information conservation (non-growth) and aspects of the foundation of probability theory, Problems of Information Transmission, 10(3):206–210, 1974.
M. Li, P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 3rd. ed., 2008.
R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and 2, Information and Control, 7:1–22 and 224–254, 1964.
S. Wolfram, A New Kind of Science, Wolfram Media, 2002.