algorithmic information theory and computational biology

Post on 05-Dec-2014


DESCRIPTION

I present cutting-edge concepts and tools drawn from algorithmic information theory (AIT) for new-generation genetic sequencing, network biology and bioinformatics in general. AIT is the most advanced mathematical theory of information, formally characterising the concepts of simplicity, randomness and structure and the differences between them. Measures from AIT will equip computational medicine and systems biology to deal with big data, through sophisticated analytics and a powerful new framework for understanding.

TRANSCRIPT

Algorithmic Information Theory and Computational Biology

Hector Zenil

Unit of Computational Medicine, Karolinska Institutet

Sweden

Hector Zenil AIT Tools for Biology and Medicine

Complex Adaptive Systems (CAS)


Complexity is hard to quantify in biology

Mapping quantitative stimuli to qualitative behaviour


Information Theory in Biology

Sequence alignment

Pattern recognition

Sequence logos

Binding site detection

Motif detection

Consensus sequences

Biological significance

[based on Claude Shannon’s Information Theory, 1948]

Algorithmic Information Theory

Which sequence looks more random?

(a) AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

(b) AGGTCGTGAAGTGCGATGGCCTTACGTAGC

(c) GCGCGCGCGCGCGCGCGCGCGCGCGCGC

Classical probability theory vs. Kolmogorov Complexity

Definition

$K_U(s) = \min\{|p| : U(p) = s\}$ (1)

where U is a universal Turing machine and |p| is the length of program p in bits.

Compressibility

A sequence s is c-compressible if its shortest program p satisfies $|p| + c \le |s|$; such a sequence has low Kolmogorov complexity. A sequence is random if $K(s) \approx |s|$.

[Kolmogorov (1965); Chaitin (1966)]

Examples

Example 1

Sequences like (a) have low algorithmic complexity because they allow a short description. For example, a run of 20 As is described by “20 times A”. No matter how long (a) grows, the length of the description (“k times A”) grows only by about $\log_2(k)$.

Example 2

The sequence (b) is algorithmically random because it doesn’t seem to allow a description (much) shorter than (b) itself.

For sequence (a), for example, a proof of non-randomness amounts to exhibiting a short program. Compressibility is therefore a sufficient test of non-randomness.
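The point that compressibility is a sufficient test of non-randomness can be tried out with an off-the-shelf compressor. The sketch below is only an illustration: it assumes zlib as a stand-in compressor (it gives an upper bound on K up to a constant, never K itself), and it uses longer sequences of 3000 letters with the same structure as (a), (b) and (c), because general-purpose compressors carry fixed overhead and are unreliable on strings as short as the ones on the slide. The helper name `compressed_bits` is hypothetical.

```python
import random
import zlib


def compressed_bits(sequence: str) -> int:
    """Size in bits of the zlib-compressed sequence (an upper bound, not K)."""
    return 8 * len(zlib.compress(sequence.encode("ascii"), 9))


n = 3000  # assumed length; long enough for the compressor overhead not to dominate
seq_a = "A" * n                      # same structure as (a)
seq_c = "GC" * (n // 2)              # same structure as (c)
rng = random.Random(0)               # fixed seed so the run is repeatable
seq_b = "".join(rng.choice("ACGT") for _ in range(n))  # pseudo-random, like (b)

for name, seq in [("a", seq_a), ("b", seq_b), ("c", seq_c)]:
    print(name, compressed_bits(seq), "bits for", len(seq), "letters")
```

A small compressed size proves the sequence is not random; a large one proves nothing, since a better compressor might still find structure.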


Example of an evaluation of K

The sequence (c) GCGCGC...GC is not algorithmically random (it has low K complexity) because it can be produced by the following program (take G=0 and C=1):

Program A(i):
1: n := 0
2: Print n mod 2
3: n := n + 1
4: If n = i Goto 6
5: Goto 2
6: End

The length of A (in bits) is an upper bound of K(GCGCGC...GC).
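Program A can be transcribed almost line by line. The sketch below is one possible Python rendering, assuming i ≥ 1 (the goto version above never halts for i = 0, whereas this one returns an empty string); the function name `program_a` is hypothetical.

```python
def program_a(i: int) -> str:
    """Emit n mod 2 for n = 0 .. i-1, writing 0 as G and 1 as C."""
    letters = "GC"  # index 0 -> G, index 1 -> C
    return "".join(letters[n % 2] for n in range(i))


print(program_a(28))
```

The program text stays the same size whatever i is; only the encoding of i grows, by about log2(i) bits.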


The ultimate measure of pattern detection and optimal prediction

Kolmogorov and Chaitin, Schnorr, and Martin-Löf independently provided 3 different approaches to randomness (compression, predictability and typicality).

They proved (for infinite sequences):

incompressibility ⇐⇒ unpredictability ⇐⇒ typicality

When this happens in mathematics, a concept (here, randomness) has been objectively captured.

This is why prediction in biology is hard. AIT tells us that no effective statistical test can recognise all patterns and no computable technique can fully predict all outcomes. The problem is deeply connected to computability and algorithmic information theory.

[Solomonoff (1964); Kolmogorov (1965); Chaitin (1969)]

Information distances and similarity metrics

Measures waiting to be introduced in bioinformatics

Information Distance: $ID(x, y) = \max\{K(x|y), K(y|x)\}$

Universal Similarity Metric: $USM(x, y) = \max\{K(x|y), K(y|x)\} / \max\{K(x), K(y)\}$

Normalised Information Distance and its compression-based form, the Normalised Compression Distance (NCD): $NCD(x, y) = (K(xy) - \min\{K(x), K(y)\}) / \max\{K(x), K(y)\}$

Normalized Compression Measure (NCM): $NC(s) = K(s)/|s|$ (asymptotic behaviour)

Bennett’s Logical Depth: $LD_d(s) = \min\{t(p) : (|p| - |p^*| < d) \text{ and } (U(p) = s)\}$

(for an example application see Zenil, Complexity 2011)
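Since K is uncomputable, the NCD formula above is used in practice with a real compressor in place of K. The following is a minimal sketch under that assumption, with zlib as the compressor; the helper names (`c`, `ncd`, `random_dna`) and the test sequences are hypothetical and the numbers it prints depend on the compressor chosen.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes; stands in for K in the NCD formula."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """(C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(1)  # fixed seed so the run is repeatable


def random_dna(n: int) -> bytes:
    return bytes(rng.choice(b"ACGT") for _ in range(n))


x = random_dna(2000)
y = random_dna(2000)              # unrelated sequence
mutated = bytearray(x)
for pos in rng.sample(range(len(x)), 20):
    mutated[pos] = rng.choice(b"ACGT")  # a few point substitutions
mutated = bytes(mutated)

print("x vs x      ", round(ncd(x, x), 3))
print("x vs mutated", round(ncd(x, mutated), 3))
print("x vs y      ", round(ncd(x, y), 3))
```

Values near 0 indicate that one sequence is largely described by the other; values near 1 indicate little shared information.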


Non-systematic but successful attempts in biology

GenCompress is a compression algorithm for DNA sequences, used with the distance: $d(x, y) = 1 - (K(x) - K(x|y)) / K(xy)$

NCD applied to genetic similarity:

AIT looks at the genome as information, not as data (letters). Counting: traditional Shannon-entropy style sequencing. Interpreting: AIT. The full power of the theory hasn’t yet been unleashed.

To be or not to be...

Borel’s “Infinite Monkey” theorem

[Figure: labels recovered from the slide: Input; 0; 1; ∞; 1024; Syntax error; “To be or not to be, that is the question.”; CH3; √2; π]

Algorithmic probability


Producing π

This C-language code produces the first 1000 digits of π (Gjerrit Meinsma):

long k = 4e3, p, a[337], q, t = 1e3;
main(j) { for (; a[j = q = 0] += 2, --k;)
  for (p = 1 + 2 * k; j < 337; q = a[j] * k + q % p * t, a[j++] = q / p)
    k != j > 2 ? : printf("%.3d", a[j - 2] % t + q / p / t); }

Producing non-random sequences:

If an object has low Kolmogorov complexity, then it has a short description and a greater probability of being produced by a random program. The less random a string, the more likely it is to be produced by a short program.
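The link between short descriptions and high probability of being produced by a random program is usually stated as the coding theorem. The form below is taken from the standard literature (Levin, 1974; Li and Vitányi) rather than from the slide, and it assumes U is a prefix-free universal machine; m(s) denotes the algorithmic probability of s.

```latex
m(s) = \sum_{p \,:\, U(p) = s} 2^{-|p|},
\qquad
K_U(s) = -\log_2 m(s) + O(1)
```

Read left to right: every program p that outputs s contributes 2^{-|p|}, so short programs dominate the sum, and a string with a high m(s) must have a low K.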


Biological Big Data Analysis

The information bottleneck:

Small Data matters: Local measurements of information content are a good indication of the global information content of an object. Evidence: BDM image classification. Compression works at large scales looking for long regularities, while BDM is very local.

Yet both yield astonishingly similar results for these object sizes.

Complementary methods for different sequence lengths

The methods to approximate K coexist and complement each other for different sequence lengths.

Method                         short strings   long strings   scalability
                               (< 100 bits)    (> 100 bits)
Lossless compression method         ×               √              √
Coding Theorem method               √               ×              ×
Block Decomposition method          √               √              √

[Zenil, Soler, Delahaye, Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility (2012)]

Coding Theorem method and lossless compression

The transition between one method and the other. What is complex for the Coding Theorem method is less compressible.

[Soler, Zenil, Delahaye, Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures (2012)]

Online Algorithmic Complexity Calculator

Provides: Shannon’s entropy, lossless compression (Deflate) values, Kolmogorov complexity approximations and relative frequency order (algorithmic probability).

A Mathematica API and an R module.

Datasets available online at the Dataverse Network.

Basic data analysis tool for short sequence comparison.

[http://www.complexitycalculator.com]
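Two of the quantities listed for the calculator, Shannon's entropy and a Deflate-based compressed size, are easy to compute locally. The sketch below is not the calculator's API; it is a minimal stand-alone illustration with hypothetical function names, using Python's zlib (which wraps a Deflate stream in a small header and checksum).

```python
import math
import zlib
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Entropy in bits per symbol of the symbol frequencies in the sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


def deflate_bytes(sequence: str) -> int:
    """Size in bytes of the zlib (Deflate) compressed sequence."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


for s in ["A" * 28, "GC" * 14, "AGGTCGTGAAGTGCGATGGCCTTACGTAGC"]:
    print(s, shannon_entropy(s), deflate_bytes(s))
```

Entropy only counts symbols, so "GC" repeated and any shuffled string with equally many G and C get the same value; that is the gap the algorithmic measures are meant to close.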


Online Algorithmic Complexity Calculator 2


Simulation of natural systems w/complex symbolic systems

An elementary cellular automaton (ECA) is defined by a local function $f : \{0, 1\}^3 \to \{0, 1\}$.

$f$ maps the state of a cell and its two immediate neighbours (range = 1) to a new cell state: $f_t : (r_{-1}, r_0, r_{+1}) \to r_0$. Cells are updated synchronously according to $f$ over all cells in a row.
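The local rule f can be encoded in Wolfram's usual way, as the 8 bits of a rule number indexed by the neighbourhood read as a 3-bit integer. The sketch below assumes that encoding and, as a further assumption not stated on the slide, periodic boundary conditions; the function names are hypothetical.

```python
def eca_step(cells, rule):
    """Apply an ECA rule once to a row of 0/1 cells (periodic boundaries)."""
    n = len(cells)
    new_row = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        neighbourhood = (left << 2) | (centre << 1) | right  # value 0..7
        new_row.append((rule >> neighbourhood) & 1)          # look up that bit of the rule
    return new_row


def eca_run(cells, rule, steps):
    """Return the initial row followed by `steps` synchronous updates."""
    rows = [list(cells)]
    for _ in range(steps):
        rows.append(eca_step(rows[-1], rule))
    return rows


start = [0] * 15 + [1] + [0] * 15
for row in eca_run(start, 110, 12):
    print("".join(".#"[c] for c in row))
```

Because every new cell is computed from the previous row only, all cells are updated synchronously, as the definition requires.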

[Wolfram, (1994)]


Behavioural classes of CA

Wolfram’s classes of behaviour:

Class I: Systems evolve into a stable state.

Class II: Systems evolve in a periodic (e.g. fractal) state.

Class III: Systems evolve into random-looking states.

Class IV: Systems evolve into localised complex structures, e.g. Rule 110 or the Game of Life.

[Wolfram, (1994)]


Block Decomposition method (BDM)

The Block Decomposition method uses the Coding Theorem method. Formally, we will say that an object c has complexity:

$K^{\log}_{m,2D_{d \times d}}(c) = \sum_{(r_u, n_u) \in c_{d \times d}} \left[ (n_u - 1) \log_2\big(K_{m,2D}(r_u)\big) + K_{m,2D}(r_u) \right]$ (2)

where $c_{d \times d}$ represents the set with elements $(r_u, n_u)$, obtained from decomposing the object into blocks of $d \times d$ with boundary conditions. In each $(r_u, n_u)$ pair, $r_u$ is one such square and $n_u$ its multiplicity.

[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
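To make equation (2) concrete, the sketch below applies it as written on the slide to a one-dimensional binary string cut into non-overlapping blocks. Several things here are assumptions for illustration only: the function name `bdm_1d`, the choice to drop a trailing fragment shorter than the block size, and above all the table `PLACEHOLDER_CTM`, whose values are made up and are not real Coding Theorem method estimates.

```python
import math
from collections import Counter

# Made-up placeholder values, NOT real CTM estimates; a real table would be
# loaded from the published Coding Theorem method datasets.
PLACEHOLDER_CTM = {"00": 2.0, "11": 2.0, "01": 3.0, "10": 3.0}


def bdm_1d(sequence: str, block_size: int, ctm: dict) -> float:
    """Sum of (n_u - 1) * log2(K(r_u)) + K(r_u) over distinct blocks r_u."""
    blocks = [
        sequence[i:i + block_size]
        for i in range(0, len(sequence) - block_size + 1, block_size)
    ]  # non-overlapping blocks; a shorter trailing fragment is ignored
    multiplicity = Counter(blocks)  # maps each block r_u to its count n_u
    return sum(
        (n_u - 1) * math.log2(ctm[r_u]) + ctm[r_u]
        for r_u, n_u in multiplicity.items()
    )


print(bdm_1d("0000", 2, PLACEHOLDER_CTM))
print(bdm_1d("0001", 2, PLACEHOLDER_CTM))
```

Each distinct block is looked up once and repeats only add a logarithmic term, which is why a single pass over the blocks is enough.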


Classification of ECA by BDM versus lossless compression

Compressors have limitations (small sequences, time complexity)

Applications to machine learning

Problems of classification and clustering

BDM is computationally efficient (runs in $O(n^d)$ time, hence linear time ($d = 1$) for sequences)

[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]


Asymptotic behaviour of complex systems

[Zenil, Complex Systems (2010)]


Rule space of 3-symbol 1D CA

[Zenil, Complex Systems (2011)]

Phase transition detection

Definition

$c^n_t = \frac{|C(M_t(i_1)) - C(M_t(i_2))| + \dots + |C(M_t(i_{n-1})) - C(M_t(i_n))|}{t(n-1)}$

[Zenil, Complex Systems (2011)]
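The transition coefficient in the definition above can be estimated numerically once M, C and the initial conditions are fixed. The sketch below makes the following assumptions, none of which comes from the slide: M is an elementary cellular automaton with periodic boundaries, C is the zlib-compressed size of the space-time diagram, and the initial conditions i_1, ..., i_n are the binary expansions of 1, ..., n. All function names are hypothetical.

```python
import zlib


def eca_evolve(rule, cells, t):
    """Rows of an ECA run for t steps from `cells` (periodic boundaries)."""
    rows = [list(cells)]
    for _ in range(t):
        prev, n = rows[-1], len(rows[-1])
        rows.append([
            (rule >> ((prev[(i - 1) % n] << 2) | (prev[i] << 1) | prev[(i + 1) % n])) & 1
            for i in range(n)
        ])
    return rows


def compressed_size(rows):
    """C(.): zlib-compressed size in bytes of the space-time diagram."""
    text = "\n".join("".join(map(str, row)) for row in rows)
    return len(zlib.compress(text.encode("ascii"), 9))


def initial_condition(index, width):
    """Binary expansion of `index`, most significant bit first."""
    return [(index >> k) & 1 for k in reversed(range(width))]


def transition_coefficient(rule, initial_conditions, t):
    """Sum of |C(M_t(i_k)) - C(M_t(i_k+1))| divided by t(n - 1); needs n >= 2."""
    sizes = [compressed_size(eca_evolve(rule, ic, t)) for ic in initial_conditions]
    total = sum(abs(a - b) for a, b in zip(sizes, sizes[1:]))
    return total / (t * (len(sizes) - 1))


inputs = [initial_condition(k, 16) for k in range(1, 9)]
for rule in (4, 110):
    print(rule, transition_coefficient(rule, inputs, 20))
```

A system whose compressed output barely changes from one initial condition to the next gets a coefficient near zero under this estimate.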

A measure of programmability

$C^n_t(M) = \frac{\partial f(c^n_t)}{\partial t}$ (3)

[Zenil, Complex Systems (2011)]

Examples

Figure: ECA Rule 4 has a low $C^n_t$ for randomly chosen n and t (it doesn’t react much to external stimuli). $\lim_{n,t \to \infty} C^n_t(R4) = 0$

[H. Zenil, Philosophy & Technology, (2013)]

Examples (cont.)

Figure: ECA R110 has a large coefficient $C^n_t$ for sensible choices of t and n, which is compatible with the fact that it has been proven to be capable of universal computation (for particular semi-periodic initial configurations). $\lim_{n,t \to \infty} C^n_t(R110) = 1$

Classification of graphs

[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]

Characterisation of complex networks

Complex networks built with preferential attachment algorithms preserve properties that are invariant under network size (connectedness, robustness) at a low cost (unlike random networks, which are costly in the number of links).

[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]

Biological case study: Programmable Porphyrin molecules

Much about the dynamics of these molecules is known, so one can perform Monte Carlo simulations based on these mathematical models and establish a correspondence between Wang tiles and simple molecules.

[joint work with ICOS, U. of Nottingham] [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]

Quantitative dynamics of living systems

Aggregations with similar Kolmogorov complexity cluster in similar configurations.

[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]

Mapping output behaviour to external stimuli: Parameter discovery

Parameter Space P → Target Space T

Target space T: Set a configuration from P that triggers the desired behaviour in T.

To investigate:

Reduction of the parameter space

Characterisation of the target space

[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]

Robustness and pervasiveness

Concentration changes preserving behaviour:

Output parameters that have the highest impact can be tested in silico before experiments in materio.

[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]

Orthogonality

Specific concentrations producing certain behaviour, obtained using the mathematical model, to be tested against empirical data.

Highlights and goals

Ultimate goal (a few years’ time): An information-theoretical toolbox for systems and synthetic biology

[Complex3D Proteins Database (graph representation) & Z Chen et al. Lung cancer pathways in response to treatments.]

Pushing boundaries.

A cutting-edge mathematical approach

Tools from Complexity theory.

New Generation Sequence data analysis

Heavily driven by:

Explosion of experimental data

Difficulties in data interpretation

New paradigms for knowledge extraction

Data mining the behaviour of natural systems

Towards an AIT tool-kit for systems biology, a functional library of programmable biological modules with an SBML interface.

J.-P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in Cristian Calude (ed.), Complexity and Randomness: From Leibniz to Chaitin, World Scientific, 2007.

J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity of Short Strings, Applied Mathematics and Computation, 2011.

H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, arXiv:1212.6745 [cs.CC]

F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Numerical Algorithms (in 2nd revision)

F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Calculating Kolmogorov Complexity from the Frequency Output Distributions of Small Turing Machines, arXiv:1211.1302 [cs.IT]

H. Zenil, Compression-based Investigation of the Dynamical Properties of Cellular Automata and Other Systems, Complex Systems, Vol. 19, No. 1, pages 1-28, 2010.

H. Zenil and J.A.R. Marshall, Some Aspects of Computation Essential to Evolution and Life, Ubiquity, 2012.

H. Zenil, What is Nature-like Computation? A Behavioural Approach and a Notion of Programmability, Philosophy & Technology (special issue on History and Philosophy of Computing), 2013.

H. Zenil, On the Dynamic Qualitative Behavior of Universal Computation, Complex Systems, Vol. 20, No. 3, pages 265-278, 2012.

H. Zenil, A Turing Test-Inspired Approach to Natural Computation, in G. Primiero and L. De Mol (eds.), Turing in Context II (Brussels, 10-12 October 2012), Historical and Contemporary Research in Logic, Computing Machinery and Artificial Intelligence, Proceedings published by the Royal Flemish Academy of Belgium for Science and Arts, 2013.

G.J. Chaitin, A Theory of Program Size Formally Identical to Information Theory, J. Assoc. Comput. Mach. 22, 329-340, 1975.

A. N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission, 1(1):1–7, 1965.

L. Levin, Laws of information conservation (non-growth) and aspects of the foundation of probability theory, Problems of Information Transmission, 10(3):206–210, 1974.

M. Li, P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 3rd ed., 2008.

R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and 2, Information and Control, 7:1–22 and 224–254, 1964.

S. Wolfram, A New Kind of Science, Wolfram Media, 2002.