Problems for Effective Analysis of Biological Data: Discrete Transforms on Symbolic Sequences for String-Matching, Pattern-Recognition and Grammar Detection

C. Karanikas, N. Atreas, P. Polychronidou, A. Bakalakos
Department of Informatics, Aristotle University of Thessaloniki

TRANSCRIPT

Page 1: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Problems for Effective Analysis of Biological Data:

Discrete Transforms on Symbolic Sequences for String-Matching, Pattern-Recognition and Grammar Detection

C. Karanikas, N. Atreas,

P. Polychronidou, A. Bakalakos

Department of Informatics

Aristotle University of Thessaloniki

Page 2: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Abstract

We draw our inspiration from the basic operations that nature performs on biological sequences (replication, dilation, translation, splicing, etc.).

We introduce a variety of new discrete (linear or non-linear) invertible transforms on symbolic sequences:

– The Cyclic Class Transform
– The Stern-Brocot Transform
– The Generalisation of the Haar Transform
– The Haar-Riesz Product

Our main targets with these transforms are:

– to encode and decode local information on strings;
– to perform fast non-exact string matching and pattern recognition;
– to identify some of the grammatical rules of string collections.

Thus we deal with the notion of similarity and with distances such as the edit distance, i.e. distances measuring the number of delete, insert and substitute operations required to make two strings identical.

Page 3: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

New Era in Security-related Research

– Global security depends on information security.
– The mobility, heterogeneity, size and complexity of information systems make it difficult to detect, prevent, respond to and overcome security threats.
– We deal with a fundamental problem: detect, recognize, interpret and ultimately assign a meaning to symbolic information such as:
   – biological/biometric data
   – intelligence information
   – any other form of information
– We formulate the problem in mathematical/information-theoretic terms: given a pattern and a string, find parts of the string similar to the pattern.
– Or, find strings that are identical.
– Or, find strings that are similar.
– Also, make the algorithms work in a “noisy” environment (i.e. “forgive” a few accidental errors in a big string).
– Also, find the underlying grammar of the string.

We need new mathematical tools for effective analysis of data.

Page 4: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The human genome seems infinite. It is compressed information (3,000,000,000 bases: A, C, G, T).

Page 5: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Sample of an RNA

CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT

Page 6: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

3D-Simulation of a protein

Each protein can be written in 3D as a string in an alphabet of 20 letters.

Page 7: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

A protein

Page 8: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Coding using Prime Numbers

Let $\{x_1, \dots, x_n\}$ be a symbolic sequence in an alphabet $\{\varepsilon_1, \dots, \varepsilon_k\}$ (w.l.o.g. take the $\varepsilon_j$ to be primes). Let $p_1, \dots, p_n, \dots$ be an increasing sequence of primes.

The map $\{x_1, \dots, x_n\} \mapsto \sum_i x_i \,(p_i / p_{i+1})$ provides an invertible coding for the collection of all symbolic sequences (as above).

Similar maps: $\{x_1, \dots, x_n\} \mapsto \sum_i (1 - x_i / p_i)$ or $\{x_1, \dots, x_n\} \mapsto \prod_i (1 - x_i / p_i)$.

Page 9: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The Prime Coding

Page 10: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Decoding

Page 11: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

One dramatic way that mathematics has come in handy for the study of the genome involves the concept of distance

Given a collection of strings (genes/proteins written in an alphabet of four/twenty letters), one can assign a number to each pair of strings in the collection, which tells one how distant or how similar they are.

The Edit (Levenshtein) distance counts the minimum number of Delete, Insert and Substitute operations needed to make two strings identical. E.g. the Edit distance of {1,0,1,1,0,1} and {1,1,1,0,1,0} is 2 (delete the second digit, then insert a 0 at the end).

Page 12: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Levenshtein (or Edit) Distance of {1,1,0,0,1,0,1} and {1,0,0,0,1,0,0,1}
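The edit distance described on the two slides above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not the authors' code; the function name `edit_distance` and the unit cost for each operation are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, unit cost per operation."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


# The two pairs of binary strings quoted on the slides.
d1 = edit_distance([1, 0, 1, 1, 0, 1], [1, 1, 1, 0, 1, 0])
d2 = edit_distance([1, 1, 0, 0, 1, 0, 1], [1, 0, 0, 0, 1, 0, 0, 1])
print(d1, d2)
```

Both pairs evaluate to 2. The table has (len(a)+1)·(len(b)+1) cells, which is the cost that motivates the faster, approximate measures introduced next.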

Page 13: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

A new distance on strings based on their cyclic classes and the distribution of characters

The problem: given two strings, ACCGHAAGGHCC and ACGHHAGGHACC,

replace A → 0, C → 1, H → 2 and G → 3, and consider the corresponding numbers.

Find a new (fast) transform on symbolic sequences, and a measure on them, that is “invariant” under a small number of changes such as insert, delete and replace. Next we provide a new distance suitable for biodata (biological or biometric).

Page 14: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

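Stern-Brocot Transform

Consider the matrices T(0) and T(1) respectively [matrices shown on the slide; not preserved in the transcript].

Each string $\{\varepsilon_1, \dots, \varepsilon_n\}$, where $\varepsilon_i = 0$ or $1$, corresponds to the (dot) product of matrices $T(\varepsilon_1) \cdot T(\varepsilon_2) \cdots T(\varepsilon_n)$.

For example, the string {1,0,1,0,1} → T(1)·T(0)·T(1)·T(0)·T(1) = [matrix shown on the slide; not preserved in the transcript].

The pair {8,13} (the sum of each row) is unique, and so {1,0,1,0,1} ↔ {8,13}.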
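The encoding and its inverse can be sketched as follows. The matrices T(0) and T(1) are not preserved in this transcript, so the code assumes the standard Stern-Brocot generators T(0) = [[1,1],[0,1]] and T(1) = [[1,0],[1,1]]; this choice is an assumption, picked because it reproduces the {1,0,1,0,1} → {8,13} example quoted on the slide. The names `encode` and `decode` are illustrative.

```python
# Assumed generators (not preserved in the transcript).
T = {
    0: ((1, 1), (0, 1)),
    1: ((1, 0), (1, 1)),
}


def matmul(a, b):
    """Product of two 2x2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def encode(bits):
    """Multiply T(e1)...T(en) and return the pair of row sums."""
    m = ((1, 0), (0, 1))
    for bit in bits:
        m = matmul(m, T[bit])
    return (sum(m[0]), sum(m[1]))


def decode(pair):
    """Recover the binary string from its row-sum pair by repeated subtraction."""
    a, b = pair
    bits = []
    while (a, b) != (1, 1):
        if a < 1 or b < 1 or a == b:
            raise ValueError("not the code of a binary string")
        if a > b:   # the leading factor was T(0): (u, v) -> (u + v, v)
            bits.append(0)
            a -= b
        else:       # the leading factor was T(1): (u, v) -> (u, u + v)
            bits.append(1)
            b -= a
    return bits


code = encode([1, 0, 1, 0, 1])
print(code, decode(code))
```

Decoding works like the subtractive Euclidean algorithm: the larger component reveals which matrix was applied first, so the string is read off from left to right.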

Page 15: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Stern-Brocot (1880) Transform for an alphabet of 3 symbols {0,1,2}

Now T(0), T(1) and T(2) are [matrices shown on the slide; not preserved in the transcript].

So we have {1,2,1,0,2} = {5, 31, 17}.

By a simple algorithm we get {5, 31, 17} → {1,2,1,0,2}.

Page 16: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Inspired by the Stern-Brocot transform we develop a measure of similarity for strings

The second string below has two differences from the first one, yet the triples of corresponding numbers are similar. We have used this idea successfully to find similar parts of genes.

{B,A,A,B,A,A,B,G,A,B,G,G,G,B,B} → {0.740721, 0.0476564, 0.211623}
{B,A,B,B,A,A,G,A,B,G,G,G,B,B} → {0.735667, 0.0487892, 0.215544}

Page 17: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

P-adic Dilation and Splicing/Translation operations on matrices and vectors

Definition 1: Let $M_{n,m}$ be the space of all $n \times m$ matrices. We define the p-adic dilation operation $D_p : M_{n,m} \to M_{n,pm}$, $p = 2, 3, \dots$, by

$D_p(M) = \{ M_{i, \lceil j/p \rceil},\ i = 1, \dots, n,\ j = 1, \dots, pm \}$,

where $\lceil x \rceil$ is the smallest integer greater than or equal to the number $x$.
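A small sketch of Definition 1 may help: dilation repeats every column p times. The bracket in the definition is read here as the ceiling function, since only that reading keeps the column index within 1..m for j = 1..pm. The function name `dilate` and the list-of-rows representation are assumptions for illustration.

```python
def dilate(matrix, p):
    """p-adic dilation: entry (i, j) of the result is M[i][ceil(j / p)], 1-indexed."""
    m = len(matrix[0])
    return [
        # (j + p - 1) // p is ceil(j / p) for positive integers; -1 converts to 0-indexing.
        [row[(j + p - 1) // p - 1] for j in range(1, p * m + 1)]
        for row in matrix
    ]


M = [[1, 2],
     [3, 4]]
print(dilate(M, 2))  # each column appears twice
print(dilate(M, 3))  # each column appears three times
```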

Page 18: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Small changes (local differences) in the genetic code provide polymorphism

Page 19: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

In Nature Everything is Unique

The Little Prince is right to say that his rose is unique in all the world

Page 20: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Initial idea for discrete transforms coding local information

A mathematical transform mimicking antigen processing must code local information (as the Haar wavelet transform does), cutting data into pieces (peptides). Note that a protein (a symbolic sequence in an alphabet of 20 letters) is a union of peptides.

Introduce transforms/computational tools for string matching, for pattern recognition of grammars, and for detecting dynamical systems such as hidden Markov processes and Riesz products.

Results and experience from research projects:

– European Project IST-2000-26016, Immunocomputing
– GSRT 2005-6: Mathematics for Bioinformatical applications: Multiresolution methods for the study of biodata
– Greek-Bulgarian project (2006-7) on Application of Wavelet Theory in Bioinformatics

Page 21: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Splicing vectors / vector translations / cyclic classes of numbers

Definition 2: Let $0_m = \{0, 0, \dots, 0\} \in M_{1,m}$. The p-adic vector translations $T_p : M_{1,m} \to M_{1,pm}$, $p = 2, 3, \dots$, are defined by

$T_p(v) = \mathrm{Join}[0_{km}, v, 0_{jm}]$, with $k + j + 1 = p$,

where Join means splicing vectors together.
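Definition 2 can be illustrated with a short sketch: the vector v of length m is placed in one of p consecutive slots of length m, and the remaining slots are filled with zeros. The function name `translate` and the explicit slot index `k` are assumptions for illustration; the definition itself only requires k + j + 1 = p.

```python
def translate(v, p, k):
    """p-adic translation: Join[0_{km}, v, 0_{jm}] with k + j + 1 = p."""
    if not 0 <= k <= p - 1:
        raise ValueError("k must satisfy 0 <= k <= p - 1")
    m = len(v)
    j = p - 1 - k
    return [0] * (k * m) + list(v) + [0] * (j * m)


v = [1, 2]
# All translations of v for p = 3: v occupies slot 0, 1 or 2 of a length-6 vector.
for k in range(3):
    print(k, translate(v, 3, k))
```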

Page 22: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

How to dilate and translate block sub-matrices (case p=2)

Page 23: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The case p=3

Page 24: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Sparse Matrices (SPMM)

The matrices above, called SPMM, are iteratively generated by dilation and translation of block sub-matrices (work by N. Atreas, C. Karanikas and P. Polychronidou).

The determinant of an SPMM is ±1.

The inverse of an SPMM is iteratively generated too.

Page 25: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The Inverse of SPMM, p = 2

Page 26: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Inverse of SPMM, p = 3, n = 1, 2

Page 27: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The Gram-Schmidt orthonormalization process of SPMM matrices for p = 2 gives, for n = 1, 2, 3, the Haar matrices:

Page 28: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

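SPMM and Haar matrices, p = 3, n = 2

[Matrix figure. The entries recovered from the slide form the 9 × 9 Haar matrix below; the minus signs, lost in extraction, are restored from the orthonormality of the rows.]

$$
\begin{pmatrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3\sqrt{2}} & \frac{1}{3\sqrt{2}} & \frac{1}{3\sqrt{2}} & \frac{1}{3\sqrt{2}} & \frac{1}{3\sqrt{2}} & \frac{1}{3\sqrt{2}} & -\frac{\sqrt{2}}{3} & -\frac{\sqrt{2}}{3} & -\frac{\sqrt{2}}{3} \\
\frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{6}} & 0 & 0 & 0 \\
\frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\sqrt{\frac{2}{3}} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\sqrt{\frac{2}{3}} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\sqrt{\frac{2}{3}} \\
\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0
\end{pmatrix}
$$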
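The entries listed for this slide can be checked numerically. The sketch below is an illustration under assumptions: it reads the extracted fragments "13", "132", "23", "16" and "12" as 1/3, 1/(3√2), √2/3 (or √(2/3) in the short rows), 1/√6 and 1/√2, and it chooses the signs, which the transcript does not preserve, so that the rows are orthonormal. The helper names are hypothetical.

```python
import math

a = 1 / 3                    # "13"  read as 1/3
b = 1 / (3 * math.sqrt(2))   # "132" read as 1/(3*sqrt(2))
c = math.sqrt(2) / 3         # "23" in the second row, read as sqrt(2)/3
d = 1 / math.sqrt(6)         # "16"  read as 1/sqrt(6)
e = math.sqrt(2 / 3)         # "23" in the three-entry rows, read as sqrt(2/3)
f = 1 / math.sqrt(2)         # "12"  read as 1/sqrt(2)


def shifted(block, offset, size=9):
    """Place a short block of entries at the given column offset, zeros elsewhere."""
    row = [0.0] * size
    row[offset:offset + len(block)] = block
    return row


# Assumed 9x9 base-3 Haar matrix (p = 3, n = 2); signs are an assumption.
H = (
    [[a] * 9,
     [b] * 6 + [-c] * 3,
     [d] * 3 + [-d] * 3 + [0.0] * 3]
    + [shifted([d, d, -e], 3 * k) for k in range(3)]
    + [shifted([f, -f], 3 * k) for k in range(3)]
)

# Gram matrix H * H^T; it equals the identity exactly when the rows are orthonormal.
G = [[sum(x * y for x, y in zip(r, s)) for s in H] for r in H]
max_err = max(
    abs(G[i][j] - (1.0 if i == j else 0.0)) for i in range(9) for j in range(9)
)
print(max_err)
```

A deviation at the level of floating-point rounding supports this reading of the fragments; it does not prove that the slide used exactly these signs.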

Page 29: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The Haar orthonormal system of $L^2[0,1]$ in base p = 3

Page 30: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Theorem: Given $t = \{t(k),\ k = 1, \dots, p^n\}$, find the coefficients $\{x(k),\ k = 1, \dots, p^n\}$ of the HRP

Theorem: Let $r_j$, $j = 1, \dots, p^n$, be the $j$-th row of the $p^n \times p^n$ Haar matrix, and let $R(p,n) = \prod_k (1 + x(k)\, r_k)$ be its Haar-Riesz product (HRP) with coefficients $\{x(k),\ k = 1, \dots, p^n\}$. If $t = \{t(k),\ k = 1, \dots, p^n\}$ is any non-negative data, the system $R(p,n) \cdot r_j = t \cdot r_j$, $j = 1, \dots, p^n$, has a unique solution with respect to $\{x(k),\ k = 1, \dots, p^n\}$.
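The smallest case, p = 2 and n = 1, can be worked out by hand and checked in code. This is a sketch under assumptions, not the authors' algorithm: the product is taken componentwise, the rows are those of the orthonormal 2 × 2 Haar matrix, and t1 + t2 > 0. Under these assumptions the system reduces to 2(1 + x1/√2) = t1 + t2 and (1 + x1/√2)·x2 = (t1 − t2)/√2, which the hypothetical helper `solve_2` solves in closed form.

```python
import math

S = 1 / math.sqrt(2)
ROWS = [(S, S), (S, -S)]  # rows of the orthonormal 2x2 Haar matrix


def riesz_product(x, rows=ROWS):
    """Componentwise product of the vectors (1 + x[k] * rows[k])."""
    out = [1.0] * len(rows[0])
    for xk, rk in zip(x, rows):
        out = [o * (1 + xk * r) for o, r in zip(out, rk)]
    return out


def solve_2(t):
    """Closed-form coefficients for two data values (assumes t1 + t2 > 0)."""
    t1, t2 = t
    x1 = math.sqrt(2) * ((t1 + t2) / 2 - 1)
    x2 = math.sqrt(2) * (t1 - t2) / (t1 + t2)
    return [x1, x2]


t = (3.0, 1.0)
x = solve_2(t)
R = riesz_product(x)
print(x, R)
```

Because the Haar rows form a basis, matching all inner products R·r_j = t·r_j means the product reproduces the data itself, which is what the check on `R` shows for this example.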

Page 31: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Riesz-Haar coefficients for the data $\{t_1, t_2, t_3, t_4, t_5\}$

Page 32: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Expanding the parentheses of the Haar-Riesz product we get: $\{t_1, t_2, t_3, t_4, t_5\}$

Page 33: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

The system for case p=3 and n=2.

The coefficients of the system are elements of the corresponding matrix.

Page 34: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Find Hidden Markov structure of genes

Page 35: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Application: Given a collection of “Cantor Strings”, the Riesz-Haar coefficients reveal the Cantor Grammar

Input: 729 (3^6) samples of a Cantor collection. Output: the Riesz-Haar coefficients. Observe that the collection is orthogonal to certain rows of the Haar matrix.

Page 36: C. Karanikas, N. Atreas,  P. Polychronidou, A. Bakalakos   Department of Informatics

Thanks for listening