A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes

Post on 14-Dec-2015

Page 1:

A Similar Fragments Merging Approach to Learn Automata on Proteins

Goulven KERBELLEC & François COSTE
IRISA / INRIA Rennes

Page 2:

Outline of the talk

Protein family signatures
Similar Fragment Merging Approach (Protomata-L)
Characterization: Similar Fragment Pairs (SFPs); ordering the SFPs
Generalization: merging of SFPs in an automaton; gap generalization; identification of physico-chemical properties
Experiments

Page 3:

Protein families

Amino acid alphabet: {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

Protein sequence:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…

Protein data set:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…
>AQP2_RAT
MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH…
>AQP3_MOUSE
MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT…

Common function & common topology (3D structure)

Page 4:

Characterization of a protein family

[Figure: zinc finger diagram, a Zn ion bound to two C and two H residues among unspecified x positions]

Zinc Finger Pattern: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

ZBT11 ...Csi..CgrtLpklyslriHmlk..H...
ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...
ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H...
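The zinc finger pattern can be checked mechanically against the three fragments. The following is a minimal sketch, assuming the usual PROSITE conventions (x for any residue, x(n) and x(n,m) for repeats, [..] for a class, {..} for an exclusion). The function name `prosite_to_regex` is hypothetical, and the fragments are the ones from the slide with the gap dots removed and letters upper-cased.

```python
import re

# One pattern element: a residue, x, [class] or {exclusion}, with an optional (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                # wildcard position
            atom = "."
        elif core.startswith("{"):     # {ABC}: any residue except A, B, C
            atom = "[^" + core[1:-1] + "]"
        else:                          # single residue or [ABC] class
            atom = core
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(atom + repeat)
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = re.compile(prosite_to_regex(ZINC_FINGER))

# Fragments from the slide, gap dots removed, upper-cased.
fragments = {
    "ZBT11": "CSICGRTLPKLYSLRIHMLKH",
    "ZBT10": "CDICGKLFTRREHVKRHSLVH",
    "ZBT34": "CKFCGKKYTRKDQLEYHIRGH",
}
matches = {name: bool(regex.fullmatch(seq)) for name, seq in fragments.items()}
print(matches)
```

The sketch does not handle the `*` wildcard that appears in some of the richer pattern classes.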

Page 5:

Expressivity classes of patterns

Class  Example
A      T-C-T-T-G-A
B      D-R-C-C-x(2)-H-D-x-C
C      G-G-G-T-F-[ILV]-[ST]-[ILV]
D      V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]
E      G-C-x(1,3)-C-P-x(8,10)-C-C
F      C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H
G      D-T-A-G-Q-E-*-L-V-G-N-K
H      D-T-A-G-[NQ]-*-L-V-G-N-[KEH]
I      D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]
J      Regular Expression / Automaton

Tools: PROSITE, PRATT, TEIRESIAS, PROTOMATA-L

Page 6:

Characterization

Page 7:

Similar Fragment Pairs

Significantly similar fragment pairs (SFPs)
Natural selection
Characterization of important areas

Data set D:

Page 8:

Ordering the SFPs

Problem:

Solution: order the SFPs by scoring each SFP, S(f1,f2) = ?

3 different scoring functions:
dialign Sd
support Ss
implication Si

Page 9:

Dialign Score

Sd(f1,f2) = -log P(L,Sim)

L = |f1| = |f2|
Sim = sum of the individual similarity values (BLOSUM62 similarity)
P = probability that a random SFP of the same length L has at least the same Sim
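To make the score concrete, here is a small sketch that computes -log P(L,Sim) exactly by convolving the per-position score distribution L times. It rests on assumptions that are not in the slides: P is taken as the probability of reaching at least the observed Sim, residues are drawn independently from a uniform background, and a toy identity similarity (1 for identical residues, 0 otherwise) stands in for BLOSUM62. All function names are illustrative.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def identity_similarity(a, b):
    """Toy stand-in for a BLOSUM62 entry (assumption): 1 if identical, else 0."""
    return 1 if a == b else 0

def pair_score_distribution(freq, sim):
    """Distribution of the similarity of one random residue pair."""
    dist = defaultdict(float)
    for a, pa in freq.items():
        for b, pb in freq.items():
            dist[sim(a, b)] += pa * pb
    return dict(dist)

def tail_probability(length, threshold, single):
    """P(sum of `length` independent pair scores >= threshold)."""
    dist = {0: 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for s, p in dist.items():
            for t, q in single.items():
                nxt[s + t] += p * q
        dist = nxt
    return sum(p for s, p in dist.items() if s >= threshold)

def dialign_like_score(f1, f2, freq, sim=identity_similarity):
    """-log P(L, Sim) for two fragments of equal length."""
    if len(f1) != len(f2):
        raise ValueError("fragments of an SFP must have the same length")
    total = sum(sim(a, b) for a, b in zip(f1, f2))
    single = pair_score_distribution(freq, sim)
    return -math.log(tail_probability(len(f1), total, single))

uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
score = dialign_like_score("EIK", "EVK", uniform)
print(round(score, 3))
```

With a real substitution matrix and measured background frequencies, only `sim` and `freq` would change; the convolution stays the same.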

Page 10:

Support Score

Taking into account the representativeness of an SFP

Ss(f1,f2,D) = number of sequences of D supporting <f1,f2>

<f1,f2> is supported by a fragment f with respect to the triangular inequality between Sd(f,f1) + Sd(f,f2) and Sd(f1,f2)
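The support count can be sketched as follows. The comparison symbol of the inequality did not survive in the transcript, so this sketch assumes Sd(f,f1) + Sd(f,f2) >= Sd(f1,f2); that direction is a guess, not the authors' stated definition. A toy score (number of identical positions) stands in for Sd, and all names are illustrative.

```python
def windows(sequence, length):
    """All contiguous fragments of the given length."""
    return (sequence[i:i + length] for i in range(len(sequence) - length + 1))

def identity_score(f, g):
    """Toy stand-in for Sd (assumption): number of identical positions."""
    return sum(a == b for a, b in zip(f, g))

def supports(sequence, f1, f2, score=identity_score):
    """True if some fragment f of the sequence satisfies the assumed inequality."""
    target = score(f1, f2)
    # ASSUMPTION: the lost comparison symbol is taken to be ">=".
    return any(score(f, f1) + score(f, f2) >= target
               for f in windows(sequence, len(f1)))

def support_score(f1, f2, dataset, score=identity_score):
    """Number of sequences in the data set supporting the pair <f1,f2>."""
    return sum(supports(seq, f1, f2, score) for seq in dataset)

D = ["MASEIKLFW", "MGYEVKYRV", "AAAAAAAAA"]
ss = support_score("EIK", "EVK", D)
print(ss)
```

Under this reading, the two sequences that contain f1 and f2 always support the pair, so the score measures how many further sequences carry a comparable fragment.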

Page 11:

Implication Score

Taking into account a counter-example set N
Discriminative fragments
Lerman index:

Si(f1,f2,D,N) = ( -P(Ss(f1,f2,N)) + P(Ss(f1,f2,D)) x P(N) ) / ( P(Ss(f1,f2,D)) x |N| )

with P(X) = |X| / (|D| + |N|)

Page 12:

Generalization

Page 13:

From protein data sets to automata

MASEIKLFW

M A S E I K L F W

Page 14:

From protein data sets to automata

MASEIKLFW

MGYEVKYRV

M G Y E V K Y R V

M A S E I K L F W

Page 15:

Merging SFPs

MASEIKLFW

MGYEVKYRV

M G Y E V K Y R V

M A S E I K L F W

Page 16:

Merging SFPs

MASEIKLFW
MGYEVKYRV

[Figure: merged automaton] M (A S | G Y) E [I,V] K (L F W | Y R V)

Page 17:

Merging SFPs

MASEIKLFW
MGYEVKYRV

[Figure: merged automaton] M (A S | G Y) E [I,V] K (L F W | Y R V)

New sequences accepted after merging:
MASEVKLFW MGYEIKYRV
MASEIKYRV MGYEVKLFW
MASEVKYRV MGYEIKLFW
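The effect of merging one SFP can be reproduced in a few lines. This is a sketch under stated assumptions: the initial automaton has one linear branch per sequence from a shared start state (which is what the later slides call the MCA), and an SFP is merged by identifying the states around its aligned positions. The names `build_mca`, `merge_sfp` and `language` are hypothetical, and `merge_sfp` only handles a single SFP applied to a fresh MCA.

```python
def build_mca(sequences):
    """One linear branch per sequence, all leaving a shared start state."""
    start = "start"
    transitions, finals = set(), set()
    for k, seq in enumerate(sequences):
        prev = start
        for i, aa in enumerate(seq, start=1):
            state = (k, i)              # state reached after i residues of sequence k
            transitions.add((prev, aa, state))
            prev = state
        finals.add(prev)
    return start, transitions, finals

def merge_sfp(automaton, k1, s1, k2, s2, length):
    """Merge the states around two aligned fragments (0-based start offsets)."""
    start, transitions, finals = automaton

    def state(k, i):
        return start if i == 0 else (k, i)

    # A fragment of n residues spans n + 1 states; map those of f2 onto those of f1.
    rename = {state(k2, s2 + i): state(k1, s1 + i) for i in range(length + 1)}

    def r(q):
        return rename.get(q, q)

    merged = {(r(src), aa, r(dst)) for src, aa, dst in transitions}
    return r(start), merged, {r(q) for q in finals}

def language(automaton):
    """Enumerate accepted sequences (assumes the automaton is acyclic)."""
    start, transitions, finals = automaton
    out = {}
    for src, aa, dst in transitions:
        out.setdefault(src, []).append((aa, dst))
    words, stack = set(), [(start, "")]
    while stack:
        q, prefix = stack.pop()
        if q in finals:
            words.add(prefix)
        for aa, dst in out.get(q, []):
            stack.append((dst, prefix + aa))
    return words

mca = build_mca(["MASEIKLFW", "MGYEVKYRV"])
merged = merge_sfp(mca, 0, 3, 1, 3, 3)   # SFP <EIK, EVK>, both at offset 3
words = language(merged)
print(sorted(words))
```

The merged automaton accepts eight sequences: the two training sequences plus six new ones obtained by recombining prefixes, the [I,V] position and suffixes. Merging fragments of the same sequence could create cycles, which the enumeration above does not handle.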

Page 18:

Protein Sequence Data Set -> List of SFPs -> Ordered List of SFPs
Protein Sequence Data Set -> MCA
MCA + Ordered List of SFPs -> MERGING -> Automaton / Regular Expression

Page 19:

Gap Generalization

Merge non-representative transitions on themselves
Treat them as "gaps"

Page 20:

Identification of Physico-chemical properties

Similar fragments ~ potential functional area
Amino acids sharing the same position
Physico-chemical property at play => generalization from a group (of amino acids) to a Taylor group

[I,V] => [I,L,V] (aliphatic)
C => C
[I,Q,W,P] => X = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} (no information)

Page 21:

Likelihood ratio test

A likelihood ratio test decides whether or not the multi-set A has been generated according to a physico-chemical group G.

Given a threshold, we test the expansion of A to G and reject it when LR_G/A is below the threshold.
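The formula of the ratio is not preserved in the transcript, so the following is only one plausible form, not the authors' definition. It compares the likelihood of the multi-set A when residues are drawn from the group G with its likelihood when they are drawn only from the residues actually observed in A, using background frequencies renormalised inside each set. The uniform background and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def log_likelihood(multiset, group, background):
    """Log-likelihood of the residues when drawn from `group` only,
    with background frequencies renormalised inside the group."""
    total = sum(background[aa] for aa in group)
    return sum(n * math.log(background[aa] / total)
               for aa, n in Counter(multiset).items())

def likelihood_ratio(multiset, group, background):
    """ASSUMED form of LR_G/A: likelihood under G over likelihood under set(A)."""
    observed = set(multiset)
    if not observed <= set(group):
        return 0.0                      # G cannot generate every residue of A
    return math.exp(log_likelihood(multiset, group, background)
                    - log_likelihood(multiset, observed, background))

def accept_expansion(multiset, group, background, threshold):
    """Reject the expansion of A to G when the ratio falls below the threshold."""
    return likelihood_ratio(multiset, group, background) >= threshold

uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
A = ["I", "V", "I", "V"]
aliphatic = {"I", "L", "V"}

lr_aliphatic = likelihood_ratio(A, aliphatic, uniform)
lr_any = likelihood_ratio(A, set(AMINO_ACIDS), uniform)
print(lr_aliphatic, lr_any)
```

With this form the ratio never exceeds 1 and shrinks as more observations accumulate without the missing residues ever showing up, so a large multi-set containing only I and V would eventually stop being expanded to the aliphatic group.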

Page 22:

Experiments

Page 23:

MIP: the Major Intrinsic Protein Family

Family: MIP
Subfamilies: AQP, Glpf, Gla

Page 24:

Data sets

UNIPROT
MIP in SWISS-PROT

Set « T » (159 seq)
Set « M » (44 seq): identity < 90%
Set « W+ » (24 seq)
Set « W- » (16 seq)
Set « C » (49 seq): Blast (1 < e < 100), not MIP
Set « E » (79 seq)
Set « U » (911 seq)

Water-specific

Page 25:

Experiments

First Common Fragment on a Family
MIP family
Positive set
Comparison with pattern discovery tools: Teiresias, Pratt, Protomata-L (short pattern)

Water-specific Characterization
MIP sub-families
Positive and negative sets
Leave-one-out cross-validation
Protomata-L (short to long pattern)

Page 26:

First Common Fragment Automaton

Results of 4 patterns scanned on the Swiss-Prot protein database

Learning set: Set « M » (44 seq)
Target set: Set « T » (159 seq)

Page 27:

From short automata to long automata

Previous experiment
Only the first SFPs of the ordered list of SFPs
Short automaton: first common fragment automaton

Next experiment
Larger cut-offs in the list of SFPs
Protomata-L is able to create longer automata with more common subparts
Long patterns are close to the topology (3D structure) of the family

Page 28:

Water-specific characterization

Leave-one-out cross-validation

Learning set
W+ \ Si : positive learning set
W- \ Sj : negative learning set

Test set: { Si U Sj }

Control set: Set T

Implication score

Set « W+ » (24 seq)
Set « W- » (16 seq)
Set « C » (49 seq)
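The leave-one-out protocol can be sketched as a generator of splits. The slide does not say whether every (Si, Sj) combination is used or whether positives and negatives are left out in some paired order; this sketch assumes all combinations. The function name and dictionary keys are illustrative.

```python
from itertools import product

def leave_one_out_splits(positives, negatives):
    """Yield one split per (Si, Sj) pair: Si is removed from the positive set,
    Sj from the negative set, and both form the test set.
    ASSUMPTION: every combination of one positive and one negative is used."""
    for i, j in product(range(len(positives)), range(len(negatives))):
        yield {
            "positive_learning": positives[:i] + positives[i + 1:],   # W+ \ Si
            "negative_learning": negatives[:j] + negatives[j + 1:],   # W- \ Sj
            "test": [positives[i], negatives[j]],                     # { Si U Sj }
        }

# Placeholder identifiers standing in for real sequences.
w_plus = ["p1", "p2", "p3"]
w_minus = ["n1", "n2"]
splits = list(leave_one_out_splits(w_plus, w_minus))
print(len(splits), splits[0])
```

Under the all-combinations assumption, sets of 24 and 16 sequences would give 24 x 16 = 384 learning rounds.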

Page 29:

Leave-one-out cross-validation

Page 30:

Error Correcting Cost

The error correcting cost of a sequence S represents the distance (BLOSUM similarity) between S and the closest sequence accepted by the automaton A.

Distribution of sequences with long automata (size approx. 100)
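A minimal sketch of such a cost, assuming unit costs for substitution, insertion and deletion instead of the BLOSUM-based distance mentioned on the slide. It runs a shortest-path search over pairs (automaton state, position in S); the function name and the small example automaton are illustrative.

```python
import heapq
from itertools import count

def error_correcting_cost(seq, start, transitions, finals):
    """Smallest number of edits turning `seq` into a sequence accepted by the automaton.
    ASSUMPTION: unit edit costs replace the BLOSUM-based distance."""
    out = {}
    for src, aa, dst in transitions:
        out.setdefault(src, []).append((aa, dst))
    tie = count()                        # tie-breaker so states are never compared
    heap = [(0, next(tie), start, 0)]
    done = set()
    while heap:
        cost, _, q, i = heapq.heappop(heap)
        if (q, i) in done:
            continue
        done.add((q, i))
        if q in finals and i == len(seq):
            return cost
        if i < len(seq):
            # extra residue in seq: skip it without moving in the automaton
            heapq.heappush(heap, (cost + 1, next(tie), q, i + 1))
        for aa, dst in out.get(q, []):
            # residue missing from seq: follow the transition without reading
            heapq.heappush(heap, (cost + 1, next(tie), dst, i))
            if i < len(seq):
                # match (cost 0) or substitution (cost 1)
                heapq.heappush(heap, (cost + (aa != seq[i]), next(tie), dst, i + 1))
    return None                          # no final state is reachable

# Small automaton accepting MASEIK and MASEVK.
transitions = {(0, "M", 1), (1, "A", 2), (2, "S", 3), (3, "E", 4),
               (4, "I", 5), (4, "V", 5), (5, "K", 6)}
costs = {s: error_correcting_cost(s, 0, transitions, {6})
         for s in ["MASEVK", "MASELK", "MASEK", "MASEIKK"]}
print(costs)
```

Replacing the unit costs with values derived from a substitution matrix only changes the three `heappush` increments; the search itself is unchanged and also works on automata with loops.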

Page 31:

Leave-one-out cross-validation
With Error Correcting Cost

Page 32:

Leave-one-out cross-validation

Page 33:

Conclusion & Perspective

Good characterization of protein families using automata (-> HMM structure)
No need for a multiple alignment: greedy data-driven algorithm
Localization of important subparts
Physico-chemical identification and generalization
Counter-example sets
Bringing in knowledge is possible in automata (-> 2D structure)

Page 34:

Questions ?

Page 35:

Demo

Page 36:

Protomata-L's Approach

First Common Fragment

Page 37:

Protomata-L's Approach

To get a more precise automaton

Page 38:

Data set (Protein sequences) -> EXTRACTION -> Pairs of fragments -> SORT
Initial Automaton (MCA) -> MERGING -> IDENTIFICATION OF « GAPS » -> IDENTIFICATION OF PHYSICOCHEMICAL GROUPS

Page 39:

Structural discrimination

Page 40:

Page 41:

Generalization of an Aquaporins automaton

[Figure legend: Aromatic, Hydrophobic, Non-informative]

Page 42:

Physico-chemical properties identification

Likelihood ratio test

Aliphatic, Small, x