geometric crossovers for supervised motif discovery rolv seehuus ntnu

25
Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Post on 21-Dec-2015

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Geometric Crossovers for Supervised Motif DiscoveryRolv Seehuus

NTNU

Page 2: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Motivation and Scope

Try out the applicability of the geometric framework, on a supervised motif discovery problem Compare its merits to a previously used operator.

In practice, we test on a very easy problem that existing software can solve easily

Value as test case Building block for more complex motif discovery

problems, that current algorithms can not solve satisfactory

Page 3: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Motif Discovery

Has become a standard problem in bioinformatics Given a set of sequences, figure out what is

special with it …by eliciting motifs in the dataset Differing by…

Motif model Learning algorithms Scoring functions

Page 4: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

The Standard Approach

Do analysis of the positive set of sequences …background distribution… …information content… …statistical significance…

Report motif

Page 5: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Motif discovery as a classification problem Always at least two datasets: The positive, and “the rest” Choose a negative dataset Report motifs best suited to discriminate No need to learn a background model The statistical significance of the motif can be given Discriminative motif discovery has received increased

attention lately

Page 6: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Classification problem

Protein sequences, from the SwissProt database Classified according to protein family (as

specified in the Prosite database) Selected six families, that previously have been

shown to be hard to classify under similar circumstances.

Some of the families can be said to have an overrepresented motif as the ones we can train on

Page 7: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

The Potential Negative Data Set

Huge, compared to the negative Quite common in bioinformatics, and an interesting

problem to cope with in its own right In field:

randomly generated sequences one set of randomly selected sequences random rearrangement of the positive sequences (data not

shown)

The “best practice” was to select the samples randomly from the negative set each generation, so that their size matches the positive set.

Page 8: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Motif Model

Twenty amino acids Wildcard

C...C.C..CDMEGACGGSCACSTCHVIVDP

Motif match, positive sequence

Page 9: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Operators on Motifs

Unit edit move as mutation Mut(A) = {Insert, Delete or Replace a token}

Substring Swapping Crossover (for comparison) Two-point Geometric Crossover

Page 10: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Geometric Crossover

Search space have a metric Mutation is a move in search space Crossover yield children found on the shortest

path between the parents in search space Successfully applied to other problems

Page 11: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Geometric Crossover for Motifs

View motifs as sequences Basic assumption: The edit distance is a good way

to move around in motif space A crossover based on the edit distance, should

yield a good crossover for motif discovery We (arbitrarily) choose unit costs for insertions,

deletions and substitutions

Page 12: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Sequence Alignment

Alignment: put spaces (-) in both sequences such as they become of the same length

Seq1’= agcacac-a Seq2’= a-cacacta Score: 2 An Optimal alignment is an alignment with minimal score The score of the optimal alignment of two sequences

equals their edit distance There often are multiple optimal alignments

Page 13: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Homologous Crossover

1. Pick an optimal alignment for two parent sequences

2. Generate a crossover mask as long as the alignment

3. Recombine as traditional crossover

4. Remove dashes from offspring

Mask = 1101100Seq1’= BANANA-Seq2’= -ANANASSeqA’= BANANASSeqB’= -ANANA-

Child1’= BANANASChild2’= ANANA

Page 14: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Experiments

Two crossovers with same parameters, and mutation only Ten fold cross validation:

Partitioned datasets in ten pieces Trained on 9/10ths Tested the best motif on the remaining test set

Trained on randomly selected subset of SwissProt Tested on entire SwissProt Fitness: Scaled Pearson correlation of confusion matrix

Page 15: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Dynamic behavior during evolution

Page 16: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Maximum Values

Page 17: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Max

Page 18: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Cytochrome

Include the following fragment of a highly conserved motif: C…CH

Which geometric crossover find While substring swapping finds: CH Conservation of length keeps us in the correct

ballpark CH representa local maximum for substring swap

Page 19: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Ferredoxin Contains the following motif: C..C..C...C[PH] Which Substring Swap finds While Geometric Crossover don’t Conservation of length keeps us from finding the correct motif

Page 20: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Population Means

Page 21: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Means

Page 22: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Classification Performance

SubSwap Geometric MutationName Train Test Train Test Train Test

Carbamoyl 0.99 0.57 0.99 0.56 0.99 0.56Dna J 0.9 0.23 0.91 0.23 0.92 0.23Cytochrome 0.8 0.1 0.98 0.36 0.97 0.36Ferredoxin 1 0.87 0.81 0.13 0.96 0.39HLH 0.5 0.07 0.52 0.06 0.59 0.08EF Hand 0.62 0.15 0.67 0.19 0.66 0.19

Medians, of 10 experiments, for each family

Page 23: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Classification Performance - II

Similar for all operators Maybe a slight advantage, for the geometric

crossover if we have A highly conserved motif exist A “ballpark guess” on motif length

Surprisingly, mutation frequently outperforms the other operators

Page 24: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Concluding remarks

The geometric operator is promising - need work It is more length preserving than substring swap The geometric operator need a good guess on motif

length Edit move might not be optimal for motif discovery?

even though, it for some problems shows merit.

Our initial assumption imply an insertion/deletion equally often as replacement in sequence data we are WAY off on that parameter

Page 25: Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU

Future Work

Synthetic data with known parameters Include character classes and within motif gaps in

representation Modules (composite motifs) Expand to position weight matrixes