detection of spaced motifs using submotif pattern mining

50
Detection of Spaced Motifs using Submotif Pattern Mining S.M. Yiu Department of Computer Science The University of Hong Kong Joint work with Ken Sung (Genome Institute of Singapore and NUS) Edward Wijaya, Rajaraman Kanagasabai (Institute for Infocomm Research) 1

Upload: shaun

Post on 23-Jan-2016

23 views

Category:

Documents


0 download

DESCRIPTION

Detection of Spaced Motifs using Submotif Pattern Mining. S.M. Yiu Department of Computer Science The University of Hong Kong. Joint work with Ken Sung (Genome Institute of Singapore and NUS) Edward Wijaya, Rajaraman Kanagasabai (Institute for Infocomm Research). Outline. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Detection of Spaced Motifs using Submotif Pattern Mining

Detection of Spaced Motifs using Submotif Pattern Mining

S.M. YiuDepartment of Computer Science

The University of Hong Kong

Joint work with Ken Sung (Genome Institute of Singapore and NUS)

Edward Wijaya, Rajaraman Kanagasabai (Institute for Infocomm Research)

1

Page 2: Detection of Spaced Motifs using Submotif Pattern Mining

Outline

• Introduction to Bioinformatics Research & Motif Finding Problem

• Spaced Motifs and Our contributions

• Some Technical Background – String representation of Motifs

• The Model for Spaced Motif

– Gaps in Motifs

– Submotif notion

• SPACE

– Algorithm for Spaced Motifs

• Experimental Results & Conclusions

• Other Related Projects

2

Page 3: Detection of Spaced Motifs using Submotif Pattern Mining

Bioinformatics Research

Goal: To help bio-medical research community to discover biological meaningful knowledge

How CS helps: By providing computational efficient tools for analysis of huge amount of data and/or solving computational intensive problems.

May make use of techniques in Algorithms, Database, AI etc.

3

Page 4: Detection of Spaced Motifs using Submotif Pattern Mining

4

Typical Process of Bioinformatics Research

Biological Scenario

Math/computational model

Verify & Test the solution using real

data

Develop efficient & effective algorithms

(solution)

siRNA can be used to de-activate genes causing cancers[How to select good candidates is not easy!]

Formulate criteria for forming siRNA molecules

Develop a computational tool for selecting siRNA candidates

An example

Page 5: Detection of Spaced Motifs using Submotif Pattern Mining

Two issues:1)Difficult to derive a correct model [even the bio-medical researchers may not know the exact criteria];

2)Complexity of the problem and huge amount of data

Typical Process of Bioinformatics Research

Biological Scenario

Math/computational model

Verify & Test the solution using real

data

Develop efficient & effective algorithms

(solution)

5

Page 6: Detection of Spaced Motifs using Submotif Pattern Mining

The motif finding problem - a fundamental and critical problem in bio-medical research

- To activate a gene, there is a corresponding short sequence (called binding site) in our DNA needs to be first interacted (binded) by a particular transcription factor.

6

…..AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA…..

Binding site

Transcription Factor A

Gene A

activated to produce a protein

DNA sequence is considered as a sequence of {A, C, G, T}

Page 7: Detection of Spaced Motifs using Submotif Pattern Mining

7

•The binding sites that can interact with the same transcription factor are similar in pattern (but usually do not have the same pattern). The general pattern of a set of related binding sites is referred as the motif.

AGCTAAACCACGTGGCATGG

AGCCACGCGCGTGGCATGG

AGCTAAACCACGTGGCATGTGC

ACCCGTGCCACGTGGCATGG

…GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAAT……GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG……CCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTG……GTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAG…………………………………………………………………………………………………………………………………………

Given: A set of sequences that possibly contain the binding sites of the same transcription factor.What we know: the binding sites should occur abundantly (unlikely to occur by random) inside these sequences.The Motif Finding Problem: to identify a set of similar patterns that occur a lot in these sequences and predict the motif.

Page 8: Detection of Spaced Motifs using Submotif Pattern Mining

Importance of this motif finding problem

- Solving this problem provides critical information for the study of genetic diseases and drug design.- Motif finding becomes a daily task for bio-medical researchers.

Within the last few years, more than 30 new software tools were developed and more than 100 research papers were published!

8

Page 9: Detection of Spaced Motifs using Submotif Pattern Mining

Issues Representations of Motif

Different representations can model different motifs (binding sites) and have different descriptive power.

Scoring function To formalize the concept of “occurred abundantly” (not occur by random).

Complexity of the motif finding problem: some versions have been proved to be NP-hard.

Practical concern: Dataset with noise ………

Page 10: Detection of Spaced Motifs using Submotif Pattern Mining

Traditional motif models

“Assume that a motif is a contiguous substring in the

sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

Still a lot of known binding sites not being captured by these models

10

Page 11: Detection of Spaced Motifs using Submotif Pattern Mining

The Spaced Motif

Our contributions:- Traditional model assumes no gaps in motif or allows at most one gap. We derive a new model to allow multiple gaps with different lengths in motifs (spaced motifs)- Formulate the spaced motif finding problem as a frequent submotif mining problem (database technique)- We developed an effective algorithm (SPACE) to solve the problem.- Experimental results verified that the new model represents real motifs better than existing models.

11

Page 12: Detection of Spaced Motifs using Submotif Pattern Mining

Two common representations for motif:

1. String Representation

A motif is modeled as a length-L string- two variations: mismatch model and wildcard model

A (L, d)-mismatch motif M is a length-L string such that a Length-L string with at most d mismatches from M is considered as a binding site.

TTGACATCGACATTGACATTGAAAATGACATTGACAGTGACATTGACTTTGACCTTGACA

Motif: TTGACABinding sites: with at most 1 mismatch

This is a (6, 1)-mismatch motif!

Note: the mismatch can occur at any position!

Page 13: Detection of Spaced Motifs using Submotif Pattern Mining

1. String Representation

A motif is modeled as a length-L string- two variations: mismatch model and wildcard model

A (L, d)-wildcard motif M is a length-L string such that a binding site is one with at most d wildcards w.r.t. M.

TTGACATCGATATTGATATGGAAATTGAGATCGACATGGAGATTGACATAGATATCGAGA

In this case, the binding sites should be better modeled by a (6, 2)-wildcard motif: T*GA*A

2. Positional Weight Matrix (PWM) Representation

Page 14: Detection of Spaced Motifs using Submotif Pattern Mining

Gaps in motif

Examples of motifs that are not contiguous substrings in the sequence!!

e.g. ARCA-P (Liu and DeWulf, 2004)The motif is known to be “GTTAA n n n n n n GTTAA”

HAP1 (Scherling and Holmberg, 1996)The motif is known to be “CGG n n n TA n CGG n n n TA” Gaps (spacers)

Issues: We don’t know the number of gaps and the length of each gap! The search space may be huge if we try all combinations.

14

Page 15: Detection of Spaced Motifs using Submotif Pattern Mining

Previous work?

• Most previous work usually does not allow gaps in a motif (called monads).

• Some algorithms (like MITRA) allow gaps. But they only allow one gap.

In this work, we attempt to develop a better approach to locate motifs with arbitrary number of gaps of possibly different lengths.

Page 16: Detection of Spaced Motifs using Submotif Pattern Mining

Our Proposed Motif Model

We define a (spaced) motif M as a length-L string with characters {A, C, G, T, n} with at least c x L non-n characters (c is called the coverage).

E.g. L=15, c=l/2,M=ACGCnnGCGTnCTCA

Each maximal substring of consecutive n represents a gap/spacer.

Each maximal substring of {A, C, G, T} characters represents a segment.

16

Assumption: total gap length is relatively small w.r.t. L.

Page 17: Detection of Spaced Motifs using Submotif Pattern Mining

E.g. L=15, c=l/2,M=ACGCnnGCGTnCTCA

Note: we allow any number of segments (and gaps), and segments (as well as gaps) can have different lengths.

Notion of submotif [to capture similarity!]:

We fix a parameter ls, any length-ls substring within any segment of M is called a submotif.

e.g. if ls = 3, ACG; CGC; GCG; CGT; CTC; TCA are submotifs of M.

17

Page 18: Detection of Spaced Motifs using Submotif Pattern Mining

Instance (binding site) of a motifA length-L string I in a given sequence is called an instance (a binding site) of M if I contains all submotifs with at most d mismatch ((ls,d)-substrings) for each submotif.

e.g. Assume (ls, d) = (3,1)M=ACGCnnGCGTnCTCAI=ATGCcgGCATtCTTA I’=ATGCcgCCATtCTTA

I is an instance; I’ is not!

The problem: Given a set S of t sequences, find all (spaced) motifs M such that M has at least q instances.

18

Page 19: Detection of Spaced Motifs using Submotif Pattern Mining

Example of Motif

• If q = 4, M=ACGCnnGCGTnCTCA is a spaced motif as M has 4 instances in S

TTCAACGCACGCGTTCTCAGCTCAGCTGAACGTCGGTCACT

CGACATTCCACGGCTATGCCGGCATACGCACCGTACGCTCC

CGGCAATTACTCGTCAGTTCTACGTGCGCGACCCCATCCCA

ACTGCGCATGAACACTCCTGAGTGATCACTAAAGTTCGGTG

19

Page 20: Detection of Spaced Motifs using Submotif Pattern Mining

Idea of the SPACE Algorithm

• Input sequences S:TTGATACCGAAGATACCGATTAGAAATCACTCA

ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC

GAAGACCGTCATGAGAAATCGCATACACGAGCA

TTCACCCGATAAAAATAAGGCTGTCTGGACTAA

TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA

20

Page 21: Detection of Spaced Motifs using Submotif Pattern Mining

Then, for each other length-L substring in S, check if the sharing (ls, d)-substrings cover at least c x L characters. e.g. L = 20, ls = 5, d = 1, c = ½ (i.e. c x L = 10).S2: ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC

Finding motif candidates

-GAAGATACCGATTAGAAATC

-GAAAA TAAAA

-GAAAAGCAGTAGTAAAACTG

Consider any length-L substring in S.e.g. L = 20S1: TTGATACCGAAGATACCGATTAGAAATCACTCA“GAAGATACCGATTAGAAATC” starting at pos 9

21

Page 22: Detection of Spaced Motifs using Submotif Pattern Mining

S1:TTGATACCGAAGATACCGATTAGAAATCACTCAS2:ACTACAGAAAAGCAGTAGTAAAACTGTACAGTCS3:GAAGACCGTCATGAGAAATCGCATACACGAGCAS4:TTCACCCGATAAAAATAAGGCTGTCTGGACTAAS5:TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA

More examples:

Note that the set of shared (ls, d) substrings may not be the same for every case.

22

Page 23: Detection of Spaced Motifs using Submotif Pattern Mining

Extracting sharing (ls, d)-substrings– GAAGATACCGATTAGAAATC

– GAAAA TAAAA– GAAAAGCAGTAGTAAAACTG (1,13)

– GAAGA ATGAGAAATC– GAAGACCGTCATGAGAAATC (1,11,12,13,14,15,16)

– CGATAA ATAAG– CCCGATAAAAATAAGGCTGT (3,4,12)

– GAAGAC TAGAAA– GAAGACCAGCAGTAGAAAAA (1,2,13,14)

23

Page 24: Detection of Spaced Motifs using Submotif Pattern Mining

Mining the frequent pattern• Find the patterns which occur at least q

times.• Example (q=3):

– (1, 13)– (1, 11, 12, 13, 14, 15, 16)– (3, 4, 12)– (1, 2, 13, 14)

• The pattern is (1, 13)• If the patterns can give a coverage of >

c x L, then we found a spaced motif.• Hence, GAAGAnnnnnnnTAGAA• This step is done using a well-known

data mining technique.24

Page 25: Detection of Spaced Motifs using Submotif Pattern Mining

Generate the results and compute the significant score

• M = GAAGAnnnnnnnTAGAA

TTGATACCGAAGATACCGATTAGAAATCACTCAACTACAGAAAAGCAGTAGTAAAACTGTACAGTCGAAGACCGTCATGAGAAATCGCATACACGAGCATTCACCCGATAAAAATAAGGCTGTCTGGACTAATCGGAACAATTACGAAGAAAAGCAGTAGAAAAA

• The motif M is scored based on how unlikely this pattern can occur in random. [Details omitted!]

25

Page 26: Detection of Spaced Motifs using Submotif Pattern Mining

Summary of the algorithm

1. Finding motif candidates– For every length-L substring P of S

• Find all motif instances of P in S• Extract the sharing (ls,d)-substrings in all motif

instances• Mine patterns occuring more than q times• Generate the motif and its significant score

2. Sort the motifs based significant score3. Report the motifs

26

Page 27: Detection of Spaced Motifs using Submotif Pattern Mining

Some technical details

• A straight-forward implementation of the previous algorithm is slow.

• The bottleneck is on the step of finding motif candidates (it takes O(Ln2) time, where n is the total length of the sequences).

27

Page 28: Detection of Spaced Motifs using Submotif Pattern Mining

Idea: Window shifting• Suppose we know coverage between P and I

– P=GAAGATACCGATTAGAAATC– GAAGA ATGAGAAATC (1, 11, 12, 13, 14, 15, 16)– I=GAAGACCGTCATGAGAAATC– Coverage = 15

• We can find the coverage between P’ and I’ efficiently– P’=AAGATACCGATTAGAAATCG– ATGAGAAATCT (10, 11, 12, 13, 14, 15, 16)– I’=AAGACCGTCATGAGAAATCT– Coverage = 11

P

P’

First length-ls substring of P

Last length-ls substring of P’

Time complexity reduced to O(n2).

28

Page 29: Detection of Spaced Motifs using Submotif Pattern Mining

Idea: Pruning on CoverageLet Sa = S[a..a+L-1] be the motif candidate.

Tb = T[b..b+L-1] is the substring being considered.

Let C be the coverage of Tb on Sa.

We can get the upper bound for the coverage of Sa+p and Tb+p easily for any p > 0.

29

• Suppose we know coverage between P and I– Sa=CGAAGATACCGTTAGAAATC– GAAGA (2)– Tb=TGAAGACCGTCTGACCGATC– Coverage = 5

• If we move both substrings to the right by one character– Sa+1=GAAGATACCGTTAGAAATCG– GAAGA AATCG (2, 16)– Tb+1=GAAGACCGTCTGACCGATCG– Coverage = 10

The coverage of Sa+p and Tb+p is upper bounded by

C + (ls-1) + p

Page 30: Detection of Spaced Motifs using Submotif Pattern Mining

Experimental Results

30

Compare the effectiveness of our tool with 13 other existing software tools.

Testing Data:1)Motif Assessment Benchmark Datasets (Martin Tompa, 2005): 56 datasets constructed from 4 different species (fly, human, mouse, yeast).2)9 datasets extracted from the literature with some identified spaced motifs.

Evaluation Measures:Sensitivity (Sn): % of known binding sites identified by the tool. Specificity (PPV): % of predicted binding sites that match with the known binding sites.

Page 31: Detection of Spaced Motifs using Submotif Pattern Mining

Experimental Results – Benchmark Dataset

31

Page 32: Detection of Spaced Motifs using Submotif Pattern Mining

Benchmark Dataset – Comparison on Four Organisms with

the best performed softwareS

eSiM

CM

C

SeS

iMC

MC

An

nS

PE

C An

nS

PE

C

Imp

rob

izer

+ Y

MF

Imp

rob

izer

+ Y

MF

Wee

der

Wee

der

32

Page 33: Detection of Spaced Motifs using Submotif Pattern Mining

Experimental Results – Spaced motif real datasets

An exampleARCA-P Literature GTTAAnnnnnnGTTAA

SPACE GTTAnnnnnATGTTA MITRA GTTAACT

SPACE also performs better (in terms of both measures) in all cases.

33

Page 34: Detection of Spaced Motifs using Submotif Pattern Mining

Conclusions

- SPACE is found to be effective in locating spaced motifs (as well as motifs without gaps).

- After the paper was published, quite a number of biologists emailed us for the tool. Some major genome laboratories, such as The Computational Biology Research Center (CBRC) of Japan, are considering to list (& include a link to our web-based tool) our tool in their collection.

- We are extending our work to handle motifs for which the same gap may have different lengths across different instances.

34

Page 35: Detection of Spaced Motifs using Submotif Pattern Mining

35

Some Related Projects

Traditional motif models

“Assume that a motif is a contiguous substring in the

sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

SpacedMotifs

Motifs with Character

Dependence

Page 36: Detection of Spaced Motifs using Submotif Pattern Mining

36

Motifs with Nucleotide Dependence

- Joint work with Prof. Chin & Henry Leung- A new model to capture the dependency of nucleotides (characters) in a motif

e.g GGT

AA

CGG

TGACGGA

The SPSP (Scored Position specific Pattern) Model

The 5th, 6th, 7th characters are dependent, i.e., if 5th character is “T”, then the 6th and 7th characters must be “GA”.

Page 37: Detection of Spaced Motifs using Submotif Pattern Mining

37

Some on-going Bioinformatics Projects

Traditional motif models

“Assume that a motif is a contiguous substring in the

sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

SpacedMotifs

Motifs with Nucleotide Dependenc

e

Ensemble Approaches for Combining Results from different Motif Finders

Page 38: Detection of Spaced Motifs using Submotif Pattern Mining

38

Traditional motif models

“Assume that a motif is a contiguous substring

in the sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

SpacedMotifs

Motifs with Nucleotide Dependenc

e

Ensemble Approaches for Combining Results from different Motif FindersSingle Motif

Protein Motif Pairs Motif Modules

<Thank you!>

Page 39: Detection of Spaced Motifs using Submotif Pattern Mining

2. Positional Weight Matrix (PWM) Representation

A motif is modeled as a 4 x L matrix. -The binding site is of length L. -The 4 rows are labeled by “A”, “C”, “G”, “T”.-The jth column provides the probability of the occurrence of each nucleotide at position j of the binding site.

1 2 3 4 5 6A 0.1 0 0 1 0.1 0.8C 0 0.1 0 0 0.9 0.1G 0.1 0 1 0 0 0T 0.8 0.9 0 0 0 0.1

Positional Weight Matrix (PWM)TTGACATCGACATTGACATTGAAAATGACATTGACAGTGACATTGACTTTGACCTTGACA

Remark: Solution space is infinite, only suboptimal answers will be produced.

Page 40: Detection of Spaced Motifs using Submotif Pattern Mining

GGTAACGG

TGACGGA

P

log(1)-log(0.8)-log(0.2)-log(1)-

log(0.4)-log(0.6)-log(1)-

S

SPSP model

Remark: A length-11 string is considered as a binding site if (i) it matches with P and (ii) its score (sum of corresponding entries) is at most 3.1.

For example, the score of “CGGATGAATGG” is -log(1)+ -log(0.6) + -log(1) + -log(0.8) + -log(1) = 1.05 < 3.1. The score of “CGGACGGAAGG” is -log(1)+ -log(0.4) + -log(1) + -log(0.2) + -log(1) = 3.6 > 3.1. The string “TGGATGAATGG” does not match with P.

40

Page 41: Detection of Spaced Motifs using Submotif Pattern Mining

Significance Test & Scoring

Intuitively, a motif is significant if-The number of occurrences is a lot more than expected-The pattern is either very conserved or occurs in quite a number of input sequences.

(1)Let Occ_s(M, e) be total no. of observed occurrences of M with at most e mutations.Let E(M, e) be expected frequency of M with at most e mutations based on a set of background sequences

Occurrence score:X(M) = log [(Occ_s(M, e)/E(M, e) x total length of seq]

41

Page 42: Detection of Spaced Motifs using Submotif Pattern Mining

(2) For a sequence s with an occurrence of M, consider the most conserved instance (let e’ be the no. of mutations). Then, E(M, e’) x Len(s) is the expected freq. of this motif in s. Note that this value is small is the motif is very conserved.

Sequence-specific score:Y(M) = Sum [log (1/E(M, e’) x Len(s))] over all sequence s.Overall score = x(M) + y(M).

42

Page 43: Detection of Spaced Motifs using Submotif Pattern Mining

GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAATGCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCGCCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTGGTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAGCAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGCGTCTCTTGACCGCTTAATCCTAAAGGGAGCTATTAGTATCCGCACATGTGAACAGGAGCGCGAGAAAGCAATTGAAGCGAAGTTGACACCTAATAACT

Given a set of (regulatory) sequences that possibly bind to the same transcription factor, the problem is to search for the common binding site pattern (motif) that is over-represented in the sequences.

Unlikely to occur by random!

e.g. TT occurs 17 times, but length of each seq. = 49, total length ~ 350, expected no. of occurrences ~ 20.

43

Page 44: Detection of Spaced Motifs using Submotif Pattern Mining

GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAATGCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCGCCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTGGTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAGCAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGCGTCTCTTGACCGCTTAATCCTAAAGGGAGCTATTAGTATCCGCACATGTGAACAGGAGCGCGAGAAAGCAATTGAAGCGAAGTTGACACCTAATAACT

TTGACATCGACATTGACATTGAAAATGACATTGACAGTGACATTGACTTTGACCTTGACA

Motif: TTGACABinding sites: with at most 1 mismatch

It occurs 10 times, but expected number of occurrences is only:

(1/4)6 x 6 x 3 x 350 ~ 1.5

44

Page 45: Detection of Spaced Motifs using Submotif Pattern Mining

An example: hm17g

45

Page 46: Detection of Spaced Motifs using Submotif Pattern Mining

Synthetic Dataset1. Motif AGTTGTC with no spacer

2. Motif CCTGTnnnAGTTGTC containing 2 segments with spacers.

3. Motif ATCGTnnnTGACCnnnCTTTC containing 3 segments of length 5 with spacers.

4. Motif CGGCnnnnnnTCTAA containing 2 segments with a spacer.

0

0.2

0.4

0.6

0.8

1

1.2

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

nSn - MITRA

nSn - SPACE

nPPV - MITRA

nPPv - SPACE

nPC - MITRA

nPC - SPACE

nCC - MITRA

nCC - SPACE

Note: for each segment, the instances may have one mismatch.

46

Page 47: Detection of Spaced Motifs using Submotif Pattern Mining

Datasets• 9 Real datasets with known spaced motifs.

• Motif Assessment Benchmark Datasets (Tompa, 2005).– Consists of 56 datasets from 4 different species

(fly, human, mouse, yeast).– Each dataset contain 1 to 35 sequences.– Sequence length up to 3K bp.

• Six synthetic datasets containing implanted motif with variations in:– spacer length.– motif parts (segments) and – segments length.

47

Page 48: Detection of Spaced Motifs using Submotif Pattern Mining

Parameterization• Perform intersections with 12 runs of these

parameter combinations:– L = 8,15,20 (max motif length)– c = 0.5, 0.8 (coverage)– q = t, 0.5t (min support)– ls = 5 (substring length)– d = 1 (mismatches)

Experimental Results

48

Page 49: Detection of Spaced Motifs using Submotif Pattern Mining

Evaluation Measures

(Recall)

(Precision) =

(Performance Coef) =

. . (Correl Coef) =

( )( )( )( )

TPSn

TP FN

TPPPV

TP FP

TPPC

TP FN FP

TP TN FN FPCC

TP FN TN FP TP FP TN FN

49

Page 50: Detection of Spaced Motifs using Submotif Pattern Mining

Benchmark Dataset – Comparison on Four Organisms with

the best performed softwareS

eSiM

CM

C

SeS

iMC

MC

An

nS

PE

C An

nS

PE

C

Imp

rob

izer

+ Y

MF

Imp

rob

izer

+ Y

MF

Wee

der

Wee

der

50