2013 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 978-1-4673-5875-0/13/$31.00 ©2013 IEEE

Genetic Algorithm for Dimer-led and Error-restricted Spaced Motif Discovery

Tak-Ming Chan∗§, Leung-Yau Lo∗, Man-Leung Wong†, Yong Liang‡ and Kwong-Sak Leung∗

∗Department of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong
Email: {tmchan, lylo, ksleung}@cse.cuhk.edu.hk

†Department of Computing and Decision Sciences, Lingnan University, Hong Kong
Email: [email protected]

‡Faculty of Information Technology, Macau University of Science & Technology, Macau
Email: [email protected]

§Department of Integrative Biology and Physiology, UCLA, USA
Email: [email protected]

Abstract—DNA motif discovery is an important problem for deciphering protein-DNA bindings in gene regulation. To discover generic spaced motifs, which have multiple conserved patterns separated by wild-cards called spacers, the genetic algorithm (GA) based GASMEN has been proposed and shown to outperform related methods. However, the over-generic modeling of any number of spacers increases the optimization difficulty in practice. In protein-DNA binding case studies, complicated spaced motifs are rare, while dimers with single spacers are the more common spaced motifs. Moreover, errors (mismatches) in a conserved pattern are not arbitrarily distributed, as certain highly conserved nucleotides are essential to maintain bindings. Motivated by better optimization in real applications, we have developed a new method, GA for Dimer-led and Error-restricted Spaced Motifs (GADESM). Common spaced motifs receive special attention through dimer-led population initialization. The results on real datasets show that the dimer-led initialization in GADESM achieves better fitness than GASMEN with statistical significance. With additional error-restricted motif occurrence retrieval, GADESM has shown better performance than GASMEN on both comprehensive simulation data and a real ChIP-seq case study.

I. INTRODUCTION

In this section, spaced motif discovery is first introduced, followed by our motivations and the paper layout.

A. Spaced Motif Discovery

Transcription Factors (TFs) are regulatory proteins that bind to certain short DNA segments (substrings) to control gene transcriptional expression. The short DNA substrings recognized and bound by TFs are called Transcription Factor Binding Sites (TFBSs), which are mostly ≤ 25 characters, or base pairs (bp), in length. Identification of TFBSs is an important problem for understanding gene regulation in biology. The DNA binding domains of a TF can recognize and bind to a collection of similar TFBSs, from which a conserved pattern called a motif can be obtained.

By exploiting intergenic DNA regions of co-regulated genes or sequences from high-throughput Chromatin immunoprecipitation (ChIP) experiments [1], computational methods for de novo motif discovery have been proposed to identify the conserved and over-represented patterns. Motif discovery serves as an attractive pre-screening alternative to costly biological experiments, produces novel putative TFBS motifs for further verification, and provides significant insights into regulatory mechanisms. There are three major components of a (de novo) motif discovery algorithm: the representation of the motif, including its occurrence retrieval criteria (e.g. a motif represented as a string consensus and occurrences retrieved based on minimal Hamming distance); the evaluation function to rank and choose the top motifs that best capture biological properties; and the search/optimization to tackle the NP-hardness [2] effectively and efficiently.

The numerous motif discovery algorithms can be categorized by their representation methods: the consensus string [3] and the position weight matrix (PWM) [4]. A consensus string of the DNA is simple and can be feasibly enumerated up to a certain length (e.g. ≤ 12 in Weeder [3]). A PWM shows the quantitative frequencies or weights of nucleotides in the motif. A PWM is usually generated from a set of occurrences using consensus-based methods or sampling. The evaluation functions in general measure conservation and over-representation, i.e. similarity and high frequency of the motif occurrences. Representative evaluation functions include Information Content (IC) [4], maximum a posteriori (MAP) [5] and Bayesian scores [6]. Besides local search [5] and single-point sampling [7], heuristic methods, in particular evolutionary computation and genetic algorithms (GAs) [8]–[14], have been shown to be promising in optimization capability. They also adopt more general models for handling uncertain motif widths and occurrence abundance [15]–[17]. Most computational methods are within the above scope, discovering motifs as single contiguous conserved patterns, which can be referred to as monad motifs.

For a generic motif, a portion of the DNA segment is in chemical contact with the TFs; these positions are the most conserved and specific, and form contiguous conserved patterns (referred to as binding cores [18]–[20]). On the other hand, conservation is not as critical in the portions between binding cores (the so-called gaps or spacers). Common spaced motifs are dyads, where two conserved patterns (i.e. two monad motifs) are separated by one spacer [21], [22]. There are also spaced motifs [23], [24] resulting from a common mechanism in gene regulation called dimerization. In particular, two identical TFs (homodimers, e.g. [23]) or two different but structurally similar TFs (heterodimers, e.g. [24]) can bind together and then bind to the target TFBS. In such cases, the two segments are likely to be reverse complements of each other [23], [24], and such dimers are also called palindromic (dimer) motifs. For example, CAGGTnnACCTG is a palindromic motif where CAGGT and ACCTG are reverse complements: CAGGT becomes TGGAC after string reversal, and then ACCTG after applying the complement rules A ↔ T, C ↔ G.
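The reverse-complement rule in the example above can be illustrated with a short Python sketch. The function names `reverse_complement` and `is_palindromic_dimer` are hypothetical and chosen only for this illustration; the spacer character “n” is assumed to map to itself.

```python
# Complement rules from the text: A <-> T, C <-> G.
# Assumption: the spacer character "n" is left unchanged.
COMPLEMENT = str.maketrans("ACGTn", "TGCAn")


def reverse_complement(dna: str) -> str:
    """Reverse the string, then complement each nucleotide."""
    return dna[::-1].translate(COMPLEMENT)


def is_palindromic_dimer(first: str, second: str) -> bool:
    """True if the two segments are exact reverse complements of each other."""
    return reverse_complement(first) == second


print(reverse_complement("CAGGT"))             # ACCTG
print(is_palindromic_dimer("CAGGT", "ACCTG"))  # True
```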

There are fewer algorithms designed for spaced motifs, and they either require gaps in a motif to be of the same fixed width, or handle the dyad with only 1 gap [21], [22], [25]. SPACE [26], a consensus-based method, was proposed to discover more generic spaced motifs with flexible gap numbers and ranges. SPACE was shown to outperform the other spaced motif algorithms [22], [25] on various real and benchmark datasets. However, because of the time complexity of the frequent itemset mining employed, constraints are imposed in SPACE: all candidate motifs are restricted to be derived exactly from the input occurrences, while the optimal motif consensus may not be any of them. The computational time can also be unbounded. The recent GA-based method GASMEN [27] has shown considerably improved performance over SPACE for both short monad and generic spaced motifs.

Motivated by applying a novel GA to generic spaced motif discovery with flexible width ranges, GASMEN (Genetic Algorithm for Spaced Motif Elicitation on Nucleotides) has been developed [27], which has better generality and performance than SPACE. GASMEN searches consensuses with a wide range of possible widths (up to 25 bp) and relaxes substantial constraints of SPACE. GASMEN employs submotif indexing to partition the search space into smaller sub-spaces in which the GA is more likely to reach optimality. In the GA, multiple-motif control is employed and probabilistic refinements are proposed to improve motif quality. The experimental results on real spaced motif datasets show that GASMEN is able to find motifs more accurately than SPACE. GASMEN is also capable of finding monad motifs, outperforming both Weeder [3] and SPACE on most of the 8 real datasets.

B. Motivations

Despite GASMEN’s encouraging results, biological phenomena can be modeled and incorporated in the methodology for better effectiveness in real practice. In particular, the over-generic spaced motif modeling can be improved by specifically modeling naturally common spaced motifs, e.g. dyads and palindromic dimers, during population initialization. Error-restricted effects on different motif positions can also be formulated to better filter out false motif occurrences. They are elaborated as follows:

1) Dimer-led initialization: A dimer, which consists of two conserved patterns and one in-between spacer, is a common spaced motif in real data. If the two conserved patterns are reverse complements of each other, it is a palindromic dimer. The reason is that a palindromic motif is simultaneously bound by two identical or structurally similar TFs on the double strands symmetrically, which can be clearly shown in 3D structures [23], [24], [28]. Therefore, in the population initialization of a GA for spaced motif discovery, it is more cost-effective and realistic to generate a certain amount of dimer-led motifs. This can considerably reduce the search space and optimization difficulty for discovering real biological motifs. On the other hand, considering the complexity of the problem, we cannot consider dimer-led motifs only: generic spaced motifs with arbitrary spacers also have to be covered for both theoretical and potential completeness. In summary, dimer motifs should receive special attention as they are common spaced motifs, while generic spaced motifs should also be covered. This motivation can be realized with well-designed population initialization and genetic operators in GA-based methods.

2) Error restriction: In recent TF-TFBS binding sequence association studies [18]–[20], it is observed that in the conserved patterns referred to as binding cores, some positions are more flexible to mutate while the others must be conserved to maintain the chemical bonds [20]. As a result, the occurrence mismatches/errors from the motif consensus are clustered rather than evenly distributed. Furthermore, there are real cases of motif variations in particular positions called subtypes [29], which can be exploited by computational methods on some specific monad and dyad cases [30]–[32]. Therefore, introducing error restriction strategies to penalize scattered mismatch positions can potentially remove false motif occurrences. None of the recent consensus-based methods [3], [26], [27] considers error-restricted motif occurrences. As a result, the motif quality may be affected in real-data applications, especially on high-throughput ChIP-chip and ChIP-seq data [1], where up to thousands of motif occurrences are to be identified accurately.

C. Paper Outline

Motivated by better modeling and optimization performance in real-data applications, we develop the novel Genetic Algorithm for Dimer-led and Error-restricted Spaced Motifs (GADESM), which introduces dimer-led population initialization and motif occurrence error-restriction to GASMEN. Dimers are specially handled in initialization, while generality in evolving generic spaced motifs is maintained (see results). Error-restricted effects are considered to remove potential false motif occurrences. The detailed methodology of GADESM is elaborated in Section II. Experimental results on comprehensive simulated and real datasets are reported in Section III. Discussion and conclusion are given in Section IV.

II. METHODS

In this section, spaced motifs are first introduced in the context of dimers and error restriction, followed by the methodology of GADESM.


A. Error-restricted Spaced Motifs

We first introduce the generic spaced motif definitions employed by SPACE [26] and GASMEN [27], and then further elaborate dimer motifs and error-restricted spaced motifs.

1) Generic Spaced Motifs and Dimers: A generic DNA spaced motif (or simply a motif) M is a string of width W formed by characters of {A, C, G, T, n}, where each maximal substring of consecutive “n”s represents a gap (or spacer) and each maximal contiguous substring of other characters represents a conserved pattern called a segment. To avoid trivial conserved patterns, the width of each segment should be ≥ w for a predefined w, which is the minimal segment width. Any width-w substring without “n” is called a submotif. We set w = 4 and W = 25 in the experiments.

A dyad motif, or simply a dimer, is a special case of the spaced motif with exactly two segments separated by one spacer. The two segments are independent and may have different lengths.

A palindromic dimer (motif) is a special case of the dyad/dimer, as well as of the spaced motif, with exactly two segments, each being the reverse complement (i.e. palindromic) of the other, separated by one spacer. M in the following illustrative example is a spaced motif, a dyad and a palindromic dimer, because the two segments CAGTCA and TGACTG are palindromic, separated by one spacer with 6 “n”s. During motif discovery, palindromic dimers can be more easily detected than generic spaced motifs in a data-driven manner via sampling from the input substrings.

2) Error-restricted Motif Occurrence: In GASMEN and SPACE, for a defined spaced motif M, a substring O is called its occurrence if, for every submotif in M, the corresponding non-n w-substring in O is within Hamming distance d = 1, i.e. every sliding width-w window within a conserved pattern has at most 1 error. Note that the spacer (“n”) portions are not considered as errors. The following example with w = 4, W = 25 illustrates the 8 occurrences for a given spaced motif M, where the effective motif width is 18 (the 7 trailing “n”s are trimmed for a practical motif).

• M=CAGTCAnnnnnnTGACTGnnnnnnn
• O1=CAGTGAccacgcTCACTC
• O2=CAGTCTggtgcgTGTCTG
• O3=CAGTCAtactgaTGACTG
• O4=GAGTCGatacttTGTCTG
• O5=CAGTCTgggataTGACTG
• O6=CTGTCTtgcaagGGACTT
• O7=CAGTCAtactgaTGACTG
• O8=CAGTCAtactgaTGACTG
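The first-stage occurrence criterion can be checked on the example above with a small Python sketch. It is a minimal illustration under stated assumptions, not the GASMEN or SPACE implementation: the helper names (`hamming`, `segments`, `is_occurrence`) are hypothetical, candidates are assumed to be already aligned to the motif, and lower-case spacer characters are simply ignored.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def segments(motif: str):
    """Yield (start, end) of each maximal run of non-'n' characters."""
    start = None
    for i, ch in enumerate(motif + "n"):  # sentinel closes a trailing segment
        if ch != "n" and start is None:
            start = i
        elif ch == "n" and start is not None:
            yield start, i
            start = None


def is_occurrence(motif: str, candidate: str, w: int = 4, d: int = 1) -> bool:
    """Every sliding width-w window inside a segment must be within distance d."""
    motif = motif.rstrip("n")  # trim trailing spacers (effective motif width)
    if len(candidate) != len(motif):
        return False
    cand = candidate.upper()
    for start, end in segments(motif):
        for k in range(start, end - w + 1):
            if hamming(motif[k:k + w], cand[k:k + w]) > d:
                return False
    return True


M = "CAGTCAnnnnnnTGACTGnnnnnnn"
print(is_occurrence(M, "CTGTCTtgcaagGGACTT"))  # O6 -> True
print(is_occurrence(M, "CTTTCTtgcaagGGACTT"))  # two errors in one window -> False
```

Note that O6 passes although it carries four mismatches in total, because no single width-4 window contains more than one of them.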

According to the uneven error distribution mentioned above, some of these occurrences may be false positives, as only the maximal error (Hamming distance d = 1) is specified but not the error distribution. On the other hand, it is difficult to anticipate the error distribution of the motif occurrences in advance in motif discovery. To alleviate the problem, a two-stage data-driven method is proposed for error-restricted motif occurrence retrieval. The idea is that the majority of the occurrences of a generic spaced motif should be true positives after the Hamming distance filtering, and only a few false positives are left, with errors scattered around. Assigning penalties that decrease as the error frequency of a position increases can effectively remove possible false positives, because they exhibit errors in positions beyond the error-restricted (error-clustered) positions.

The method is elaborated as follows. Given a motif M, its occurrences are first retrieved according to the generic spaced motif definitions. With the first-stage occurrences, an error map is established for the error frequencies e(i) over different positions in the segments:

$$e(i) = \frac{\text{no. of errors}}{\text{no. of occurrences}} \qquad (1)$$

where i is the ith position in a segment and the errors are counted at that position over the first-stage occurrences. In the illustrative example above, e(1) = 1/8 because O4 has an error (a mismatch), G compared to C, in the first position of M. Similarly, e(2) = 1/8 and e(6) = 4/8. For e(i), an error-restricted penalty is proposed as pen(i) = 1 − e(i). The error frequency for “n” can be set as 1, and therefore there will be no penalty, as a spacer can match any character. Table I shows part of the error map and error penalties of the illustrative example.

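TABLE I: Illustration of the Error Map and Error Penalties

| M      | C   | A   | G   | T   | C   | A   | n   | n   | ... |
|--------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| i      | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | ... |
| e(i)   | 1/8 | 1/8 | 0/8 | 0/8 | 1/8 | 4/8 | 8/8 | 8/8 | ... |
| pen(i) | 7/8 | 7/8 | 8/8 | 8/8 | 7/8 | 4/8 | 0/8 | 0/8 | ... |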

The previous occurrences are re-evaluated according to their total penalties within each segment, i.e. the sum of pen(i) over the mismatched positions of the segment. If the total penalty of any segment is > 1.0, the occurrence is eliminated in the error-restricted occurrence retrieval. Therefore, O1, O4 and O6 are eliminated, where the total penalties for the two segments are shown after the occurrences below.

• M=CAGTCAnnnnnnTGACTGnnnnnnn
• O1=CAGTGAccacgcTCACTC 0.875 1.625
• O2=CAGTCTggtgcgTGTCTG 0.500 0.750
• O3=CAGTCAtactgaTGACTG 0.000 0.000
• O4=GAGTCGatacttTGTCTG 1.375 0.750
• O5=CAGTCTgggataTGACTG 0.500 0.000
• O6=CTGTCTtgcaagGGACTT 1.375 1.625
• O7=CAGTCAtactgaTGACTG 0.000 0.000
• O8=CAGTCAtactgaTGACTG 0.000 0.000
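The second stage can be replayed on the illustrative example. The following Python sketch is a minimal illustration, not the GADESM implementation: the function names are hypothetical, positions are 0-based in the code (the text counts from 1), and the total penalty of a segment is assumed to be the sum of pen(i) over its mismatched positions, which is consistent with the values listed above.

```python
MOTIF = "CAGTCAnnnnnnTGACTG"  # trailing spacers trimmed
OCCURRENCES = [
    "CAGTGAccacgcTCACTC",  # O1
    "CAGTCTggtgcgTGTCTG",  # O2
    "CAGTCAtactgaTGACTG",  # O3
    "GAGTCGatacttTGTCTG",  # O4
    "CAGTCTgggataTGACTG",  # O5
    "CTGTCTtgcaagGGACTT",  # O6
    "CAGTCAtactgaTGACTG",  # O7
    "CAGTCAtactgaTGACTG",  # O8
]


def error_map(motif, occurrences):
    """e(i) of Eq. (1); spacer positions ('n') are set to 1."""
    n = len(occurrences)
    return [
        1.0 if m == "n" else sum(o[i].upper() != m for o in occurrences) / n
        for i, m in enumerate(motif)
    ]


def segment_spans(motif):
    """(start, end) of each maximal run of non-'n' characters."""
    spans, start = [], None
    for i, ch in enumerate(motif + "n"):  # sentinel closes the last segment
        if ch != "n" and start is None:
            start = i
        elif ch == "n" and start is not None:
            spans.append((start, i))
            start = None
    return spans


def segment_penalties(motif, occurrence, e):
    """Sum pen(i) = 1 - e(i) over the mismatched positions of each segment."""
    return [
        sum(1.0 - e[i] for i in range(s, t) if occurrence[i].upper() != motif[i])
        for s, t in segment_spans(motif)
    ]


def error_restricted(motif, occurrences, limit=1.0):
    """Keep occurrences whose total penalty is <= limit in every segment."""
    e = error_map(motif, occurrences)
    return [o for o in occurrences
            if all(p <= limit for p in segment_penalties(motif, o, e))]


e = error_map(MOTIF, OCCURRENCES)
for name, occ in zip(("O1", "O4", "O6"), (OCCURRENCES[0], OCCURRENCES[3], OCCURRENCES[5])):
    print(name, segment_penalties(MOTIF, occ, e))
print(len(error_restricted(MOTIF, OCCURRENCES)))  # 5
```

Because the penalties are derived from the first-stage occurrences themselves, no error distribution has to be supplied in advance; the threshold of 1.0 is the one stated in the text.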

With the error-restricted occurrence retrieval, occurrences with errors scattered around are penalized because they are more likely to be false positives. Based on the occurrence statistics obtained from the generic spaced motif, penalties can be generated in a data-driven (occurrence-based) manner without directly enumerating all possible error distributions. The final spaced motif and its filtered occurrences are more conserved and better match the biological observations. The procedure is integrated in the evaluation part of GADESM.


B. Dimer-led Population Initialization

Generic spaced motifs impose a huge pattern space for search and optimization. Therefore, submotif indexing has been employed by GASMEN [27]. In particular, all substrings in the input dataset are indexed according to the width-w submotif indices: AAAA, AAAC, AAAG, ..., TTTT. The GA is then applied to each submotif index individually. The partitioning gives the GA a higher chance of achieving optimality in a smaller sub-space. The same submotif indexing is adopted in GADESM.

Different from the hybridized population initialization in GASMEN [27], where half of the population is explicitly initialized as monad motifs and half as generic spaced motifs, we reserve a proportion of the initial population for dimers in GADESM. Given a submotif index (e.g. ACCG), an individual has a 1/3 chance each of being initialized as a monad motif, a generic dyad motif, or a palindromic dimer. The monad case is the same as that in GASMEN. The dyad case is a special case of the generic spaced motif where there are only two segments separated by a randomly generated spacer. Note that the two segments are independent and can have different lengths for a dyad.

The last case, the palindromic dimer, is handled in a data-driven manner as mentioned before. It is difficult to directly generate a palindromic dimer motif, as there is a huge search space for segment pattern, segment length and spacer length. On the other hand, if a palindromic dimer motif does exist in the input dataset, it is easy to find some occurrences with two approximately palindromic substrings. Therefore, for a submotif index (e.g. ACCG), we randomly select a substring belonging to it, e.g. ACCGTACATACGGT..., and then find the maximal substring that can be a reverse complement (palindrome) of the extended submotif index within maximal error d = 1. In this case, ACGGT is an exact palindrome of ACCGT, which is the index ACCG extended by one character. The motif individual is then initialized as ACCGTnnnnACGGTnnn... If the current substring does not have any approximate palindrome, the motif individual is initialized randomly as a monad motif, as a degenerate case.

One may note that the population initialization in GADESM does not explicitly produce more complicated generic spaced motifs with more than two segments. This does not mean that GADESM prohibits generic spaced motifs from being evolved, as the genetic operators and probabilistic refinement adopted from GASMEN are able to produce more complicated spaced motifs if they have higher fitness. As a result, the dimer-led population initialization in GADESM pays special attention to the common dimer motifs, while together with the other GA components generic spaced motifs can still be found. This is also verified by the overall better performance in the simulation experiments with 3-segment (3block) generic spaced motifs.
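The submotif indexing described at the start of this subsection can be sketched in Python as follows. This is only an illustration under assumptions: the function name `submotif_index` is hypothetical, substrings are indexed by the exact width-w prefix they start with, and windows containing characters other than A, C, G, T are skipped. How GASMEN stores its index is not specified here.

```python
from itertools import product


def submotif_index(sequences, w=4):
    """Map every width-w submotif (AAAA ... TTTT) to the (sequence, offset)
    pairs at which a substring starting with that submotif begins."""
    index = {"".join(p): [] for p in product("ACGT", repeat=w)}
    for s, seq in enumerate(sequences):
        seq = seq.upper()
        for pos in range(len(seq) - w + 1):
            key = seq[pos:pos + w]
            if key in index:  # skip windows with non-ACGT characters
                index[key].append((s, pos))
    return index


idx = submotif_index(["ACCGTACATACGGT"])
print(len(idx))     # 256 indices for w = 4
print(idx["ACCG"])  # [(0, 0)]
```

A GA run for the index ACCG would then draw its initial substrings only from `idx["ACCG"]`, which is what keeps each sub-space small.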

C. The Overall GADESM

To highlight the contributions of dimer-led population initialization and error-restricted occurrence retrieval, GADESM adopts the other procedures of GASMEN [27] unchanged, which serves as a powerful platform for GA-based spaced motif discovery.

TABLE II: The pseudo-code of GADESM

Motif width W, submotif width w, distance d, motif number n, difference threshold α
Submotif indexing
for each submotif index {
    Dimer-led Population Initialization for monad, dyad, palindromic dimer
    for each generation g {
        Same GA procedures as GASMEN [27]
        Evaluation (σ(M)) with Error-restricted Occurrence Retrieval
    }
}
Output the top n final motifs

In particular, to fairly evaluate the optimization capability of GADESM, we use the same evaluation function σ for both GASMEN and GADESM, which considers the best occurrence, if any, on each sequence:

For the input sequences {Si} and a candidate motif M, we consider the most conserved (error-restricted) occurrence of M in each sequence Si and let ei be the Hamming distance of this best occurrence. 1/N(Si) represents the frequency of the best occurrence in sequence Si, where N(Si) is its total count of characters. E(M, ei) is the expected frequency with which occurrences of M come from the non-motif background. This frequency can be calculated using pre-computed background statistics based on Markov chains [3], [26], [27]. Thus the log relative frequency ratio σ(M) between all best occurrences of M and the background is defined as:

$$\sigma(M) = \sum_{i} \log \frac{1}{E(M, e_i) \cdot N(S_i)} \qquad (2)$$

If the pattern is very conserved and/or its best occurrences are in many sequences, σ(M) is large. Note that the evaluation function is suitable for both monad and spaced motifs.
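The evaluation function of Eq. (2) can be sketched in Python as below. This is a sketch under clearly stated assumptions rather than the evaluation used in GASMEN or GADESM: the expected background frequency E(M, e) is passed in as a callable, the natural logarithm is assumed, and `uniform_background` is a hypothetical toy stand-in (uniform, independent nucleotides) for the Markov-chain background statistics mentioned in the text.

```python
import math


def sigma(best_errors, seq_lengths, expected_freq):
    """Eq. (2): sum over sequences of log(1 / (E(M, e_i) * N(S_i))).

    best_errors[i] is e_i, or None when sequence i has no occurrence.
    """
    total = 0.0
    for e_i, n_i in zip(best_errors, seq_lengths):
        if e_i is None:  # "best occurrence, if any": skip empty sequences
            continue
        total += math.log(1.0 / (expected_freq(e_i) * n_i))
    return total


def uniform_background(k):
    """Toy E(M, e) (assumption): probability that a random site has exactly
    e mismatches over k conserved positions, with uniform nucleotides."""
    return lambda e: math.comb(k, e) * 3 ** e / 4 ** k


E = uniform_background(12)  # e.g. two 6-bp segments
print(sigma([0, 1, None], [200, 200, 200], E))
```

With this toy background, an exact best occurrence contributes more to σ(M) than one with a mismatch, and sequences without an occurrence contribute nothing, which matches the qualitative behaviour described above.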

The overall GADESM approach is illustrated in Table II, whose new contributions are the Dimer-led Population Initialization and the Error-restricted Occurrence Retrieval. More details of the adopted features can be found in GASMEN [27].

III. EXPERIMENTAL RESULTS

In this section, we compare GADESM and GASMEN on both simulation and real experiments to evaluate the dimer-led population initialization and error-restricted occurrence retrieval. In all the experiments, both GADESM and GASMEN were set with the same parameters. The GA population size was 100, the mutation rate was 0.5, the generation number was g = 100, and the unchanged generation count for convergence was 10.

A. Simulation Experiments

We first perform comprehensive and challenging simulation experiments to verify the performance of GADESM against GASMEN.


1) Simulation Data Generation: We generated simulation data with four different types of motifs (monad, 3block, dimer and rdimer) to evaluate the performance difference between GASMEN and GADESM. The different motif types and their patterns are listed in Table III, where a string of X’s denotes a specific conserved pattern (segment) to be randomly generated, called a block in the simulation part. Strings of “n”s are spacers. So monad is a single conserved block (segment); 3block consists of three blocks with two spacers; dimer and rdimer both have two blocks with a spacer in between. In dimer (dyad), the two segments are different patterns, while in rdimer, the spaced motif is palindromic, i.e. the two segments are reverse complements of each other. They simulate common dyads and palindromic dimers respectively.

TABLE III: Motif Types Generated

| Type   | Pattern                |
|--------|------------------------|
| monad  | XXXXXXXX               |
| 3block | XXXXnnXXXXXXnnnnXXXXXX |
| dimer  | XXXXXnnnnXXXXX         |
| rdimer | XXXXXnnnnXXXXX         |

We also simulated the clustered errors as follows: for real motif occurrences, within each segment, 40% (rounded up) of the positions are randomly chosen to allow substitution errors. For example, in the rdimer XeeXXnnnnXeXeX, only positions 2 and 3 of the first segment and positions 2 and 4 of the second segment are allowed to have substitution errors in its real occurrences. In each real occurrence, each segment independently has a probability ε of having a substitution error in one of the allowed positions. We also generated fake motif occurrences (false positives), where some segments contain substitution errors outside the allowed positions within the segment. Therefore the fake motif occurrences are less conserved than the real motif occurrences, and the real motif occurrences have substitution errors clustered in only certain positions within a segment.
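The clustered-error scheme for real occurrences can be illustrated with the following Python sketch. It is a minimal illustration, not the generator used for the experiments: the function names and the `percent` parameter are hypothetical, only the segments (not the spacers or the background) are handled, and fake occurrences are not generated.

```python
import random


def allowed_positions(segment_len, rng, percent=40):
    """Randomly pick ceil(percent% of the segment length) mutable positions."""
    k = -(-segment_len * percent // 100)  # integer ceiling division
    return sorted(rng.sample(range(segment_len), k))


def real_occurrence(segments, allowed, eps, rng):
    """Each segment independently receives, with probability eps, exactly one
    substitution at one of its allowed positions."""
    out = []
    for seg, positions in zip(segments, allowed):
        chars = list(seg)
        if rng.random() < eps:
            i = rng.choice(positions)
            chars[i] = rng.choice([c for c in "ACGT" if c != chars[i]])
        out.append("".join(chars))
    return out


rng = random.Random(0)           # fixed seed only for reproducibility
segs = ["CAGTC", "GACTG"]        # an rdimer-like pair of 5-bp blocks
allowed = [allowed_positions(len(s), rng) for s in segs]
print(allowed)
print(real_occurrence(segs, allowed, eps=0.8, rng=rng))
```

For 5-bp blocks this selects 2 allowed positions per segment, as in the XeeXXnnnnXeXeX example.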

For each of the four motif types, we tested ε = 0.5 and ε = 0.8. For each setting, we generated 10 random datasets. In each dataset, we generated 100 sequences, 80 of which contained motif occurrences at uniformly randomly chosen positions, and the remaining 20 were background sequences only. Then we randomly chose the positions within the motif where substitution errors were allowed, such that the motif occurrences containing substitution errors in disallowed positions were fake motifs, and the remaining occurrences were true occurrences. Under this method of generation, the number of true occurrences is relatively small, mimicking the challenging situation of a noisy background containing patterns similar to the true motifs. The average number of true occurrences in each setting is listed in Table IV. The background was generated according to a uniform multinomial distribution. We have therefore generated a total of 80 (4 × 2 × 10) simulation datasets. We ran GASMEN and GADESM on these datasets with the same parameter settings.

TABLE IV: Average Number of True Motif Occurrences

|         | monad | 3block | dimer | rdimer |
|---------|-------|--------|-------|--------|
| ε = 0.5 | 59.5  | 32.8   | 40.8  | 38.1   |
| ε = 0.8 | 48.2  | 16.6   | 22.3  | 22.6   |

2) Performance Evaluation Metrics: We employ the standard performance evaluation metrics listed in Table V, namely recall, precision and f-measure, on both the site level (prefix s−) and the nucleotide level (prefix n−). At the site level, a prediction is correct if and only if it covers at least 40% of the actual motif occurrence.

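TABLE V: Performance Metrics

| Metric  | Recall         | Precision      | F-measure                                     |
|---------|----------------|----------------|-----------------------------------------------|
| Meaning | TP / (TP + FN) | TP / (TP + FP) | 2 ∗ Recall ∗ Precision / (Recall + Precision) |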
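The metrics of Table V and the site-level correctness rule can be written down as a short Python sketch. The function names are hypothetical, sites are assumed to be half-open intervals [start, end) on the same sequence, and returning 0.0 for an empty denominator is a convention chosen for this sketch only.

```python
def recall(tp, fn):
    """TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(tp, fp, fn):
    """2 * Recall * Precision / (Recall + Precision)."""
    r, p = recall(tp, fn), precision(tp, fp)
    return 2 * r * p / (r + p) if r + p else 0.0


def covers(pred, actual, min_fraction=0.4):
    """Site-level rule: the prediction overlaps at least 40% of the actual site."""
    overlap = min(pred[1], actual[1]) - max(pred[0], actual[0])
    return max(overlap, 0) / (actual[1] - actual[0]) >= min_fraction


print(f_measure(tp=8, fp=2, fn=8))
print(covers((0, 10), (6, 16)))  # overlap of 4 out of 10 positions -> True
```

At the nucleotide level, TP, FP and FN would be counted per position instead of per site; the three formulas stay the same.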

3) Simulation Results: Figure 1 shows the boxplots of

the average performance comparisons between GASMEN and

GADESM for ε = 0.8. Because of space limits, we show only
the nucleotide-level metrics and the site-level f-measure; the
results on the two levels are consistent, as illustrated
by Figures 1c and 1d. The results for ε = 0.5 are qualitatively

similar, leading to the same conclusion.

In Figure 1a, GADESM shows considerably better preci-

sion than GASMEN because the error-restricted occurrence

retrieval can effectively remove deceptive fake occurrences.

On the other hand, GADESM maintains comparable recall, as
shown in Figure 1b, except for the 3block cases. Overall,
GADESM performs better than GASMEN with respect to f-measure
on all data types. The differences are larger in the common dyad
(dimer) and palindromic dimer (rdimer) cases, indicating that
dimer-led population initialization can considerably improve the
GA optimization capacity in these cases. GADESM also demonstrates
better performance in the generic spaced motif (3block) and
monad cases. The dimer-led population initialization and
error-restricted occurrence retrieval of GADESM are thus
supported by these comprehensive and challenging
simulation experiments.

B. Real Experiments for Dimer-led Initialization

Besides the simulation data, we also evaluate the dimer-

led population initialization alone (i.e. without error-restricted

occurrence retrieval) on real datasets in terms of optimization

capacity.

1) 8 Real Benchmark Datasets: The 8 real benchmark

datasets detailed in [15], [27] were employed to test the

optimization capacity of GADESM and GASMEN. The 8

datasets cover different motif properties: motif widths from

6 to 22, sequence lengths from 105 to over 300, and sequence

numbers from 17 to 95. Among the 8 datasets, the CRP (cyclic

AMP receptor protein) binding site motif in E. coli is a spaced

motif with width 22, which contains two weakly conserved

monad motifs separated by a gap [4]. The ERE dataset contains
estrogen response elements (EREs), binding sites that the estrogen
receptor binds with high affinity to activate gene expression in response
to estradiol [33]. The E2F family [34] binding sites are from
mammalian sequences. The five additional datasets for the TFs
CREB, MEF2, MYOD, SRF and TBP are from the ABS
eukaryotic database [35].

Fig. 1: Comparison of GASMEN and GADESM on Simulation Data, ε = 0.8. (a) Nucleotide-level Precision (b) Nucleotide-level Recall (c) Nucleotide-level F-measure (d) Site-level F-measure

2) Comparisons and Statistical Significances: In this evaluation,
only the population initialization methods are different

for GADESM and GASMEN. We compare the optimization
results in terms of the final fitness of the top-ranked motifs reported
by the two methods. For each of the 8 datasets, GADESM and
GASMEN were each run 30 times, and the averaged top motif
fitness scores with the standard deviations (SD) are shown

in Table VI. GADESM not only achieves better averaged

fitness scores in all of the datasets, but also shows lower SD

in most cases. To evaluate the statistical significance of the

averaged top reported fitness scores, the one-sided Wilcoxon

rank-sum test (Mann-Whitney U test) [36] was performed with

the p-value threshold p ≤ 0.05. The differences are significant
for all 8 datasets, as shown in Table VI. Without
any prior knowledge of which datasets contain palindromic motifs,
the dimer-led population initialization generates more
realistic candidate motifs (individuals) in practice for
motif discovery. Provided that the same evaluation function,

same genetic components and probabilistic refinement are

adopted, the population initialization method of GADESM

shows significantly better optimization results compared with

GASMEN.
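For readers who wish to reproduce this kind of comparison, the one-sided rank-sum test can be sketched in pure Python as below. The sketch uses the normal approximation with a tie correction and no continuity correction, so its p-values may differ slightly from those of an exact test or of the statistical package used for Table VI. The function name and the fitness values in the usage example are hypothetical and are not taken from the experiments.

```python
import math


def rank_sum_test_greater(x, y):
    """One-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Alternative hypothesis: values in `x` tend to be larger than in `y`.
    Returns (U statistic of x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values tied: no evidence either way
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return u1, p_value


# Hypothetical top fitness scores from repeated runs of two methods.
method_a = [60.1, 61.3, 59.8, 62.0, 60.7, 61.9]
method_b = [58.9, 59.5, 60.0, 58.2, 59.1, 59.7]
u, p = rank_sum_test_greater(method_a, method_b)
print(f"U = {u}, one-sided p = {p:.4f}, significant = {p <= 0.05}")
```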

C. Real ChIP-seq Case Study with Error-restriction

Lastly, we demonstrate a real ChIP-seq dataset case study

where the error-restricted occurrence retrieval of GADESM

would potentially improve GASMEN’s motif quality.

TABLE VI: Comparisons of the averaged top reported fitness scores between GADESM and GASMEN on the 8 real benchmark datasets. ± indicates SD.

Data   GADESM             GASMEN             p-value
CREB   60.248 ± 2.642     59.278 ± 2.701     0.003417
CRP    54.608 ± 1.334     48.460 ± 6.340     0.000003
E2F    87.813 ± 0.000     87.503 ± 0.509     0.000063
ERE    83.519 ± 11.218    66.886 ± 10.560    0.000003
MEF2   65.64 ± 4.893      62.318 ± 4.719     0.006691
MYOD   60.515 ± 4.226     56.077 ± 2.581     0.000002
SRF    87.448 ± 4.193     79.953 ± 4.974     0.000000
TBP    118.839 ± 4.800    113.728 ± 6.324    0.000366

1) c-Myc Dataset: The c-Myc dataset is a ChIP-seq dataset
chosen from the recent stem cell study [37]. It contains
3418 sequences, each around 100 bp in length. We selected

this dataset because the binding properties of the Myc TF

and its patterns are well studied. The TFBS motif for the

oncogenic Myc family is called the E-box (Enhancer Box)

with consensus CAnnTG, where a palindromic canonical motif

is CACGTG. Previous literature has discussed
the binding preferences of these motifs: CACGTG for
Myc, for which the 3D TF-TFBS binding structure is known
[24]. Therefore, without any error restriction in a motif
discovery algorithm, false positives similar to the Myc motif
CACGTG, e.g. occurrences like nACGTn, may be included
and thus reduce the accuracy of the Myc motif. Importantly,
such subtle differences may affect the power to distinguish Myc

motifs from other motifs.
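The distinction between E-box occurrences and look-alike false positives can be made concrete with two regular expressions, one for the consensus CAnnTG and one for the pattern nACGTn. The short sketch below is only an illustration; the occurrence strings and the names `EBOX`, `LOOKALIKE` and `matches` are hypothetical.

```python
import re

# E-box consensus CAnnTG: the outer CA and TG are fixed, the middle is free.
EBOX = re.compile(r"CA[ACGT]{2}TG")
# Look-alike pattern nACGTn: the core ACGT is fixed, the outer bases are free.
LOOKALIKE = re.compile(r"[ACGT]ACGT[ACGT]")


def matches(pattern, occurrence):
    """True if the whole 6-mer matches the given pattern."""
    return pattern.fullmatch(occurrence.upper()) is not None


occurrences = ["CACGTG", "CAGCTG", "AACGTT", "GACGTC", "CATGTG"]

# Occurrences consistent with the E-box consensus.
kept = [o for o in occurrences if matches(EBOX, o)]
# Occurrences that share the ACGT core but violate the conserved CA/TG ends.
lookalikes = [o for o in occurrences
              if matches(LOOKALIKE, o) and not matches(EBOX, o)]

print("E-box matches:", kept)
print("look-alikes:", lookalikes)
```

Note that the canonical CACGTG matches both patterns, which is why errors at the outer positions are the ones that need to be restricted.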

2) Error-restriction on c-Myc: With GASMEN tested on

the c-Myc dataset, we demonstrate that the discovered motif

occurrences could be further refined using error-restricted

occurrence retrieval of GADESM. GASMEN successfully

identified the CACGTG c-Myc motif shown in Figure 2a.

However, a number of occurrences deviate significantly from
the canonical motif, and only 1679 of the 2847 discovered
occurrences match the regular expression CAnnTG.

The error-restricted occurrence retrieval of GADESM was then

applied, resulting in the error position penalties for AGCGTG:

0.93 0.84 0.87 0.87 0.86 0.94. Occurrences very likely to

be false positives according to the binding properties were

removed, and the pattern generated from these occurrences

is shown in Figure 2c. As a result, an improved and more
conserved CACGTG motif was output, in which the first
C and the last G are more conserved, with 2769 occurrences (Figure

2b).
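To give a feel for how per-position penalties could separate occurrences, the sketch below scores each occurrence by summing the penalties at its mismatched positions and discards those above a threshold. This is an illustrative position-weighted mismatch filter only, not the error-restricted retrieval procedure of GADESM. The penalty values are those reported above, while the use of CACGTG as the reference consensus, the threshold `max_score` and the function names are assumptions of this sketch.

```python
CONSENSUS = "CACGTG"  # canonical E-box, assumed as the reference here
PENALTIES = [0.93, 0.84, 0.87, 0.87, 0.86, 0.94]  # values reported in the text


def weighted_mismatch(occurrence, consensus=CONSENSUS, penalties=PENALTIES):
    """Sum the penalties of all positions where the occurrence differs."""
    return sum(w for o, c, w in zip(occurrence, consensus, penalties)
               if o != c)


def filter_occurrences(occurrences, max_score=1.0):
    """Keep occurrences whose weighted mismatch does not exceed max_score.

    max_score is a hypothetical threshold chosen for illustration.
    """
    return [o for o in occurrences if weighted_mismatch(o) <= max_score]


candidates = ["CACGTG", "CACATG", "GACGTC"]
for occ in candidates:
    print(occ, round(weighted_mismatch(occ), 2))
print("kept:", filter_occurrences(candidates))
```

Under this reading, an occurrence with errors at both the first and last positions accumulates the two largest penalties and is the first to be rejected.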

IV. DISCUSSION AND CONCLUSION

Existing generic spaced motif discovery methods are either

limited or over-generic in modeling. Most consensus-based
motif discovery algorithms lack a dedicated motif occurrence
retrieval step, so false positives are likely to be introduced.

Motivated by our biological observations and better optimiza-

tion performance, we have developed the new GADESM (GA

for Dimer-led and Error-restricted Spaced Motifs) to discover

spaced motifs, based on the successful GA-based GASMEN.

With the dimer-led population initialization, common spaced

motifs such as dyads and palindromic dimers are paid more

attention while the generality on evolving generic spaced

motifs is still maintained by adopting the genetic operators in

GASMEN. With error-restricted occurrence retrieval, motif occurrences
are re-evaluated in order to eliminate false positives
whose errors are scattered across positions. The results on real datasets have
shown that the dimer-led initialization in GADESM achieves
statistically significantly better fitness than that in GASMEN.

Additionally, with error-restricted motif occurrence retrieval,

GADESM has shown better performance than GASMEN in

spaced motif discovery on synthetic and real data. Further-

more, the error-restricted occurrence retrieval is potentially

useful for improving motif quality on ChIP-seq data.

In summary, we have focused on improving the population

initialization and motif occurrence retrieval, and they have

achieved very promising results compared with GASMEN.

However, in the future, we will further investigate the
genetic operators, especially the local operators, for overall
improvement in the GA for real spaced motif discovery

applications. The overhead introduced by the two-stage error-

restricted occurrence retrieval will be reduced by designing

more advanced one-round error-restricted occurrence retrieval

methods. The submotif indexing will be revisited for better
efficiency, as there are now a large number of indices and the GA
has to run once on each, but the trade-off in performance has
to be carefully handled.

ACKNOWLEDGMENT

This research is partially supported by the Direct Grant

of CUHK, the General Research Fund (Project Number:

LU310111) of Hong Kong SAR, China, and the Macau Science
and Technology Development Fund (Grant No. 017/2010/A2)

of Macau SAR, China.

Fig. 2: The c-Myc motif discovered without and with error-restriction. The red lines indicating the highest information content show that the first C and last G are more conserved. Motifs are generated using WebLogo [38]. (a) Without Error-restriction (b) With Error-restriction of GADESM (c) The Pattern of Occurrences Removed Using Error-Restriction

REFERENCES

[1] A. D. Smith, P. Sumazin, D. Das, and M. Q. Zhang, “Mining ChIP-chip data for transcription factor and cofactor binding sites,” Bioinformatics, vol. Suppl 1, no. 20, pp. i403–i412, 2005.

[2] M. Li, B. Ma, and L. Wang, “Finding similar regions in many sequences,” Journal of Computer and System Sciences, vol. 65, pp. 73–96, 2002.

[3] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Res., vol. 32, pp. W199–W203, 2004.

[4] G. D. Stormo, “Computer methods for analyzing sequence recognition of nucleic acids,” Annu. Rev. BioChem., vol. 17, pp. 241–263, 1988.

[5] T. L. Bailey, “Fitting a mixture model by expectation maximization to discover motifs in biopolymers,” in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1994, pp. 28–36.

[6] S. T. Jensen, X. S. Liu, Q. Zhou, and J. S. Liu, “Computational discovery of gene regulatory binding motifs: a bayesian perspective,” Statistical Science, vol. 19, no. 1, pp. 188–204, 2004.

[7] G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. DeMoor, P. Rouze, and Y. Moreau, “A gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes,” J. Comput. Biol., vol. 9, pp. 447–464, 2002.

[8] D. Wang and X. Li, “GAPK: genetic algorithms with prior knowledge for motif discovery in DNA sequences,” in CEC’09: Proceedings of the Eleventh conference on Congress on Evolutionary Computation. Piscataway, NJ, USA: IEEE Press, 2009, pp. 277–284.

[9] T.-M. Chan, K.-S. Leung, and K.-H. Lee, “TFBS identification based on genetic algorithm with combined representations and adaptive post-processing,” Bioinformatics, vol. 24, no. 3, pp. 341–349, 2008.

[10] M. A. Lones and A. M. Tyrrell, “A co-evolutionary framework for regulatory motif discovery,” in Evolutionary Computation, 2007. CEC 2007. IEEE Congress on, 2007, pp. 3894–3901.

[11] M. Stine, D. Dasgupta, and S. Mukatira, “Motif discovery in upstream sequences of coordinately expressed genes,” in Evolutionary Computation, 2003. CEC ’03. The 2003 Congress on, vol. 3, 2003, pp. 1596–1603.

[12] T. K. Paul and H. Iba, “Identification of weak motifs in multiple biological sequences using genetic algorithm,” in GECCO ’06: Proceedings of the 8th annual conference on Genetic and evolutionary computation, 2006, pp. 271–278.

[13] L. Li, “GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery,” Journal of Computational Biology, vol. 16, no. 2, pp. 317–329, Feb. 2009.

[14] J.-W. Luo and T. Wang, “Motif discovery using an immune genetic algorithm,” Journal of Theoretical Biology, vol. 264, no. 2, pp. 319–325, 2010.

[15] Z. Wei and S. T. Jensen, “GAME: detecting cis-regulatory elements using a genetic algorithm,” Bioinformatics, vol. 22, no. 13, pp. 1577–1584, 2006.

[16] T.-M. Chan, G. Li, K.-S. Leung, and K.-H. Lee, “Discovering multiple realistic tfbs motifs based on a generalized model,” BMC Bioinformatics, vol. 10, no. 1, pp. 321+, October 2009. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-10-321

[17] T.-M. Chan, K.-S. Leung, and K.-H. Lee, “Memetic algorithms for de novo motif discovery,” Evolutionary Computation, IEEE Transactions on, vol. 16, no. 5, pp. 730–748, Oct. 2012.

[18] K.-S. Leung, K.-C. Wong, T.-M. Chan, M.-H. Wong, K.-H. Lee, C.-K. Lau, and S. K. W. Tsui, “Discovering protein-dna binding sequence patterns using association rule mining,” Nucleic Acids Res, vol. 38, pp. 6324–6337, 2010.

[19] T.-M. Chan, K.-C. Wong, K.-H. Lee, M.-H. Wong, C.-K. Lau, S. K. Tsui, and K.-S. Leung, “Discovering approximate-associated sequence patterns for protein-DNA interactions,” Bioinformatics, vol. 27, no. 4, pp. 471–478, Feb. 2011.

[20] T.-M. Chan, K.-S. Leung, K.-H. Lee, M.-H. Wong, T. C.-K. Lau, and S. K. W. Tsui, “Subtypes of associated protein-dna (transcription factor-transcription factor binding site) patterns,” Nucleic Acids Res, vol. 40, no. 19, pp. 9392–9403, 2012.

[21] S. Sinha and M. Tompa, “Ymf: a program for discovery of novel transcription factor binding sites by statistical overrepresentation,” Nucl. Acids Res., vol. 31, no. 13, pp. 3586–3588, July 2003.

[22] E. Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences.” Bioinformatics, vol. 18 Suppl 1, 2002.

[23] P. C. Ma, M. A. Rould, H. Weintraub, and C. O. Pabo, “Crystal structure of myod bhlh domain-dna complex: perspectives on dna recognition and implications for transcriptional activation.” Cell, vol. 77, no. 3, pp. 451–9, 1994.

[24] S. K. Nair and S. K. Burley, “X-ray structures of myc-max and mad-max recognizing dna. molecular bases of regulation by proto-oncogenic transcription factors.” Cell, vol. 112, no. 2, pp. 193–205, 2003.

[25] X. Liu, D. L. Brutlag, and J. S. Liu, “BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes,” in Pac. Symp. Biocomput., vol. 6, 2001, pp. 127–138.

[26] E. Wijaya, K. Rajaraman, S.-M. Yiu, and W.-K. Sung, “Detection of generic spaced motifs using submotif pattern mining,” Bioinformatics, vol. 23, no. 12, pp. 1476–1485, 2007.

[27] T.-M. Chan, K.-S. Leung, K.-H. Lee, and P. Lio’, “Generic spaced dna motif discovery using genetic algorithm,” in Evolutionary Computation (CEC), 2010 IEEE Congress on, July 2010, pp. 2647–2654.

[28] Research Collaboratory For Structural Bioinformatics, “RCSB PDB Annual Report July 2009,” http://www.rcsb.org/pdb/, p. 4, December 2009.

[29] G. Tuteja, S. T. Jensen, P. White, and K. H. Kaestner, “Cis-regulatory modules in the mammalian liver: composition depends on strength of Foxa2 consensus site,” Nucl. Acids Res., vol. 36, no. 12, pp. 4149–4157, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1093/nar/gkn366

[30] A. E. Kel, Y. Tikunov, N. Voss, J. Borlak, and E. Wingender, “Application of kernel method to reveal subtypes of tf binding motifs,” in Regulatory Genomics 04, 2004, pp. 42–51.

[31] M. J. Mason, K. Plath, and Q. Zhou, “Identification of context-dependent motifs by contrasting ChIP binding data.” Bioinformatics (Oxford, England), vol. 26, no. 22, pp. 2826–2832, Nov. 2010.

[32] A. S. Bais, N. Kaminski, and P. V. Benos, “Finding subtypes of transcription factor motif pairs with distinct regulatory roles,” Nucleic Acids Research, vol. 39, no. 11, p. e76, Jun. 2011.

[33] C. M. Klinge, “Estrogen receptor interaction with estrogen response elements,” Nucleic Acids Res., vol. 29, pp. 2905–2919, 2001.

[34] A. E. Kel, O. V. Kel-Margoulis, P. J. Farnham, S. M. Bartley, E. Wingender, and M. Q. Zhang, “Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors,” J. Mol. Biol., vol. 309, no. 1, pp. 99–120, 2001.

[35] E. Blanco, D. Farre, M. M. Alba, X. Messeguer, and R. Guigo, “ABS: a database of annotated regulatory binding sites from orthologous promoters,” Nucleic Acids Res., vol. 34, pp. D63–D67, 2006.

[36] F. Wilcoxon, “Individual Comparisons by Ranking Methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[37] X. Chen, H. Xu, P. Yuan, F. Fang, M. Huss, V. B. Vega, E. Wong, Y. L. Orlov, W. Zhang, J. Jiang, Y.-H. Loh, H. C. Yeo, Z. X. Yeo, V. Narang, K. R. Govindarajan, B. Leong, A. Shahab, Y. Ruan, G. Bourque, W.-K. Sung, N. D. Clarke, C.-L. Wei, and H.-H. Ng, “Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells,” Cell, vol. 133, no. 6, pp. 1106–1117, Jun. 2008.

[38] G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner, “WebLogo: A Sequence Logo Generator,” Genome Research, vol. 14, no. 6, pp. 1188–1190, Jun. 2004.
