[ieee 2013 ieee symposium on computational intelligence in bioinformatics and computational biology...
Genetic Algorithm for Dimer-led and Error-restricted Spaced Motif Discovery
Tak-Ming Chan∗§, Leung-Yau Lo∗, Man-Leung Wong†, Yong Liang‡ and Kwong-Sak Leung∗
∗Department of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong
Email: {tmchan, lylo, ksleung}@cse.cuhk.edu.hk
†Department of Computing and Decision Sciences, Lingnan University, Hong Kong
Email: [email protected]
‡Faculty of Information Technology, Macau University of Science & Technology, Macau
Email: [email protected]
§Department of Integrative Biology and Physiology, UCLA, USA
Email: [email protected]
Abstract—DNA motif discovery is an important problem for deciphering protein-DNA bindings in gene regulation. To discover generic spaced motifs which have multiple conserved patterns separated by wild-cards called spacers, the genetic algorithm (GA) based GASMEN has been proposed and shown to outperform related methods. However, the over-generic modeling of any number of spacers increases the optimization difficulty in practice. In protein-DNA binding case studies, complicated spaced motifs are rare while dimers with single spacers are more common spaced motifs. Moreover, errors (mismatches) in a conserved pattern are not arbitrarily distributed as certain highly conserved nucleotides are essential to maintain bindings. Motivated by better optimization in real applications, we have developed a new method, which is GA for Dimer-led and Error-restricted Spaced Motifs (GADESM). Common spaced motifs are given special attention through dimer-led population initialization. The results on real datasets show that the dimer-led initialization in GADESM achieves better fitness than GASMEN with statistical significance. With additional error-restricted motif occurrence retrieval, GADESM has shown better performance than GASMEN on both comprehensive simulation data and a real ChIP-seq case study.
I. INTRODUCTION
In this section, spaced motif discovery is first introduced,
followed by our motivations and the paper layout.
A. Spaced Motif Discovery
Transcription Factors (TFs) are regulatory proteins binding
to certain short DNA segments (substrings) to control gene
transcriptional expressions. The short DNA substrings rec-
ognized and bound by TFs are called Transcription Factor
Binding Sites (TFBSs), which are mostly ≤ 25 characters,
or base pairs (bp), in length. Identification of TFBSs is
an important problem for understanding gene regulation in
biology. The DNA binding domains of a TF can recognize and
bind to a collection of similar TFBSs, from which a conserved
pattern called a motif can be obtained.
By exploiting intergenic DNA regions of co-regulated genes
or sequences from high-throughput chromatin immunoprecipitation
(ChIP) experiments [1], computational methods for de novo motif
discovery have been proposed to identify
the conserved and over-represented patterns. Motif discovery
serves as an attractive pre-screening alternative to costly
biological experiments, produces novel putative TFBS motifs
for further verification, and provides significant insights into
understanding regulatory mechanisms. There are three major
components for a (de novo) motif discovery algorithm: the
representation of the motif including its occurrence retrieval
criteria (e.g. a motif represented as a string consensus and
occurrences retrieved based on minimal Hamming distance);
the evaluation function to rank and choose top motifs to best
capture biological properties; and search/optimization to tackle
the NP-hardness [2] effectively and efficiently.
The numerous motif discovery algorithms can be catego-
rized by their representation methods: the consensus string
[3] and the position weight matrix (PWM) [4]. A consensus
string of the DNA is simple and can be feasibly enumerated
up to a certain length (e.g. ≤ 12 in Weeder [3]). A PWM
shows the quantitative frequencies or weights of nucleotides
in the motif. A PWM is usually generated from a set of
occurrences using consensus based methods or sampling.
The evaluation functions in general measure conservation and
over-representation, i.e. similarity and high frequency of the
motif occurrences. Representative evaluation functions include
Information Content (IC) [4], maximum a posteriori (MAP)
[5] and Bayesian scores [6]. Besides local search [5] and
single-point sampling [7], heuristic methods, in particular
evolutionary computation and genetic algorithms (GAs) [8]–[14],
have been shown to be promising in optimization capability.
They also adopt more general models for handling uncertain
motif widths and occurrence abundance [15]–[17]. Most com-
putational methods are within the above scope, discovering
motifs as single contiguous conserved patterns, which can be
referred to as monad motifs.
For a generic motif, some portions of the DNA segment are in
chemical contact with the TF; these portions are the most conserved and
specific, and form contiguous conserved patterns (referred to
as binding cores [18]–[20]). On the other hand, conservation
is not as critical in portions between binding cores (the so-
called gaps or spacers). Common spaced motifs are dyads
where two conserved patterns (i.e. two monad motifs) are
separated by one spacer [21], [22]. There are also spaced
motifs [23], [24] resulting from common machinery in gene
regulation called dimerization. In particular, two identical TFs
(homodimers, e.g. [23]) or two different but structurally similar
TFs (heterodimers, e.g. [24]) can bind together and then bind
to the target TFBS. In such cases, the two segments are
likely to be reverse complements of each other [23], [24],
and such dimers are also called palindromic (dimer) motifs.
For example, CAGGTnnACCTG is a palindromic motif where
CAGGT and ACCTG are reverse complements: CAGGT be-
comes TGGAC after string reverse, and then ACCTG after
applying complement rules A ↔ T, C ↔ G.
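The reverse-complement rule described above can be sketched in a few lines of Python. The function names here are illustrative assumptions and are not taken from the GADESM implementation.

```python
# Complement rules from the text: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(segment: str) -> str:
    """Reverse the string, then complement each nucleotide."""
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def is_palindromic_pair(left: str, right: str) -> bool:
    """True if the two segments are reverse complements of each other."""
    return reverse_complement(left) == right


print(reverse_complement("CAGGT"))            # ACCTG
print(is_palindromic_pair("CAGGT", "ACCTG"))  # True
```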
There are fewer algorithms designed for spaced motifs,
and they either require gaps in a motif to be of the same
fixed width, or handle the dyad with only 1 gap [21], [22],
[25]. SPACE [26], a consensus-based method, was proposed to
discover more generic spaced motifs with flexible gap numbers
and ranges. SPACE was shown to outperform the other spaced
motif algorithms [22], [25] on various real and benchmark
datasets. However, because of the time complexity of frequent
itemset mining employed, constraints are imposed in SPACE:
all candidate motifs are restricted to be derived exactly from
the input occurrences, while the optimal motif consensus may
not be any of them. The computational time can also be
unbounded. The recent GA based method GASMEN [27] has
shown considerably improved performance over SPACE for
both short monad and generic spaced motifs.
To apply a novel GA to generic spaced
motif discovery with flexible width ranges, GASMEN (Genetic
Algorithm for Spaced Motif Elicitation on Nucleotides)
was developed [27]; it has better generality and
performance than SPACE. GASMEN searches consensuses
with a wide range of possible widths (up to 25bp) and
relaxes substantial constraints of SPACE. GASMEN employs
submotif indexing to partition the search space into smaller
sub-spaces for GA to more likely reach optimality. In the GA,
multiple-motif control is employed and probabilistic refine-
ments are proposed to improve motif quality. The experimental
results on real spaced motif datasets show that GASMEN is
able to find motifs more accurately than SPACE. GASMEN
is also capable of finding monad motifs, outperforming both
Weeder [3] and SPACE on most of the 8 real datasets.
B. Motivations
Despite GASMEN’s encouraging results, biological phe-
nomena can be modeled and incorporated in the methodology
for better effectiveness in real practice. In particular, the over-
generic spaced motif modeling can be improved by specifically
modeling naturally common spaced motifs, e.g. dyads and
palindromic dimers, during population initialization. Error-
restricted effects on different motif positions can also be
formulated to better filter out false motif occurrences. They
are elaborated as follows:
1) Dimer-led initialization: A dimer, which consists of one
in-between spacer and two conserved patterns, is a common
spaced motif in real data. If the two conserved patterns are
reverse complements of each other, it is a palindromic dimer.
The reason is that a palindromic motif is simultaneously
bound by two identical or structurally similar TFs on the
double strands symmetrically, which can be clearly shown
in 3D structures [23], [24], [28]. Therefore, in the population
initialization of a GA for spaced motif discovery, it is more
cost-effective and realistic to generate a certain number of
dimer-led motifs, while generic spaced motifs with arbitrary spacers
should also be considered for both theoretical and potential
completeness. This can considerably reduce the search space
and optimization difficulty for discovering real biological
motifs. On the other hand, given the complexity of the
problem, we cannot consider dimer-led motifs alone but
must also cover generic spaced motifs. In summary, dimer
motifs should receive special attention as they are common
spaced motifs while generic spaced motifs should also be
covered. This motivation can be realized with well-designed
population initialization and genetic operators in GA based
methods.
2) Error restriction: In recent TF-TFBS binding sequence
association studies [18]–[20], it is observed that in the con-
served patterns referred to as binding cores, some positions
are more flexible to mutate while the others are critical to be
conserved to maintain the chemical bonds [20]. As a result,
the occurrence mismatches/errors from the motif consensus are
clustered rather than evenly distributed. Furthermore, there are
real cases of motif variations in particular positions called sub-
types [29], which can be exploited by computational methods
on some specific monad and dyad cases [30]–[32]. Therefore,
introducing error restriction strategies to penalize scattered
mismatch positions can potentially remove false motif occur-
rences. None of the recent consensus-based methods [3], [26],
[27] consider error-restricted motif occurrences. As a result,
the motif quality may be affected on real-data applications,
especially on high-throughput ChIP-chip and ChIP-seq data
[1] where up to thousands of motif occurrences are to be
identified accurately.
C. Paper Outline
Motivated by better modeling and optimization performance
in real-data applications, we develop the novel Genetic Al-
gorithm for Dimer-led and Error-restricted Spaced Motifs
(GADESM), which introduces dimer-led population initial-
ization and motif occurrence error-restriction to GASMEN.
Dimers are specially handled in initialization while generality
on evolving generic spaced motifs is maintained (see results).
Error-restricted effects are considered to remove potential false
motif occurrences. The detailed methodology of GADESM is
elaborated in Section II. Experimental results on comprehen-
sive simulated and real datasets are reported in Section III.
Discussion and conclusion are available in Section IV.
II. METHODS
In this section, spaced motifs are first introduced in the
context of dimers and error restriction, followed by the
methodology of GADESM.
2013 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 199
A. Error-restricted Spaced Motifs
We first introduce the generic spaced motif definitions
employed by SPACE [26] and GASMEN [27], and then further
elaborate dimer motifs and error-restricted spaced motifs.
1) Generic Spaced Motifs and Dimers: A generic DNA
spaced motif (or simply a motif) M is a width W string
formed by characters of {A, C, G, T, n}, where each maximal
substring of consecutive “n”s represents a gap (or spacer)
and each maximal contiguous substring of other characters
represents a conserved pattern called a segment. To avoid trivial
conserved patterns, the width of each segment should be ≥ w for a
predefined w, which is the minimal segment width. Any substring of width w
within a segment (i.e. without “n”) is called a submotif. We set
w = 4 and W = 25 in the experiments.
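As a minimal sketch of this definition, the following Python splits a motif string into its segments and gaps and lists the submotifs of a segment. The helper names and the choice to ignore trailing "n" padding are assumptions made for illustration.

```python
import re


def parse_spaced_motif(motif: str, w: int = 4):
    """Split a motif over {A, C, G, T, n} into segments and gaps.

    Returns (segments, gaps) as lists of (start, text) pairs and raises
    ValueError if a segment is shorter than the minimal width w.
    Trailing "n"s (padding up to width W) are ignored.
    """
    core = motif.rstrip("n")
    segments = [(m.start(), m.group()) for m in re.finditer(r"[ACGT]+", core)]
    gaps = [(m.start(), m.group()) for m in re.finditer(r"n+", core)]
    for _, seg in segments:
        if len(seg) < w:
            raise ValueError(f"segment {seg!r} is shorter than w={w}")
    return segments, gaps


def submotifs(segment: str, w: int = 4):
    """All width-w windows of one segment."""
    return [segment[i:i + w] for i in range(len(segment) - w + 1)]


segments, gaps = parse_spaced_motif("CAGTCAnnnnnnTGACTGnnnnnnn")
print(segments)             # [(0, 'CAGTCA'), (12, 'TGACTG')]
print(gaps)                 # [(6, 'nnnnnn')]
print(submotifs("CAGTCA"))  # ['CAGT', 'AGTC', 'GTCA']
```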
A dyad motif, or simply a dimer, is a special case of the
spaced motif with exactly two segments separated by one
spacer. The two segments are independent and may have
different lengths.
A palindromic dimer (motif) is a special case of the
dyad/dimer, as well as the spaced motif, with exactly two
segments each being the reverse complement (i.e. palindromic)
of the other, separated by one spacer. M in the follow-
ing illustrative example is a spaced motif, a dyad and a
palindromic dimer, because the two segments CAGTCA and
TGACTG are palindromic, separated by one spacer with 6
“n”s. During motif discovery, palindromic dimers can be more
easily detected than generic spaced motifs in a data-driven
manner via sampling from the input substrings.
2) Error-restricted Motif Occurrence: In GASMEN and
SPACE, for a defined spaced motif M , a substring is called its
occurrence O if for every submotif in M , the corresponding
non-n w-substring in O is within Hamming distance d = 1,
i.e. every sliding w-segment within a conserved pattern is
with at most 1 error. Note that the spacers (“n”) portions
are not considered as errors. The following example with
w = 4,W = 25 illustrates the 8 occurrences for a given
spaced motif M , where the effective motif width is 18 (the 7
trailing “n”s trimmed for a practical motif), and the errors in
the occurrences are underlined.
• M=CAGTCAnnnnnnTGACTGnnnnnnn
• O1=CAGTGAccacgcTCACTC
• O2=CAGTCTggtgcgTGTCTG
• O3=CAGTCAtactgaTGACTG
• O4=GAGTCGatacttTGTCTG
• O5=CAGTCTgggataTGACTG
• O6=CTGTCTtgcaagGGACTT
• O7=CAGTCAtactgaTGACTG
• O8=CAGTCAtactgaTGACTG
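The first-stage occurrence test can be sketched as a sliding-window check. This is an illustrative reading of the definition above, not the GASMEN or SPACE code; the function names are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_occurrence(motif: str, candidate: str, w: int = 4, d: int = 1) -> bool:
    """True if every submotif of `motif` matches `candidate` within distance d.

    Windows that overlap a spacer ("n") are skipped, so spacer positions
    never count as errors.
    """
    motif = motif.rstrip("n")  # drop trailing padding
    if len(candidate) != len(motif):
        return False
    for start in range(len(motif) - w + 1):
        window = motif[start:start + w]
        if "n" in window:
            continue
        if hamming(window, candidate[start:start + w].upper()) > d:
            return False
    return True


M = "CAGTCAnnnnnnTGACTGnnnnnnn"
print(is_occurrence(M, "CAGTGAccacgcTCACTC"))  # True  (O1)
print(is_occurrence(M, "GTGTCAtactgaTGACTG"))  # False (two errors in CAGT)
```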
According to the uneven error distribution mentioned, some
of these occurrences may be false positives, as only the max-
imal error (Hamming distance d = 1) is specified but not the
error distributions. On the other hand, it is difficult to anticipate
the error distribution of the motif occurrences in advance in
motif discovery. To alleviate the problem, a two-stage data-
driven method is proposed for error-restricted motif occurrence
retrieval. The idea is that the majority of the occurrences of a
generic spaced motif should be true positive occurrences after
the Hamming distance filtering. Only a few false positives are
left with errors scattering around. Assigning penalties inversely
proportional to the error frequencies can effectively remove
possible false positives, because they exhibit errors in positions
beyond the error-restricted (error-clustered) positions.
The method is elaborated as follows. Given a motif M , its
occurrences are first retrieved according to the generic spaced
motif definitions. With the first-stage occurrences, an error
map is established for the error frequencies e(i) over different
positions in the segments:
$$e(i) = \frac{\text{no. of errors at position } i}{\text{no. of occurrences}} \tag{1}$$
where i is the ith position in a segment. In the illustrative
example above, e(1) = 1/8 because O4 has an error (a
mismatch) G compared to C in the first position of M .
Similarly, e(2) = 1/8 and e(6) = 4/8. For e(i), an error-
restricted penalty is proposed as pen(i) = 1− e(i). The error
frequency for “n” can be set as 1, and therefore there will be
no penalty as a spacer can match any character. Table I shows
part of the error map and error penalties of the illustrative
example.
TABLE I: Illustration of the Error Map and Error Penalties
M       C    A    G    T    C    A    n    n    ...
i       1    2    3    4    5    6    7    8    ...
e(i)    1/8  1/8  0/8  0/8  1/8  4/8  8/8  8/8  ...
pen(i)  7/8  7/8  8/8  8/8  7/8  4/8  0/8  0/8  ...
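The error map and penalties in Table I can be reproduced with a short Python sketch. The function names are assumptions made for illustration; exact fractions are used so the values can be compared with the table directly.

```python
from fractions import Fraction

MOTIF = "CAGTCAnnnnnnTGACTG"
OCCURRENCES = [
    "CAGTGAccacgcTCACTC", "CAGTCTggtgcgTGTCTG", "CAGTCAtactgaTGACTG",
    "GAGTCGatacttTGTCTG", "CAGTCTgggataTGACTG", "CTGTCTtgcaagGGACTT",
    "CAGTCAtactgaTGACTG", "CAGTCAtactgaTGACTG",
]


def error_map(motif, occurrences):
    """e(i): fraction of occurrences that mismatch the motif at position i."""
    n = len(occurrences)
    e = []
    for i, m in enumerate(motif):
        if m == "n":
            e.append(Fraction(1))  # spacer: error frequency is set to 1
        else:
            errors = sum(occ[i].upper() != m for occ in occurrences)
            e.append(Fraction(errors, n))
    return e


def penalties(e):
    """pen(i) = 1 - e(i); spacer positions therefore get penalty 0."""
    return [1 - x for x in e]


e = error_map(MOTIF, OCCURRENCES)
print([str(x) for x in e[:8]])             # ['1/8', '1/8', '0', '0', '1/8', '1/2', '1', '1']
print([str(p) for p in penalties(e)[:8]])  # ['7/8', '7/8', '1', '1', '7/8', '1/2', '0', '0']
```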
The first-stage occurrences are then re-evaluated according to their
total penalty within each segment, i.e. the sum of pen(i) over the
mismatched positions of that segment. If the total penalty of any
segment is > 1.0, the occurrence is eliminated in the error-restricted
occurrence retrieval. Therefore, O1, O4 and O6 are
eliminated; the total penalties for the two segments are
shown after each occurrence below (underlined for penalties
> 1.0).
• M=CAGTCAnnnnnnTGACTGnnnnnnn
• O1=CAGTGAccacgcTCACTC 0.875 1.625
• O2=CAGTCTggtgcgTGTCTG 0.500 0.750
• O3=CAGTCAtactgaTGACTG 0.000 0.000
• O4=GAGTCGatacttTGTCTG 1.375 0.750
• O5=CAGTCTgggataTGACTG 0.500 0.000
• O6=CTGTCTtgcaagGGACTT 1.375 1.625
• O7=CAGTCAtactgaTGACTG 0.000 0.000
• O8=CAGTCAtactgaTGACTG 0.000 0.000
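The second-stage filtering can be sketched as follows. The code assumes that the total penalty of a segment is the sum of pen(i) over its mismatched positions, which reproduces the values listed above; the helper names are hypothetical.

```python
MOTIF = "CAGTCAnnnnnnTGACTG"
OCCURRENCES = {
    "O1": "CAGTGAccacgcTCACTC", "O2": "CAGTCTggtgcgTGTCTG",
    "O3": "CAGTCAtactgaTGACTG", "O4": "GAGTCGatacttTGTCTG",
    "O5": "CAGTCTgggataTGACTG", "O6": "CTGTCTtgcaagGGACTT",
    "O7": "CAGTCAtactgaTGACTG", "O8": "CAGTCAtactgaTGACTG",
}


def segment_spans(motif):
    """(start, end) index pairs of the maximal non-'n' runs."""
    spans, start = [], None
    for i, ch in enumerate(motif + "n"):  # sentinel closes the last run
        if ch != "n" and start is None:
            start = i
        elif ch == "n" and start is not None:
            spans.append((start, i))
            start = None
    return spans


def position_penalties(motif, occs):
    """pen(i) = 1 - e(i); spacer positions get penalty 0."""
    n = len(occs)
    return [0.0 if m == "n" else 1 - sum(o[i].upper() != m for o in occs) / n
            for i, m in enumerate(motif)]


def segment_totals(motif, occ, pen):
    """Sum pen(i) over the mismatched positions of each segment."""
    return [sum(pen[i] for i in range(s, e) if occ[i].upper() != motif[i])
            for s, e in segment_spans(motif)]


pen = position_penalties(MOTIF, list(OCCURRENCES.values()))
kept = [name for name, occ in OCCURRENCES.items()
        if all(total <= 1.0 for total in segment_totals(MOTIF, occ, pen))]
print(segment_totals(MOTIF, OCCURRENCES["O1"], pen))  # [0.875, 1.625]
print(kept)                                           # ['O2', 'O3', 'O5', 'O7', 'O8']
```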
With the error-restricted occurrence retrieval, occurrences
with errors scattering around are penalized because they are
more likely to be false positives. Based on the occurrence
statistics obtained from the generic spaced motif, penalties
can be generated in a data-driven (occurrence based) manner
without directly enumerating all possible error distributions.
The final spaced motif and its filtered occurrences are more
conserved and better match the biological observations. The
procedure is integrated in the evaluation part of GADESM.
B. Dimer-led Population Initialization
Generic spaced motifs impose a huge pattern space for
search and optimization. Therefore, submotif indexing has
been employed by GASMEN [27]. In particular, all substrings
in the input dataset are indexed according to w submotif in-
dices: AAAA, AAAC, AAAG, ..., TTTT. GA is then applied to
each submotif index individually. The partitioning gives the
GA a higher chance of reaching optimality in a smaller
sub-space. The same submotif indexing is adopted in GADESM.
Unlike the hybridized population initialization in
GASMEN [27], where half of the population is explicitly
initialized as monad motifs and half as generic spaced motifs,
we reserve a proportion of the initial population for dimers in
GADESM. Given a submotif index (e.g. ACCG), an individual
has 1/3 chance to be initialized as a monad motif, a generic
dyad motif, or a palindromic dimer, respectively. The monad
case is the same as that in GASMEN. The dyad case is a
special case of the generic spaced motif where there are only
two segments separated by a randomly generated spacer. Note
that the two segments are independent and can have different
lengths for a dyad.
The last case, the palindromic dimer, is handled in a data-driven
manner as mentioned before. It is difficult to directly generate
a palindromic dimer motif as there is a huge search space
for segment pattern, segment length and spacer length. On
the other hand, if a palindromic dimer motif does exist in
the input dataset, it is easy to find some occurrences with two
approximate palindromic substrings. Therefore, for a submotif
index (e.g. ACCG), we randomly select a substring belonging
to it, e.g. ACCGTACATACGGT..., and then find the maximal
substring that can be a reverse complement (palindrome) of the
extended submotif index within maximal error d = 1. In this
case, ACGGT is an exact palindrome of ACCGT, one character
extended from the index ACCG. The motif individual is
then initialized as ACCGTnnnnACGGTnnn... If the current
substring does not have any approximate palindrome, the motif
individual is then initialized to be a monad motif randomly as
a degenerate case.
One may note that the population initialization in GADESM
does not explicitly produce more complicated generic spaced
motifs with more than two segments. This does not mean that
GADESM prevents generic spaced motifs from evolving, as the
genetic operators and probabilistic refinement adopted from
GASMEN are able to produce more complicated spaced motifs
if they have higher fitness. As a result, the
dimer-led population initialization in GADESM pays special
attention to the common dimer motifs, while together with
the other GA components generic spaced motifs can still be
found. This is also verified by the overall better performance
in the simulation experiments with 3-segment (3block) generic
spaced motifs.
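The data-driven palindromic dimer initialization can be sketched as below. Several details are assumptions not fixed by the text: the search prefers the longest left segment and then the shortest spacer, the spacer has at least one "n", and the right segment of the seed is written as the exact reverse complement of the left one. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def palindromic_dimer_seed(substring: str, w: int = 4, W: int = 25, d: int = 1):
    """Return a width-W palindromic dimer seeded from `substring`, or None.

    The left segment is a prefix of the substring (an extension of the
    submotif index); a right segment is accepted if it is within Hamming
    distance d of the reverse complement of the left segment.
    """
    s = substring[:W].upper()
    for seg_len in range(len(s) // 2, w - 1, -1):  # longest segments first
        left = s[:seg_len]
        target = reverse_complement(left)
        for gap in range(1, len(s) - 2 * seg_len + 1):
            right = s[seg_len + gap:2 * seg_len + gap]
            if hamming(target, right) <= d:
                return (left + "n" * gap + target).ljust(W, "n")
    return None  # caller falls back to a random monad motif


seed = palindromic_dimer_seed("ACCGTGGGGACGGTCC")  # hypothetical input substring
print(seed)
print(palindromic_dimer_seed("AAAAAAAAAAAA"))      # None: no approximate palindrome
```

When the function returns None, the individual would be initialized as a monad motif, matching the degenerate case described above.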
C. The Overall GADESM
TABLE II: The pseudo-code of GADESM
Motif width W, submotif width w, distance d, motif number n, difference threshold α
Submotif indexing
for each submotif index {
    Dimer-led Population Initialization for monad, dyad, palindromic dimer
    for each generation g {
        Same GA procedures as GASMEN [27]
        Evaluation (σ(M)) with Error-restricted Occurrence Retrieval
    }
}
Output in the top final motifs

To highlight the contributions of dimer-led population
initialization and error-restricted occurrence retrieval, GADESM
adopts all other procedures from GASMEN [27], which
serves as a powerful platform for GA based spaced motif
discovery. In particular, to fairly evaluate the optimization
capability of GADESM, we use the same evaluation function
σ for both GASMEN and GADESM, which considers the best
occurrence, if any, on each sequence:
For each input sequence {Si} and a candidate motif M , we
consider the most conserved (error-restricted) occurrence of
M in each sequence Si and let ei be the Hamming distance
of this best occurrence. 1/N(Si) represents the frequency of
the best occurrence in sequence Si, where N(Si) is its total
count of characters. E(M, ei) is the expected frequency that
occurrences of M come from the non-motif background. The
frequency can be calculated using pre-computed background
statistics based on Markov chains [3], [26], [27]. Thus the log
relative frequency ratio σ(M) between all best occurrences of
M and the background is defined as:
$$\sigma(M) = \sum_i \log \frac{1}{E(M, e_i) \cdot N(S_i)}. \tag{2}$$
If the pattern is very conserved and/or its best occurrences
are in many sequences, σ(M) is large. Note that the evaluation
function is suitable for both monad and spaced motifs.
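Equation (2) can be sketched numerically as follows. The expected background frequencies E(M, e_i) are taken as precomputed inputs here, since the Markov-chain background model is not reproduced; the natural logarithm and the function name are assumptions.

```python
import math


def sigma(expected_freqs, seq_lengths):
    """sigma(M) = sum_i log(1 / (E(M, e_i) * N(S_i))).

    expected_freqs[i] is E(M, e_i) for the best occurrence in sequence S_i,
    and seq_lengths[i] is N(S_i). Only sequences that contain an occurrence
    should be passed in.
    """
    return sum(math.log(1.0 / (e * n))
               for e, n in zip(expected_freqs, seq_lengths))


# Hypothetical values for two sequences, for illustration only.
score = sigma([0.001, 0.0001], [100, 200])
print(round(score, 4))
```

A smaller expected background frequency (a more conserved pattern) or more sequences with a best occurrence both increase the score, in line with the remark above.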
The overall GADESM approach is illustrated in Table II
where the new contributions, namely the Dimer-led Population Initialization and Error-restricted Occurrence Retrieval, are shown in bold. More details of the adopted features
can be found in GASMEN [27].
III. EXPERIMENTAL RESULTS
In this section, we compare GADESM and GASMEN on
both simulation and real experiments to evaluate the dimer-
led population initialization and error-restricted occurrence re-
trieval. In all the experiments, both GADESM and GASMEN
were set with the same parameters. The GA population size
was 100, mutation rate was 0.5, generation number g = 100,
and unchanged generation count for convergence was 10.
A. Simulation Experiments
We first perform comprehensive and challenging simulation
experiments to verify the performance of GADESM method
against GASMEN.
1) Simulation Data Generation: We generated simulation
data with four different types of motifs (monad, 3block, dimer
and rdimer), to evaluate the performance difference between
GASMEN and GADESM. The different motif types and their
patterns are listed in Table III, where a string of X’s means a
specific conserved pattern (segment) will be randomly gener-
ated, denoted as a block specifically for the simulation part.
Strings of “n”s are spacers. So monad is a single conserved
block (segment); 3block consists of three blocks with two
spacers; dimer and rdimer both have two blocks with a spacer
in between. In dimer (dyad), the two segments are different
patterns, while in rdimer, the spaced motif is palindromic, i.e.
the two segments are reverse complements of each other. They
simulate dimers for common dyads and palindromic dimers
respectively.
TABLE III: Motif Types Generated
Type    Pattern
monad   XXXXXXXX
3block  XXXXnnXXXXXXnnnnXXXXXX
dimer   XXXXXnnnnXXXXX
rdimer  XXXXXnnnnXXXXX
We also simulated the clustered errors as follows: for real
motif occurrences, within each segment, 40% (rounded up)
of the positions are randomly chosen to allow substitution
errors, e.g. in this rdimer XeeXXnnnnXeXeX, only positions
2 and 3 of the first segment, and positions 2 and 4 of the
second segment are allowed to have substitution errors in
its real occurrences. In each real occurrence, each segment
independently has a probability ε of having a substitution
error in one of the allowed positions. We also generated
fake motif occurrences (false positives), where some segments
contain substitution errors out of the allowed positions within
the segment. Therefore, the fake motif occurrences are less
conserved than the real motif occurrences, and the real motif
occurrences have substitution errors clustered in only certain
positions within a segment.
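The clustered-error simulation for real occurrences can be sketched as follows. This is an illustrative reading of the description, not the authors' generator; the function names and the seeded random generator are assumptions.

```python
import math
import random


def allowed_positions(seg_len: int, rng: random.Random, fraction: float = 0.4):
    """Choose ceil(fraction * seg_len) positions that may carry errors."""
    k = math.ceil(fraction * seg_len)
    return sorted(rng.sample(range(seg_len), k))


def real_occurrence(segments, allowed, eps: float, rng: random.Random):
    """Mutate each segment with probability eps at one allowed position."""
    out = []
    for seg, positions in zip(segments, allowed):
        chars = list(seg)
        if rng.random() < eps:
            i = rng.choice(positions)
            chars[i] = rng.choice([b for b in "ACGT" if b != chars[i]])
        out.append("".join(chars))
    return out


rng = random.Random(0)            # seeded only to make the sketch repeatable
segments = ["ACGTA", "TACGT"]     # hypothetical rdimer blocks
allowed = [allowed_positions(len(s), rng) for s in segments]
print(allowed)                    # two allowed positions per 5-bp segment
print(real_occurrence(segments, allowed, eps=0.8, rng=rng))
```

A fake occurrence would instead place a substitution outside the allowed positions of at least one segment.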
For each of the four motif types, we tested ε = 0.5 and
ε = 0.8. For each setting, we generated 10 random datasets.
In each dataset, we generated 100 sequences, 80 of which
contained motif occurrences at uniformly randomly chosen
positions, and the remaining 20 were only the background
sequences. Then we randomly chose the positions within the
motif where substitution errors were allowed, such that the
motif occurrences containing substitution errors in disallowed
positions were fake motifs, and the remaining occurrences
were true occurrences. Under this method of generation, the
number of true occurrences is relatively small, mimicking the
challenging situation of having noisy background containing
patterns similar to the true motifs. The average number of true
occurrences in each setting is listed in Table IV. The background
was generated according to a uniform multinomial distribution.
We therefore generated a total of 80 (4 × 2 × 10)
simulation datasets. We ran GASMEN and GADESM on these
datasets with the same parameter settings.
TABLE IV: Average Number of True Motif Occurrences
         monad  3block  dimer  rdimer
ε = 0.5  59.5   32.8    40.8   38.1
ε = 0.8  48.2   16.6    22.3   22.6
2) Performance Evaluation Metrics: We employ the stan-
dard performance evaluation metrics as listed in Table V,
namely recall, precision and f-measure on both the site-level
(prefix s−) and nucleotide-level (prefix n−). For site-level, a
prediction is correct if and only if it covers at least 40% of
the actual motif occurrence.
TABLE V: Performance Metrics
Metric   Recall          Precision       F-measure
Meaning  TP / (TP + FN)  TP / (TP + FP)  2 ∗ Recall ∗ Precision / (Recall + Precision)
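The metrics in Table V translate directly into code. In this sketch, returning 0.0 when a denominator is zero is an assumption; the paper does not state how that case is handled.

```python
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(rec: float, prec: float) -> float:
    return 2 * rec * prec / (rec + prec) if rec + prec else 0.0


# Hypothetical counts, for illustration only.
r = recall(tp=8, fn=2)       # 0.8
p = precision(tp=8, fp=8)    # 0.5
print(r, p, round(f_measure(r, p), 4))
```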
3) Simulation Results: Figure 1 shows the boxplots of
the average performance comparisons between GASMEN and
GADESM for ε = 0.8. Because of space limits, we show
only the nucleotide-level metrics and site-level f-measure,
because the results on both levels are consistent as illustrated
by Figures 1c and 1d. The results for ε = 0.5 are qualitatively
similar, leading to the same conclusion.
In Figure 1a, GADESM shows considerably better preci-
sion than GASMEN because the error-restricted occurrence
retrieval can effectively remove deceptive fake occurrences.
On the other hand, GADESM maintains comparable recall, as
shown in Figure 1b, except for the 3block cases. Overall,
GADESM performs better than GASMEN
with respect to f-measure for all data types. The differences
are larger in the common dyad (dimer) and palindromic dimer
(rdimer) cases, indicating the dimer-led population initializa-
tion can considerably improve the GA optimization capacity in
these cases. On the other hand, GADESM also demonstrates
better performance in generic spaced motif (3block) and
monad cases. As a result, the dimer-led population initial-
ization and error-restricted occurrence retrieval for GADESM
have been verified on the comprehensive and challenging
simulation experiments.
B. Real Experiments for Dimer-led Initialization
Besides the simulation data, we also evaluate the dimer-
led population initialization alone (i.e. without error-restricted
occurrence retrieval) on real datasets in terms of optimization
capacity.
1) 8 Real Benchmark Datasets: The 8 real benchmark
datasets detailed in [15], [27] were employed to test the
optimization capacity of GADESM and GASMEN. The 8
datasets cover different motif properties: motif widths from
6 to 22, sequence lengths from 105 to over 300, and sequence
numbers from 17 to 95. Among the 8 datasets, the CRP (cyclic
AMP receptor protein) binding site motif in E. coli is a spaced
motif with width 22, which contains two weakly conserved
monad motifs separated by a gap [4]. The ERE dataset con-
tains binding sites called estrogen response elements (EREs)
Fig. 1: Comparison of GASMEN and GADESM on Simulation Data, ε = 0.8
(a) Nucleotide-level Precision (b) Nucleotide-level Recall
(c) Nucleotide-level F-measure (d) Site-level F-measure
with high affinity and activates gene expression in response
to estradiol [33]. The E2F family [34] binding sites are from
mammalian sequences. The five additional datasets for the TFs
of CREB, MEF2, MYOD, SRF and TBP are from the ABS
eukaryotic database [35].
2) Comparisons and Statistical Significances: In this evaluation,
only the population initialization methods are different
for GADESM and GASMEN. We investigate the optimization
results in terms of the final fitness in top ranked motifs reported
by the two methods. For each of the 8 datasets, GADESM and
GASMEN were each run 30 times, and the averaged top motif
fitness scores with the standard deviations (SD) are shown
in Table VI. GADESM not only achieves better averaged
fitness scores in all of the datasets, but also shows lower SD
in most cases. To evaluate the statistical significance of the
averaged top reported fitness scores, the one-sided Wilcoxon
rank-sum test (Mann-Whitney U test) [36] was performed with
the p-value threshold p ≤ 0.05. The differences are
significant for all 8 datasets, as shown in Table VI. Without
any prior knowledge of which datasets contain palindromic
motifs, the dimer-led population initialization generates more
realistic candidate motifs (individuals) in real practice for
motif discovery. Provided that the same evaluation function,
same genetic components and probabilistic refinement are
adopted, the population initialization method of GADESM
shows significantly better optimization results compared with
GASMEN.
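A one-sided rank-sum comparison of two sets of fitness scores can be sketched in pure Python as below. This sketch uses a normal approximation without tie or continuity correction, which is an assumption and may differ from the exact procedure behind the p-values in Table VI; the scores are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """U statistic: number of pairs with x_i > y_j (ties count 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x for b in y)


def one_sided_p(x, y):
    """Approximate p-value for the alternative 'x tends to exceed y'."""
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


# Hypothetical fitness scores from repeated runs (not taken from Table VI).
method_a = [60.1, 61.3, 59.8, 62.0, 60.7]
method_b = [58.9, 59.2, 57.5, 59.9, 58.1]
print(mann_whitney_u(method_a, method_b))
print(one_sided_p(method_a, method_b) <= 0.05)
```

For small samples or many ties, an exact test from a statistics library would be more appropriate than this approximation.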
C. Real ChIP-seq Case Study with Error-restriction
Lastly, we demonstrate a real ChIP-seq dataset case study
where the error-restricted occurrence retrieval of GADESM
would potentially improve GASMEN’s motif quality.
1) c-Myc Dataset: The c-Myc dataset is a ChIP-seq dataset
chosen from the recent stem cell study [37]. It contains
Data   GADESM            GASMEN            p-value
CREB   60.248 ± 2.642    59.278 ± 2.701    0.003417
CRP    54.608 ± 1.334    48.460 ± 6.340    0.000003
E2F    87.813 ± 0.000    87.503 ± 0.509    0.000063
ERE    83.519 ± 11.218   66.886 ± 10.560   0.000003
MEF2   65.64 ± 4.893     62.318 ± 4.719    0.006691
MYOD   60.515 ± 4.226    56.077 ± 2.581    0.000002
SRF    87.448 ± 4.193    79.953 ± 4.974    0.000000
TBP    118.839 ± 4.800   113.728 ± 6.324   0.000366
TABLE VI: Comparisons of the averaged top reported fitness
scores between GADESM and GASMEN on the 8 real bench-
mark datasets. ± indicates SD.
3418 sequences each around 100bp in length. We selected
this dataset because the binding properties of the Myc TF
and its patterns are well studied. The TFBS motif for the
oncogenic Myc family is called the E-box (Enhancer Box)
with consensus CAnnTG, where a palindromic canonical motif
is CACGTG. Previous literature has discussed
the binding preferences of the binding motifs: CACGTG for
Myc, for which the 3D TF-TFBS binding structure is known
[24]. Therefore, without any error-restricted control for a motif
discovery algorithm, false positives similar to the Myc motif
CACGTG, e.g. occurrences like nACGTn, may be included
and thus affect the accuracy for the Myc motif. Importantly,
such subtle difference may affect the power to distinguish Myc
motifs from other motifs.
2) Error-restriction on c-Myc: With GASMEN tested on
the c-Myc dataset, we demonstrate that the discovered motif
occurrences can be further refined using the error-restricted
occurrence retrieval of GADESM. GASMEN successfully
identified the CACGTG c-Myc motif shown in Figure 2a.
However, a number of occurrences deviate significantly from
the canonical motif, and only 1679 of the 2847 discovered
occurrences match the regular expression CAnnTG.
The error-restricted occurrence retrieval of GADESM was then
applied, resulting in the error position penalties for AGCGTG:
0.93 0.84 0.87 0.87 0.86 0.94. Occurrences very likely to
be false positives according to the binding properties were
removed, and the pattern generated from these removed
occurrences is shown in Figure 2c. As a result, an improved
and more conserved CACGTG motif with 2769 occurrences was
output, in which the first C and the last G are more
conserved (Figure 2b).
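One plausible way to read position-specific error penalties is as weights on mismatches, so that an error at a highly conserved position costs more than one elsewhere. The sketch below uses the six penalties reported above, which are highest at the first and last positions (0.93 and 0.94). Everything else is an assumption made for illustration: pairing the penalties with the consensus CACGTG, summing the penalties over mismatched positions, and the threshold `MAX_PENALTY`. It should not be taken as the scoring rule GADESM actually uses.

```python
# Penalties as reported in the text, one per motif position.
PENALTIES = [0.93, 0.84, 0.87, 0.87, 0.86, 0.94]
CONSENSUS = "CACGTG"   # assumption: penalties aligned to this consensus
MAX_PENALTY = 1.0      # hypothetical cut-off, not a value from the paper

def mismatch_penalty(site, consensus=CONSENSUS, penalties=PENALTIES):
    """Sum the penalties of all positions where site differs from consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(p for s, c, p in zip(site.upper(), consensus, penalties) if s != c)

def retain(site, max_penalty=MAX_PENALTY):
    """Keep an occurrence only if its total mismatch penalty is small enough."""
    return mismatch_penalty(site) <= max_penalty

# Hypothetical occurrences, not taken from the c-Myc dataset.
for site in ["CACGTG", "CATGTG", "GACGTC"]:
    print(site, round(mismatch_penalty(site), 2), retain(site))
```

Under these assumptions a single central mismatch (CATGTG) is tolerated, while an nACGTn-style site with errors at both flanking positions (GACGTC) is rejected.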
IV. DISCUSSION AND CONCLUSION
Existing generic spaced motif discovery methods are either
limited or over-generic in modeling. Most consensus-based
motif discovery algorithms have no dedicated motif occurrence
retrieval step, so false positives are likely to be introduced.
Motivated by our biological observations and by the goal of
better optimization performance, we have developed the new
GADESM (GA for Dimer-led and Error-restricted Spaced Motifs)
to discover spaced motifs, based on the successful GA-based
GASMEN. With the dimer-led population initialization, common
spaced motifs such as dyads and palindromic dimers receive
more attention, while the generality of evolving generic
spaced motifs is still maintained by adopting the genetic
operators of GASMEN. With error-restricted occurrence
retrieval, motif occurrences are re-evaluated in order to
eliminate false positives whose errors are scattered across
positions. The results on real datasets have shown that the
dimer-led initialization in GADESM achieves statistically
significantly better fitness than that in GASMEN.
Additionally, with error-restricted motif occurrence retrieval,
GADESM has shown better performance than GASMEN in
spaced motif discovery on synthetic and real data.
Furthermore, the error-restricted occurrence retrieval is
potentially useful for improving motif quality on ChIP-seq data.
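To make the notion of dyads and palindromic dimers concrete, the sketch below builds single-spacer dimer patterns from a half-site. It is a hypothetical illustration of the pattern shapes only, not the GADESM initialization procedure; the function names and the use of lowercase n for spacer wildcards are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string; wildcards (N) are left unchanged."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def palindromic_dimer(half, spacer_len):
    """Half-site, a spacer of wildcards, then the half-site's reverse complement."""
    return half + "n" * spacer_len + reverse_complement(half)

def direct_repeat_dimer(half, spacer_len):
    """Half-site repeated in the same orientation around a wildcard spacer."""
    return half + "n" * spacer_len + half

def is_palindromic(motif):
    """True if the motif equals its own reverse complement (case-insensitive)."""
    return motif.upper() == reverse_complement(motif)

print(palindromic_dimer("CAC", 0))      # half-site CAC with no spacer
print(palindromic_dimer("AGGTCA", 3))   # half-site with a 3-nt spacer
print(direct_repeat_dimer("TGACC", 2))
```

With the half-site CAC and no spacer, the palindromic construction yields CACGTG, the canonical E-box discussed in the case study.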
In summary, we have focused on improving the population
initialization and motif occurrence retrieval, and both have
achieved very promising results compared with GASMEN.
In the future, we will further investigate the genetic
operators, especially the local operators, for overall
improvement of the GA in real spaced motif discovery
applications. The overhead introduced by the two-stage
error-restricted occurrence retrieval will be reduced by
designing more advanced one-round error-restricted occurrence
retrieval methods. The submotif indexing will be revisited for
better efficiency, as there are now a large number of indices
and the GA has to run once on each, but the trade-off in
performance has to be carefully handled.
ACKNOWLEDGMENT
This research is partially supported by the Direct Grant
of CUHK, the General Research Fund (Project Number:
LU310111) of Hong Kong SAR, China, and the Macau Science
and Technology Development Fund (Grant No. 017/2010/A2)
of Macau SAR, China.
REFERENCES
[1] A. D. Smith, P. Sumazin, D. Das, and M. Q. Zhang, “Mining ChIP-chip data for transcription factor and cofactor binding sites,” Bioinformatics, vol. Suppl 1, no. 20, pp. i403–i412, 2005.
[2] M. Li, B. Ma, and L. Wang, “Finding similar regions in many sequences,” Journal of Computer and System Sciences, vol. 65, pp. 73–96, 2002.
[3] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Res., vol. 32, pp. W199–W203, 2004.
[4] G. D. Stormo, “Computer methods for analyzing sequence recognition of nucleic acids,” Annu. Rev. BioChem., vol. 17, pp. 241–263, 1988.
[5] T. L. Bailey, “Fitting a mixture model by expectation maximization to discover motifs in biopolymers,” in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1994, pp. 28–36.
[6] S. T. Jensen, X. S. Liu, Q. Zhou, and J. S. Liu, “Computational discovery of gene regulatory binding motifs: a bayesian perspective,” Statistical Science, vol. 19, no. 1, pp. 188–204, 2004.
[7] G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. DeMoor, P. Rouze, and Y. Moreau, “A gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes,” J. Comput. Biol., vol. 9, pp. 447–464, 2002.
[8] D. Wang and X. Li, “GAPK: genetic algorithms with prior knowledge for motif discovery in DNA sequences,” in CEC’09: Proceedings of the Eleventh conference on Congress on Evolutionary Computation. Piscataway, NJ, USA: IEEE Press, 2009, pp. 277–284.
[9] T.-M. Chan, K.-S. Leung, and K.-H. Lee, “TFBS identification based on genetic algorithm with combined representations and adaptive post-processing,” Bioinformatics, vol. 24, no. 3, pp. 341–349, 2008.
Fig. 2: The c-Myc motif discovered without and with error-restriction. The red lines indicating the highest information content show that the first C and the last G are more conserved. Motifs are generated using WebLogo [38]. Panels: (a) Without Error-restriction; (b) With Error-restriction of GADESM; (c) The Pattern of Occurrences Removed Using Error-Restriction.
[10] M. A. Lones and A. M. Tyrrell, “A co-evolutionary framework for regulatory motif discovery,” in Evolutionary Computation, 2007. CEC 2007. IEEE Congress on, 2007, pp. 3894–3901.
[11] M. Stine, D. Dasgupta, and S. Mukatira, “Motif discovery in upstream sequences of coordinately expressed genes,” in Evolutionary Computation, 2003. CEC ’03. The 2003 Congress on, vol. 3, 2003, pp. 1596–1603.
[12] T. K. Paul and H. Iba, “Identification of weak motifs in multiple biological sequences using genetic algorithm,” in GECCO ’06: Proceedings of the 8th annual conference on Genetic and evolutionary computation, 2006, pp. 271–278.
[13] L. Li, “GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery,” Journal of Computational Biology, vol. 16, no. 2, pp. 317–329, Feb. 2009.
[14] J. wei Luo and T. Wang, “Motif discovery using an immune genetic algorithm,” Journal of Theoretical Biology, vol. 264, no. 2, pp. 319–325, 2010.
[15] Z. Wei and S. T. Jensen, “GAME: detecting cis-regulatory elements using a genetic algorithm,” Bioinformatics, vol. 22, no. 13, pp. 1577–1584, 2006.
[16] T.-M. Chan, G. Li, K.-S. Leung, and K.-H. Lee, “Discovering multiple realistic tfbs motifs based on a generalized model,” BMC Bioinformatics, vol. 10, no. 1, pp. 321+, October 2009. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-10-321
[17] T.-M. Chan, K.-S. Leung, and K.-H. Lee, “Memetic algorithms for de novo motif discovery,” Evolutionary Computation, IEEE Transactions on, vol. 16, no. 5, pp. 730–748, Oct. 2012.
[18] K.-S. Leung, K.-C. Wong, T.-M. Chan, M.-H. Wong, K.-H. Lee, C.-K. Lau, and S. K. W. Tsui, “Discovering protein-dna binding sequence patterns using association rule mining,” Nucleic Acids Res, vol. 38, pp. 6324–6337, 2010.
[19] T.-M. Chan, K.-C. Wong, K.-H. Lee, M.-H. Wong, C.-K. Lau, S. K. Tsui, and K.-S. Leung, “Discovering approximate-associated sequence patterns for protein-DNA interactions,” Bioinformatics, vol. 27, no. 4, pp. 471–478, Feb. 2011.
[20] T.-M. Chan, K.-S. Leung, K.-H. Lee, M.-H. Wong, T. C.-K. Lau, and S. K. W. Tsui, “Subtypes of associated protein-dna (transcription factor-transcription factor binding site) patterns,” Nucleic Acids Res, vol. 40, no. 19, pp. 9392–9403, 2012.
[21] S. Sinha and M. Tompa, “Ymf: a program for discovery of novel transcription factor binding sites by statistical overrepresentation,” Nucl. Acids Res., vol. 31, no. 13, pp. 3586–3588, July 2003.
[22] E. Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences.” Bioinformatics, vol. 18 Suppl 1, 2002.
[23] P. C. Ma, M. A. Rould, H. Weintraub, and C. O. Pabo, “Crystal structure of myod bhlh domain-dna complex: perspectives on dna recognition and implications for transcriptional activation.” Cell, vol. 77, no. 3, pp. 451–9, 1994.
[24] S. K. Nair and S. K. Burley, “X-ray structures of myc-max and mad-max recognizing dna. molecular bases of regulation by proto-oncogenic transcription factors.” Cell, vol. 112, no. 2, pp. 193–205, 2003.
[25] X. Liu, D. L. Brutlag, and J. S. Liu, “BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes,” in Pac. Symp. Biocomput., vol. 6, 2001, pp. 127–138.
[26] E. Wijaya, K. Rajaraman, S.-M. Yiu, and W.-K. Sung, “Detection of generic spaced motifs using submotif pattern mining,” Bioinformatics, vol. 23, no. 12, pp. 1476–1485, 2007.
[27] T.-M. Chan, K.-S. Leung, K.-H. Lee, and P. Lio’, “Generic spaced dna motif discovery using genetic algorithm,” in Evolutionary Computation (CEC), 2010 IEEE Congress on, Jul. 2010, pp. 2647–2654.
[28] Research Collaboratory For Structural Bioinformatics, “RCSB PDB Annual Report July 2009,” http://www.rcsb.org/pdb/, p. 4, December 2009.
[29] G. Tuteja, S. T. Jensen, P. White, and K. H. Kaestner, “Cis-regulatory modules in the mammalian liver: composition depends on strength of Foxa2 consensus site,” Nucl. Acids Res., vol. 36, no. 12, pp. 4149–4157, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1093/nar/gkn366
[30] A. E. Kel, Y. Tikunov, N. Voss, J. Borlak, and E. Wingender, “Application of kernel method to reveal subtypes of tf binding motifs,” in Regulatory Genomics 04, 2004, pp. 42–51.
[31] M. J. Mason, K. Plath, and Q. Zhou, “Identification of context-dependent motifs by contrasting ChIP binding data.” Bioinformatics (Oxford, England), vol. 26, no. 22, pp. 2826–2832, Nov. 2010.
[32] A. S. Bais, N. Kaminski, and P. V. Benos, “Finding subtypes of transcription factor motif pairs with distinct regulatory roles,” Nucleic Acids Research, vol. 39, no. 11, p. e76, Jun. 2011.
[33] C. M. Klinge, “Estrogen receptor interaction with estrogen response elements,” Nucleic Acids Res., vol. 29, pp. 2905–2919, 2001.
[34] A. E. Kel, O. V. Kel-Margoulis, P. J. Farnham, S. M. Bartley, E. Wingender, and M. Q. Zhang, “Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors,” J. Mol. Biol., vol. 309, no. 1, pp. 99–120, 2001.
[35] E. Blanco, D. Farre, M. M. Alba, X. Messeguer, and R. Guigo, “ABS: a database of annotated regulatory binding sites from orthologous promoters,” Nucleic Acids Res., vol. 34, pp. D63–D67, 2006.
[36] F. Wilcoxon, “Individual Comparisons by Ranking Methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[37] X. Chen, H. Xu, P. Yuan, F. Fang, M. Huss, V. B. Vega, E. Wong, Y. L. Orlov, W. Zhang, J. Jiang, Y.-H. Loh, H. C. Yeo, Z. X. Yeo, V. Narang, K. R. Govindarajan, B. Leong, A. Shahab, Y. Ruan, G. Bourque, W.-K. Sung, N. D. Clarke, C.-L. Wei, and H.-H. Ng, “Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells,” Cell, vol. 133, no. 6, pp. 1106–1117, Jun. 2008.
[38] G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner, “WebLogo: A Sequence Logo Generator,” Genome Research, vol. 14, no. 6, pp. 1188–1190, Jun. 2004.