TRAWLER: de novo regulatory motif discovery pipeline for
chromatin immunoprecipitation
Laurence Ettwiller, Benedict Paten, Mirana Ramialison, Ewan Birney & Joachim Wittbrodt
Supplementary figures and text:
Supplementary Figure 1. Details of Trawler’s procedure.
Supplementary Figure 2. Details of the assessment of Trawler.
Supplementary Figure 3. Analysis of the secondary motifs.
Supplementary Figure 4. Analysis of the secondary motifs.
Supplementary Figure 5. Instances of a secondary motif over-representation.
Supplementary Table 1. ChIP experiments used in this study.
Supplementary Table 2. Comparison of motifs found by Trawler and other motif
discovery algorithms on the mammalian data set.
Supplementary Note
Supplementary Methods
Supplementary Figure 1, panels (plots and graphs not reproduced):
(a) Histograms of the number of motifs (Nb of motifs) against z-score (in standard deviations), covering under-representation and over-representation, for the random set and the sample set.
(b) Over-represented motifs in the Myog pulled-down loci, clustered using the Myog pulled-down loci. Graph representation (70 % cut-off): family 1, resolved into cluster 1 and cluster 2.
(c) Over-represented motifs in the Myog pulled-down loci, clustered using randomly picked sequences. Graph representation (70 % cut-off): family 1, cluster 1 only.
Supplementary Figure 1. Details of Trawler’s procedure.
(a) Example of the distribution of z-scores in the sample set (red) and the random set (black). The sequences used here are derived from the E2F1-E2F4 data sets. Only motifs with a score above the upper limit of the score distribution in the random sequences are considered significant. (b) Motif clustering procedure. For each pair of motifs, the overlap on the sample sequences is calculated (as a percentage), and motifs that overlap 70 percent of the time or more are linked to form an undirected graph. All motifs from a connected sub-graph are part of the same family and are further resolved into cluster(s). In the Myog example, one family is found that can be further resolved into two clusters. (c) Same as (b), but this time the sequences used for clustering are randomly picked human sequences (1 kb upstream of randomly picked genes).
Supplementary Figure 2. Details of the assessment of Trawler.
(a) Left graph: assessment of Trawler’s performance relative to the other available tools, shown as the correlation coefficient by species. Right graph: different measures of accuracy (sSn, site sensitivity; sPPV, site positive predictive value; sASP, overall average site performance). Trawler was run blindly, without prior knowledge of the true motifs. (b) Detailed table of Fig. 2 showing, for each pull-down experiment, the ability of individual programs to uncover the correct binding site (BS) in yeast. For each individual ChIP experiment, the success or failure of 7 different algorithms, including Trawler, is shown. The results from the 6 other algorithms come from Harbison et al.
Supplementary Figure 3 (yeast data sets; sequence logos not reproduced). Notes accompanying each data set:
Abf1: The Abf1 motif as described by Harbison et al. is found over-represented together with the pA-pT motif and an unknown motif. The pA-pT motif also co-occurs with the Abf1 motif at a sequence level.
Aft2: The Aft2 motif occurs multiple times per sequence.
Bas1: The Bas1 motif occurs multiple times per sequence.
Cad1: The Cad1 motif does not often occur multiple times per sequence.
Cbf1: The Cbf1 motif occurs multiple times per sequence.
Cin5: The Cin5 motif occurs multiple times per sequence.
Dal82: The Dal82 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is also found in the Pdr1 data set and seems to correspond to the Reb1 or Nrg1 motif.
Dig1: The Dig1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Tec1 motif. Additionally, the Tec1 motif co-occurs with the Dig1 motif at a sequence level.
Fhl1: The Fhl1 motif occurs multiple times per sequence. The Fhl1 motif is the same as the Rap1 motif.
Fkh1: The Fkh1 motif occurs multiple times per sequence.
Fkh2: The Fkh2 motif occurs multiple times per sequence.
Gat1: The Gat1 motif does not often occur multiple times per sequence.
Gcn4: The Gcn4 motif occurs multiple times per sequence.
Gln3: The Gln3 motif occurs multiple times per sequence.
Hap1: The Hap1 motif occurs multiple times per sequence.
Hsf1: The Hsf1 motif occurs multiple times per sequence.
Ino2: The Ino2 motif is found over-represented with 2 unknown motifs, but they do not often co-occur with Ino2.
Ino4: The Ino4 motif occurs multiple times per sequence.
Leu3: The Leu3 motif does not often occur multiple times per sequence.
Mbp1: The Mbp1 motif occurs multiple times per sequence.
Mcm1: The Mcm1 motif is found over-represented with one other unknown motif, but it does not often co-occur with the Mcm1 motif.
Met4: The Met4 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Pho4/Cbf1 motif. Additionally, the Pho4/Cbf1 motif co-occurs with the Met4 motif at a sequence level.
Msn2: The Msn2 motif is found over-represented with 3 other unknown motifs, but they do not often co-occur with the Msn2 motif.
Nrg1: The Nrg1 motif occurs multiple times per sequence.
Pdr1: The Pdr1 motif is found over-represented with 2 other unknown motifs and a motif that corresponds to the Nrg1/Reb1 motif, but they do not often co-occur with the Pdr1 motif.
Phd1: The Phd1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is unknown. This unknown motif does not co-occur with the Met4 motif at a sequence level.
Pho2: The Pho2 motif as described by Harbison et al. is found over-represented together with the Pho4/Cbf1 motif and an unknown motif. The Pho4/Cbf1 motif also co-occurs with the Pho2 motif at a sequence level.
Pho4: The Pho4 motif occurs multiple times per sequence.
Rap1: The Rap1 motif occurs multiple times per sequence. The Rap1 motif is the same as the Fhl1 motif.
Rcs1: The Rcs1 motif occurs multiple times per sequence.
Rds1: The Rds1 motif occurs multiple times per sequence.
Reb1: The Reb1 motif occurs multiple times per sequence.
Rfx1: The Rfx1 motif does not often occur multiple times per sequence.
Rpn4: The Rpn4 motif occurs multiple times per sequence.
Sfp1: The Sfp1 motif occurs multiple times per sequence.
Sip4: The Sip4 motif is shown together with unknown motif 1 and unknown motif 2.
Skn7: The Skn7 motif occurs multiple times per sequence.
Snt2: The Snt2 motif occurs multiple times per sequence.
Sok2: The Sok2 motif is found over-represented with 3 other unknown motifs; two of them seem to be variants of the Sok2 motif.
Spt23: The Spt23 motif is found over-represented with 3 other unknown motifs; one of them (motif 1) seems to be a variant of the Spt23 motif.
Stb1: The Stb1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is unknown and does not often co-occur with the Stb1 motif.
Stb4: The Stb4 motif occurs multiple times per sequence.
Stb5: The Stb5 motif as described by Harbison et al. is the fourth best motif found by Trawler. The other motifs are unknown and do not often co-occur with the Stb5 motif.
Ste12: The Ste12 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Tec1 motif. Additionally, the Tec1 motif co-occurs with the Ste12 motif at a sequence level (same scenario as Dig1).
Sum1: The Sum1 motif occurs multiple times per sequence.
Sut1: The Sut1 motif occurs multiple times per sequence.
Swi4: The Swi4 motif occurs multiple times per sequence.
Swi6: The Swi6 motif occurs multiple times per sequence.
Tec1: The Tec1 motif as described by Harbison et al. is found over-represented together with the Ste12/Dig1 motif and an unknown motif. The Ste12/Dig1 motif also co-occurs with the Tec1 motif at a sequence level.
Tye7: The Tye7 motif is found over-represented with one other unknown motif, but it does not often co-occur with the Tye7 motif at a sequence level.
Ume6: The Ume6 motif occurs multiple times per sequence.
Yap1: The Yap1 motif does not often occur multiple times per sequence.
Yap7: The Yap7 motif occurs multiple times per sequence.
Ydr026C: The Ydr026C motif occurs multiple times per sequence.
Supplementary Figures 3–4. Analysis of the secondary motifs.
In yeast, of the 54 data sets that were correctly analysed by Trawler, 19 have secondary motifs (35.2 %). The total number of secondary motifs is 35 for the yeast data sets.
Supplementary Figure 4 (mammalian data sets; sequence logos not reproduced; columns: first motif, second motif, ...). Notes accompanying each data set:
Myod1: The Myod1 motif occurs multiple times.
E2F1-E2F4: The E2F1-E2F4 motif is shown together with the NFYA motif. No co-occurrence of the NFYA motif with the E2F1-E2F4 motif.
Myog: The Myog motif occurs multiple times.
NF-kappa-B: The NF-kappa-B motif does not occur multiple times.
HNF4: The HNF4 motif occurs multiple times.
ONECUT1 (HNF6): The ONECUT1 motif occurs multiple times.
NOTCH1: The NOTCH1 motif occurs multiple times.
CREB1: The CREB1 motif occurs multiple times.
SOX2 and POU5F1: The POU5F1 (OCT4) and SOX2 motifs occur multiple times; the other over-represented motifs do not co-occur with either POU5F1 (OCT4) or SOX2.
Supplementary Figure 4. Analysis of the secondary motifs.
For the mammalian data set, 4 data sets out of 9 (44 %) have secondary motifs. The total number of secondary motifs is 13.
Supplementary Figure 5, panel labels (images not reproduced): Tec1 and Ste12 binding sites as described by Harbison et al. and as found by Trawler; E2F known PWM (TRANSFAC ID: M00050) and NFYA known PWM (TRANSFAC ID: M00185), shown alongside over-represented motif family 4 and over-represented motif family 8; site labels Ste12/Tec1 and NFYA/E2F1 in the sequence panels.
Supplementary Figure 5. Instances of a secondary motif over-representation.
(a) In yeast, the Ste12 BS is found over-represented together with the Tec1 BS in the Tec1 ChIP experiments. Both PWMs found by Trawler are compared with the PWMs previously described. (b) Example of co-occurrence at the sequence level of the Tec1 and Ste12 BSs in a region of the promoter of the glutamine-fructose-6-phosphate amidotransferase gene (YKL104C) that is conserved between S. paradoxus, S. mikatae and S. bayanus. (c) In human, the NFYA binding motif is found over-represented together with the E2F1-E2F4 binding motif in the E2F1-E2F4 ChIP experiment. The PWMs found by Trawler are compared with the PWMs previously described. (d) Example of co-occurrence of E2F1-E2F4 and NFYA BSs in the intergenic region upstream of the human CDC25A gene. Both sites are conserved from human to opossum. See also Supplementary Notes for more details.
Supplementary Table 1. ChIP experiments used in this study.
Transcription factor(s) | Species | Platform | Reference in suppl. notes | Tissues/growth
52 datasets | D. melanogaster, S. cerevisiae, M. musculus, H. sapiens | - | Tompa et al. 1 | -
203 yeast transcriptional regulators | S. cerevisiae | PCR product array | Harbison et al. 4 | Different conditions and media
E2F1 and E2F4 | H. sapiens | 1.5K and Affymetrix DNA microarrays | Ren et al. 3 | Cell culture (WI-38 cells)
SOX2 | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
POU5F1 (OCT4) | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
NANOG | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
Myod1 | M. musculus | In house primer arrays | Cao et al. 1 | Mouse embryo fibroblast 12
Myog | M. musculus | In house primer arrays | Cao et al. 1 | Mouse embryo fibroblast 12
NF-kappa-B | H. sapiens | PCR product array | Schreiber et al. 13 | Human U937 cells
CREB1 | H. sapiens | PCR product array | Zhang et al. 14 | HEK293T cells and hepatocytes
HNF4A and ONECUT1 | H. sapiens | PCR product array | Odom et al. 15 | Hepatocytes and pancreatic islets
NOTCH1 | H. sapiens | PCR product array | Palomero et al. 16 | T-ALL cell
Supplementary Table 2. Comparison of motifs found by Trawler and other motif discovery algorithms on the mammalian data set.
Columns: dataset | literature | Trawler | Weeder large | Weeder small | AlignACE n=10 | AlignACE n=5 | Meme | Motifcut
(Sequence logos are not reproduced; only the text entries of each row are listed.)
Creb: program stopped (three entries)
E2f
Hnf4
Hnf6: ATCGAT; NO MOTIF FOUND
Myod: NO MOTIF FOUND; program stopped
Myog: CAGCTG; NO MOTIF FOUND
Nfkb: -
Notch: TGGGAA; NO MOTIF FOUND
Oct4: ATTTGCAT; program stopped (four entries)
Sox2: C[AT]TTGTT; program stopped (four entries)
The heights of the sequence logos should not be compared from one motif to another, since in the case of multiple motifs the scale was reduced.
« Program stopped » indicates experiments cancelled after more than 4 days of running without producing an output.
Supplementary Notes
Example of clustering of discrete motifs
The clustering of over-represented motifs in the Myog data set illustrates the clustering procedure (Supplementary Fig. 1). One family was found that can be further resolved into two clusters. The resulting matrices differ in the definition of the flanking sequences, while the core motif remains essentially the same. Interestingly, if the same set of motifs is clustered using randomly picked genomic sequences instead of the Myog pulled-down loci, no clear cluster can be seen. Thus, only in the case of the Myog pulled-down loci is the family resolved into two distinct position weight matrices (PWMs); with randomly picked sequences, the family is resolved into only one PWM. This result suggests that a clear distinction at the sequence level between the two PWMs is specific to the Myog pulled-down loci, hinting at a biological significance. Indeed, it has been shown that the related TF Myod1 also binds a large fraction of Myog-bound loci 1. The two PWMs found may represent distinct binding sites for the respective TFs. The different clusters uncovered by Trawler thus provide leads that can be taken up directly and addressed by the experimental biologist.
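The family step of the clustering procedure (Supplementary Fig. 1b) can be sketched as follows. This is a minimal illustration, not Trawler's implementation: the function names, the site representation and the exact definition of the overlap percentage (here, the fraction of instances of the rarer motif that overlap an instance of the other) are assumptions. The cut-off is applied as "70 percent or more", following the figure legend.

```python
from collections import defaultdict
from itertools import combinations


def overlap_fraction(sites_a, sites_b):
    """Fraction of instances of the rarer motif overlapping the other motif.

    Each site is a (sequence_id, start, end) tuple, end exclusive.
    This definition of overlap is an assumption made for illustration.
    """
    if not sites_a or not sites_b:
        return 0.0
    small, large = sorted((sites_a, sites_b), key=len)
    by_seq = defaultdict(list)
    for seq_id, start, end in large:
        by_seq[seq_id].append((start, end))
    hits = sum(
        any(start < e and s < end for s, e in by_seq[seq_id])
        for seq_id, start, end in small
    )
    return hits / len(small)


def motif_families(motif_sites, cutoff=0.70):
    """Families are the connected components of the motif overlap graph."""
    graph = {motif: set() for motif in motif_sites}
    for a, b in combinations(motif_sites, 2):
        if overlap_fraction(motif_sites[a], motif_sites[b]) >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    families, seen = [], set()
    for motif in graph:
        if motif in seen:
            continue
        stack, family = [motif], set()
        while stack:  # depth-first walk of one component
            node = stack.pop()
            if node in family:
                continue
            family.add(node)
            stack.extend(graph[node] - family)
        seen |= family
        families.append(family)
    return families


# Hypothetical motif instances: (sequence id, start, end)
example = {
    "CAGCTG": [("s1", 10, 16), ("s2", 40, 46), ("s3", 5, 11)],
    "CAGCTGG": [("s1", 10, 17), ("s2", 40, 47), ("s3", 5, 12)],
    "TGACTCA": [("s1", 100, 107), ("s4", 3, 10)],
}
families = motif_families(example)
```

In this toy input the two CAGCTG variants always overlap and end up in one family, while the unrelated word forms a family of its own. Resolving a family into clusters is a separate step that is not covered by this sketch.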
Examples of alignments
Alignments were performed for all the promoter-based chromatin immunoprecipitation (ChIP) experiments studied, and the motifs found were mapped to these alignments. The results can be found at http://ani.embl.de/trawler/result_paper/ . To systematically investigate whether potential binding sites are conserved in the immunoprecipitated sequences, pairwise alignments between the reference species and other species at appropriate evolutionary distances were carried out using Blastz 2. Taking the human E2F1-E2F4 ChIP data set 3 as an example, the E2F1-E2F4 and NFYA motifs found by Trawler were mapped back to the resulting multiple alignments and a conservation score was assigned to each motif (see Supplementary Methods for details). The distribution of conservation scores for each motif in the sample was plotted and compared with the distribution of conservation scores for the same motifs in upstream sequences of randomly picked genes. To rule out systematic biases due to a high overall conservation of the sequences in the sample, random motifs were created and the distribution of their conservation scores was plotted in the same way (see Fig. A below).
Figure A. Distribution of absolute conservation scores for 5 different motif families in (a) random and (b) sample sequences (from the E2F1-E2F4 chromatin IP). Two families correspond to the E2F1-E2F4 and NFYA binding sites; the 3 other families are random motifs with numbers of occurrences similar to those of the E2F1-E2F4 and NFYA motifs.
Sites corresponding to the E2F1-E2F4 and NFYA motifs in the sample sequences show a pattern of conservation scores very different from that in the randomly picked sequences, with a bimodal distribution of conservation scores present only in the sample sequences. Indeed, a higher number of sites (10.1 % of the E2F1-E2F4 sites and 38.6 % of the NFYA sites) have an absolute conservation score of 4 or above in the sample sequences. Conversely, the random motifs, despite having a slight tendency to be more conserved in the sample sequences, do not show such bimodal distributions. This result shows that a substantial number of E2F1-E2F4-like and NFYA-like sites are conserved and are likely under purifying selection in the sample sequences, providing an independent validation of Trawler’s predictions. Similar results were obtained with the other data sets tested (data not shown). Conversely, immunoprecipitated sequences also contain a substantial proportion of sites that are not conserved and may either not be functional or be fast-evolving. In the example of the E2F1-E2F4 ChIP sequences, non-conserved motifs matching the E2F1-E2F4 PWM account for 13 percent of all the sites, highlighting the importance of distinguishing the most likely functional sites. Considering these findings, we included a last step in the pipeline that sorts the sites according to their conservation score, to retain sites likely to be functional. Thus, both at the level of the abstract motif description and at the level of individual sites, conservation across evolutionary time allows the assessment of functionality.
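The final sorting step described above can be illustrated with a short sketch. The `Site` record, the function name and the example values are hypothetical; the conservation score itself is defined in the Supplementary Methods and is simply taken as given here, and the threshold of 4 is borrowed from the discussion of Figure A.

```python
from typing import NamedTuple


class Site(NamedTuple):
    sequence_id: str
    start: int
    motif: str
    conservation: int  # absolute conservation score, assumed precomputed


def rank_sites(sites, min_score=4):
    """Sort sites by decreasing conservation and report the conserved fraction."""
    ranked = sorted(sites, key=lambda s: s.conservation, reverse=True)
    conserved = [s for s in ranked if s.conservation >= min_score]
    fraction = len(conserved) / len(ranked) if ranked else 0.0
    return ranked, fraction


# Hypothetical sites, for illustration only
sites = [
    Site("seq1", 120, "E2F", 5),
    Site("seq2", 40, "E2F", 0),
    Site("seq3", 310, "NFYA", 4),
    Site("seq4", 15, "NFYA", 2),
]
ranked, fraction = rank_sites(sites)
```

Sorting rather than filtering keeps the non-conserved sites available, which matters because some of them may be fast-evolving rather than non-functional.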
Details on Trawler’s performance on the yeast data set from Harbison et al.
We compared the results from Trawler to the high-confidence motifs found in other studies 4,5. Sixty-five motifs were previously found by combining the results of six different motif discovery methods 5 with conservation across yeast species. Fifty-four of these motifs (83 %) were found by Trawler alone (Fig. 2b, Fig. S3 and Fig. S4). The total number of families found by Trawler is 112 (corresponding to 263 PWMs), representing at least 48.2 % of true positives (20.5 % in terms of PWMs). The remaining 58 families (209 PWMs) that did not match the previously found PWMs may thus represent additional binding sites. On average, Trawler found 1.7 families per data set (112/65) or 4 PWMs per data set (263/65). For comparison, the average number of PWMs per data set is 40 for Converge, 2.479 for Kellis, 35 for Mdscan, 6 for MEME and MEME_c, and 170.44 for AlignACE (from the online supporting files of Harbison et al. v24, http://fraenkel.mit.edu/Harbison/release_v24/); see Table A (below) for details. Weeder was not included in the comparison by Harbison et al.
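The percentages and averages quoted for Trawler follow directly from the four counts given in the text. The short calculation below only makes that arithmetic explicit; the variable names are illustrative.

```python
known_motifs = 65       # data sets with a previously described motif
found_by_trawler = 54   # of those, recovered by Trawler
families = 112          # motif families reported by Trawler
pwms = 263              # PWMs reported by Trawler

summary = {
    "known motifs recovered (%)": round(100 * found_by_trawler / known_motifs, 1),
    "true positives among families (%)": round(100 * found_by_trawler / families, 1),
    "true positives among PWMs (%)": round(100 * found_by_trawler / pwms, 1),
    "families per data set": round(families / known_motifs, 1),
    "PWMs per data set": round(pwms / known_motifs, 1),
    "unmatched families": families - found_by_trawler,
    "unmatched PWMs": pwms - found_by_trawler,
}

for label, value in summary.items():
    print(f"{label}: {value}")
```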
Algorithm | Number of real motifs (PWM) found | Total experiments | Percentage of positive motifs (PWM) | Average motif (PWM) number | Total motifs (PWM)
Converge | 38 | 65 | 58.4 | 40 | 2600
AlignACE | 38 | 65 | 58.4 | 170.4 | 11076
Kellis | 27 | 65 | 41.5 | 2.4 | 156
Mdscan | 41 | 65 | 63.0 | 35 | 2275
Meme | 36 | 65 | 55.4 | 6 | 390
Meme_c | 38 | 65 | 58.5 | 6 | 390
Trawler | 54 | 65 | 83.1 | 4 | 263
Table A: Summary of the performance of Trawler and other algorithms on the yeast data sets. Only experiments where a known binding site was found by at least one algorithm (Harbison et al. v24) are used (65 experiments in total).
Example of co-occurrences
The loci bound by Abf1 in yeast contain, in addition to the canonical Abf1 motif, the polyA-polyT motif, which is over-represented and co-occurring. Abf1 is a multifunctional global regulator with possible chromatin-reorganizing and DNA-bending activities, whereas the polyA-polyT motif has been shown to induce an intrinsic curvature in the DNA 6. One possible function of this motif is to bring other regulatory elements into proximity of the Abf1 binding site 7. We also found additional over-represented PWMs specific to given cell states in yeast. For example, the loci bound by Dig1 and Ste12 contain, in addition to the canonical Dig1 and Ste12 sites, over-represented and co-occurring Tec1 binding sites (Supplementary Fig. 5a) under conditions of filamentous growth in haploid cells 8. This result is in accordance with previous studies showing that genes involved in filamentous growth are bound by the Tec1/Ste12/Dig1 complex. The co-occurrence of Ste12 and Tec1 binding sites can also be detected at the sequence level in many loci involved in filamentous growth, including the upstream region of the gfa1 gene, which encodes an enzyme involved in the first step of the chitin biosynthesis pathway (Supplementary Fig. 5b). Over-representation of additional motifs can also be detected in some ChIP experiments from vertebrates. For example, in the E2F1-E2F4 data set 3, we additionally found the binding site of NFYA (Atf6) over-represented and co-occurring (Supplementary Fig. 5c). Co-occurrences of E2F and NFYA have previously been noted for a limited number of cases in the promoters of cell cycle genes 9. Here we uncover the co-occurrence throughout a large number (95 %) of E2F1-E2F4 target genes (Table B below). For instance, it is found upstream of the gene coding for the M-phase inducer phosphatase 1 (Supplementary Fig. 5d). Sequences pulled down in the SOX2 ChIP experiment in ES cells also show over-representation of the POU5F1 (OCT4) binding site. This result is in accordance with a previous study 10 that shows a notable overlap between the target genes of POU5F1 (OCT4) and SOX2. Furthermore, our data suggest that the POU5F1 immunoprecipitated sequences are also enriched for a PWM that corresponds to Ap1 and Np-y binding sites.
Ensembl Gene ID External Gene ID Description
ENSG00000006634 DBF4 Protein DBF4 homolog
ENSG00000007968 E2F2 Transcription factor E2F2 (E2F-2).
ENSG00000014138 POLA2 DNA polymerase subunit alpha B
ENSG00000029993 HMGB3 High mobility group protein B3
ENSG00000049541 RFC2 Replication factor C subunit 2
ENSG00000051180 RAD51 DNA repair protein RAD51 homolog 1
ENSG00000065243 PKN2 Serine/threonine-protein kinase N2
ENSG00000070761 C16orf80 transcription factor IIB
ENSG00000071794 SMARCA3 SWI/SNF-related matrix-associated
ENSG00000072571 HMMR Hyaluronan mediated motility receptor
ENSG00000076003 MCM6 DNA replication licensing factor MCM6
ENSG00000076242 MLH1 DNA mismatch repair protein Mlh1
ENSG00000076248 UNG Uracil-DNA glycosylase
ENSG00000077152 UBE2T Ubiquitin-conjugating enzyme E2 T
ENSG00000079459 FDFT1 Squalene synthetase
ENSG00000079616 KIF22 Kinesin-like protein KIF22
ENSG00000080986 KNTC2 kinetochore associated 2
ENSG00000082641 NFE2L1 Nuclear factor erythroid 2-related factor 1
ENSG00000085840 ORC1L Origin recognition complex subunit 1
ENSG00000085999 RAD54L DNA repair and recombination protein RAD54-like
ENSG00000090889 KIF4A Chromosome-associated kinesin KIF4A
ENSG00000094804 CDC6 Cell division control protein 6 homolog
ENSG00000094916 CBX5 Chromobox protein homolog 5
ENSG00000095002 MSH2 DNA mismatch repair protein Msh2
ENSG00000100297 MCM5 DNA replication licensing factor MCM5
ENSG00000100714 MTHFD1 C-1-tetrahydrofolate synthase, cytoplasmic
ENSG00000101138 CSTF1 Cleavage stimulation factor 50 kDa subunit
ENSG00000105011 ASF1B anti-silencing function 1B
ENSG00000112118 MCM3 DNA replication licensing factor MCM3
ENSG00000112242 E2F3 Transcription factor E2F3
ENSG00000112526 BRD2_HUMAN Bromodomain-containing protein 2
ENSG00000112742 TTK Dual specificity protein kinase TTK
ENSG00000113810 SMC4_HUMAN Structural maintenance of chromosomes 4
ENSG00000114491 UMPS Uridine 5'-monophosphate synthase
ENSG00000115053 NCL Nucleolin (Protein C23)
ENSG00000117318 ID3 DNA-binding protein inhibitor ID-3
ENSG00000117650 NEK2 Serine/threonine-protein kinase Nek2
ENSG00000119718 EIF2B2 Translation initiation factor eIF-2B subunit beta
ENSG00000120802 TMPO Lamina-associated polypeptide 2
ENSG00000121031 PRKDC DNA-dependent protein kinase catalytic subunit
ENSG00000121774 KHDRBS1 KH domain-containing protein 1
ENSG00000123975 CKS2 Cyclin-dependent kinases regulatory subunit 2
ENSG00000127184 COX7C Cytochrome c oxidase polypeptide
ENSG00000128951 DUT Deoxyuridine 5'-triphosphate nucleotidohydrolase
ENSG00000130175 PRKCSH Glucosidase 2 subunit beta precursor
ENSG00000132646 PCNA Proliferating cell nuclear antigen
ENSG00000133119 RFC3 Replication factor C subunit 3
ENSG00000135069 PSAT1 Phosphoserine aminotransferase
ENSG00000135341 MAP3K7 Mitogen-activated protein kinase kinase kinase 7
ENSG00000136560 TANK TRAF family member
ENSG00000136738 STAM Signal transducing adapter molecule 1
ENSG00000136997 MYC Myc proto-oncogene protein
ENSG00000139146 FAM60A Protein FAM60A
ENSG00000139687 RB1 Retinoblastoma-associated protein
ENSG00000143933 CALM2 Calmodulin
ENSG00000145386 CCNA2 Cyclin-A2
ENSG00000146143 PRIM2A DNA primase large subunit
ENSG00000148773 MKI67 Antigen KI-67
ENSG00000149554 CHEK1 Serine/threonine-protein kinase Chk1
ENSG00000154473 BUB3 Mitotic checkpoint protein BUB3
ENSG00000156475 PPP2R2B Serine/threonine-protein phosphatase 2A
ENSG00000158691 ZNF96 Zinc finger protein 96
ENSG00000161547 SFRS2 Splicing factor, arginine/serine-rich 2
ENSG00000164032 H2AFZ Histone H2A.Z (H2A/z)
ENSG00000164045 CDC25A M-phase inducer phosphatase 1 see fig S5d
ENSG00000166037 CEP57 Centrosomal protein of 57 kDa
ENSG00000166803 KIAA0101 PCNA-associated factor
ENSG00000166851 PLK1 Serine/threonine-protein kinase PLK1
ENSG00000167900 TK1 Thymidine kinase, cytosolic
ENSG00000167978 SRRM2 Serine/arginine repetitive matrix protein 2
ENSG00000168003 SLC3A2 4F2 cell-surface antigen heavy chain (4F2hc)
ENSG00000170312 CDC2 Cell division control protein 2 homolog
ENSG00000172939 OXSR1 Serine/threonine-protein kinase OSR1
ENSG00000180573 HIST1H2AC Histone H2A type 1-C.
ENSG00000183155 RABIF Guanine nucleotide exchange factor MSS4
ENSG00000183558 HIST2H2AA4 Histone H2A type 2-A (H2A.2).
ENSG00000189403 HMGB1 High mobility group protein B1
ENSG00000196747 HIST1H2AI Histone H3.1
ENSG00000197302 Q7Z2F6_HUMAN Zinc finger protein 720
ENSG00000197905 TEAD4 Transcriptional enhancer factor TEF-3
ENSG00000198901 PRC1 Protein regulator of cytokinesis 1
Table B. E2F1-E2F4 bound genes with both the E2F1-E2F4 and NFYA binding sites within 1 kb upstream of the transcription start site in Homo sapiens.
CpG enrichment
When analysing the entire promoter (8 kb upstream and 2 kb downstream of the TSS) of the target genes of SOX2, POU5F1 and NANOG, we found that these regions are strongly enriched in CpG dinucleotides, a very low complexity signal linked to accessible chromatin. Indeed, it has been shown that CpG dinucleotides can be methylated and consequently tend to undergo deamination, resulting in a cytosine to thymine transition 11. The CpG status in different parts of the genome therefore reflects the landscape of methylation in the germ line cells. CpG methylation in promoter regions has been shown to be associated with the repression of transcription 11. Hypomethylation of CpG in the promoters of POU5F1, SOX2 and NANOG target genes thus suggests that these promoters are active in the germline. This finding needs further investigation to address the causality of the binding of SOX2, POU5F1 and NANOG to hypomethylated promoters.
References:
1. Cao, Y. et al. Global and gene-specific analyses show distinct roles for Myod and Myog at a common set of promoters. Embo J 25, 502-11 (2006).
2. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
3. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).
4. Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).
5. MacIsaac, K. D. et al. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7, 113 (2006).
6. Hagerman, P. J. Sequence-directed curvature of DNA. Annu Rev Biochem 59, 755-81 (1990).
7. Jensen, L. J. & Knudsen, S. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics 16, 326-33 (2000).
8. Lorenz, M. C., Cutler, N. S. & Heitman, J. Characterization of alcohol-induced filamentous growth in Saccharomyces cerevisiae. Mol Biol Cell 11, 183-99 (2000).
9. Matuoka, K. & Yu Chen, K. Nuclear factor Y (NF-Y) and cellular senescence. Exp Cell Res 253, 365-71 (1999).
10. Boyer, L. A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).
11. Kass, S. U., Pruss, D. & Wolffe, A. P. How does DNA methylation repress transcription? Trends Genet 13, 444-9 (1997).
12. Bergstrom, D. A. et al. Promoter-specific regulation of MyoD binding and signal transduction cooperate to pattern gene expression. Mol Cell 9, 587-600 (2002).
13. Schreiber, J. et al. Coordinated binding of NF-kappaB family members in the response of human cells to lipopolysaccharide. Proc Natl Acad Sci U S A 103, 5899-904 (2006).
14. Zhang, X. et al. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc Natl Acad Sci U S A 102, 4459-64 (2005).
15. Odom, D. T. et al. Control of pancreas and liver gene expression by HNF transcription factors. Science 303, 1378-81 (2004).
16. Palomero, T. et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci U S A 103, 18261-6 (2006).
Supplementary Methods
Motif description
To perform a thorough search for motifs judged significant by our objective function
(see below for definition), the previously published algorithm 1 was used with minor
changes. The input to the initial deterministic pattern scan is a set of foreground
sequences, containing the potential signals of interest, and a set of random
background sequences. As the composition of the background sequence set is of great
importance, the user may supply it; otherwise, pre-created sets of randomly sampled
genomic sequences are available. The initial pattern scan explores all words
containing up to a maximum number of degenerate IUPAC nucleotide characters,
excluding the three-fold degenerate characters (HBVD). To approximate the
difference in information content between the four-fold degenerate character (N) and
the two-fold degenerate characters (RYWSMK), each two-fold degenerate character is
scored as one and N as two; the resulting degeneracy sum is thus the
limiting factor in pattern enumeration. The pattern search is not limited by word
length, but enumeration is stopped when the total number of occurrences (including
overlaps) falls below a prescribed threshold.
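The degeneracy scoring described above can be written down in a few lines. This is a sketch of the scoring rule only: the function names and the `max_degeneracy` parameter are hypothetical, and the actual limit used by Trawler is not stated here.

```python
PLAIN = set("ACGT")        # non-degenerate characters, score 0
TWO_FOLD = set("RYWSMK")   # two-fold degenerate characters, score 1
FOUR_FOLD = {"N"}          # four-fold degenerate character, score 2
EXCLUDED = set("HBVD")     # three-fold degenerate characters, not explored


def degeneracy_sum(pattern):
    """Sum of per-character degeneracy scores for an IUPAC pattern."""
    total = 0
    for ch in pattern.upper():
        if ch in PLAIN:
            continue
        if ch in TWO_FOLD:
            total += 1
        elif ch in FOUR_FOLD:
            total += 2
        elif ch in EXCLUDED:
            raise ValueError(f"three-fold degenerate character not allowed: {ch}")
        else:
            raise ValueError(f"not an IUPAC nucleotide character: {ch}")
    return total


def is_allowed(pattern, max_degeneracy):
    """True if the pattern stays within the degeneracy limit (hypothetical name)."""
    return degeneracy_sum(pattern) <= max_degeneracy


scores = {p: degeneracy_sum(p) for p in ("CAGCTG", "MCAGCTGG", "NCAGYTG")}
```

With this rule a single N costs as much as two two-fold characters, so a pattern such as NCAGYTG (score 3) is excluded under a limit of 2 even though it has only two degenerate positions.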
Over-representation assessment
The over-representation of these discrete pattern motifs (discrete-motifs) is assessed
by comparing the number of occurrences of each motif in the sample sequences with that
in a background sequence set composed of a much larger number of randomly picked
sequences. This is done using the standard normal approximation to the binomial
to calculate a z-score. The function does not correct for the effects of overlapping
instances; experiments showed that this has no substantial effect on the results while
making the score much faster to compute (data not shown). To ascertain a sensible
threshold for z-score significance, a randomisation procedure can be run repeatedly
(default 20 runs) and plotted, and the mean score plus two standard deviations is taken
as the default threshold. In each run, random sequences are sampled with replacement
and used in place of the foreground sequences. The given pattern set is then evaluated
and the highest-scoring pattern is taken as the result of the run. Though this has
certain heuristic qualities, as an approximation scheme it behaves robustly.
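As a rough illustration of the scoring described above, the following Python sketch computes a z-score from the normal approximation to the binomial and derives a threshold from the best scores of the randomisation runs. It assumes that the background frequency of a motif is estimated as background hits divided by background trials, and that a "trial" is one candidate position; both are assumptions, as is the use of the sample standard deviation. The function names are hypothetical.

```python
import math
import statistics


def z_score(sample_hits: int, sample_trials: int,
            background_hits: int, background_trials: int) -> float:
    """z-score of the sample count under a binomial model (normal approximation).

    Assumption: the per-trial probability is estimated from the background set.
    """
    p = background_hits / background_trials
    expected = sample_trials * p
    sd = math.sqrt(sample_trials * p * (1.0 - p))
    if sd == 0.0:
        return 0.0                        # motif never (or always) seen in background
    return (sample_hits - expected) / sd


def default_threshold(best_scores: list[float]) -> float:
    """Mean plus two standard deviations of the top score of each random run."""
    return statistics.mean(best_scores) + 2.0 * statistics.stdev(best_scores)
```

A motif would then be kept if its z-score on the real sample exceeds `default_threshold` computed from the highest-scoring pattern of each randomisation run.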
Pattern enumeration is performed depth first using a suffix tree. The suffix tree
implementation is based upon the memory efficient scheme 2. This scheme has the
advantage that large volumes of sequence data (more than 100 megabases) can be
stored on current desktop machines with only 1.5 gigabytes of memory. The output of
this deterministic pattern scan is then the complete set of significant patterns.
Clustering step
To decompose the set of overlapping and redundant putative motifs into a set of
position weight matrices (PWMs), a greedy cluster-finding algorithm was applied
to an undirected graph of motifs (the vertices) and their similarities (the edges). Edges
are created in an all-against-all procedure according to the degree of sequence overlap
(which must be more than 70 percent) of motif instances in the sample sequences.
Hits on both strands are analysed to take into account reverse-complemented motifs.
Each connected subgraph forms a family, and the families are further resolved into
clusters using essentially two methods: [1] The most connected node and its directly
connected nodes are removed from the graph to form one cluster. The operation is
iterated until no cluster can be found. Clusters from the same initial subgraph
constitute a family. [2] The percentage overlap for all the motifs constituting a family
is calculated as described above. The resulting 2D matrix is clustered using one of the
following distance functions: correlation, absolute value of the correlation, uncentered
correlation, absolute uncentered correlation, Spearman’s rank correlation, Kendall’s
tau, Euclidean distance, city-block distance. The SOM algorithm is then performed 3.
The operation is repeated until the initial connected graph contains no connected
nodes anymore.
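Method [1] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Trawler implementation: pairwise overlap fractions are taken as precomputed input, the cut-off is applied as "70 percent or more", ties between equally connected nodes are broken alphabetically, and nodes left without neighbours become single-member clusters. All function names are hypothetical.

```python
from collections import defaultdict


def build_graph(overlaps: dict, cutoff: float = 0.70) -> dict:
    """Link two motifs when their overlap fraction reaches the cut-off."""
    graph = defaultdict(set)
    for (a, b), fraction in overlaps.items():
        graph[a]                          # make sure unlinked motifs are still nodes
        graph[b]
        if a != b and fraction >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def families(graph: dict) -> list:
    """Connected components of the motif graph; each one is a family."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def greedy_clusters(graph: dict, family: set) -> list:
    """Repeatedly remove the most connected node together with its neighbours."""
    remaining, clusters = set(family), []
    while remaining:
        hub = max(sorted(remaining), key=lambda n: len(graph[n] & remaining))
        cluster = {hub} | (graph[hub] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```

With this sketch, a family in which motif A overlaps B and C, C overlaps D, and D overlaps E is resolved into two clusters, {A, B, C} and {D, E}.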
Conservation analysis
In order to locate the functional instances of the resulting PWM(s), pairwise
alignments of the sample sequences with their orthologous sequences are produced using
Blastz 4 with relaxed parameters (K=1000 C=2 P=1 W=6 T=0). Motifs that match the
PWM(s) found are located in the resulting multiple alignments and potential positions
are ranked according to the conservation score. The conservation score is calculated
as the number of species, other than the reference, in which the motif is found at a
particular position in the alignment. For example, if the motif is conserved within 4
species including the reference sequence, the conservation score will be 3 (4 minus
the reference sequence).
For the relative conservation score: Let $m_{r,k}$ be the motif at position $k$ in the reference
species $r$. Let $l$ be the length of the motif and $L$ the total length of the alignment.
$P_{m,s,k} = 1$ if the motif $m$ is present in species $s$ at position $k$ in the alignment and
$P_{m,s,k} = 0$ otherwise. The absolute score for motif $m_{r,k}$ is:
$$S_{m_{r,k}} = \sum_{s \neq r} P_{m,s,k}$$
The average score for the alignment is:
$$A_r = \frac{\sum_{i=1}^{L-l+1} S_{m_{r,i}}}{L-l+1}$$
The relative score for the motif $m_{r,k}$ is:
$$RS_{m_{r,k}} = \frac{S_{m_{r,k}}}{A_r}$$
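The three scores can be computed directly from a presence table, as in the following Python sketch. It assumes that motif presence has already been determined for each species at each of the L − l + 1 start positions of the alignment; the input layout (a dictionary mapping species name to a list of 0/1 values) and the function names are hypothetical. Returning zeros when the average score is zero is also an assumption, made only to avoid a division by zero.

```python
def absolute_scores(presence: dict, reference: str) -> list:
    """Absolute score per position: number of non-reference species with the motif."""
    n_positions = len(presence[reference])
    return [
        sum(row[k] for species, row in presence.items() if species != reference)
        for k in range(n_positions)
    ]


def relative_scores(presence: dict, reference: str) -> list:
    """Absolute score divided by the average absolute score over the alignment."""
    scores = absolute_scores(presence, reference)
    average = sum(scores) / len(scores)
    if average == 0:
        return [0.0] * len(scores)        # assumption: no conservation anywhere
    return [score / average for score in scores]


# Toy alignment with four species and four candidate start positions.
presence = {
    "human": [1, 0, 1, 0],   # reference species
    "mouse": [1, 0, 0, 0],
    "rat":   [1, 0, 1, 0],
    "dog":   [1, 0, 0, 0],
}
print(absolute_scores(presence, "human"))
print(relative_scores(presence, "human"))
```

In this toy example the motif at the first position is present in all four species, so its absolute score is 3, in agreement with the worked example given in the text.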
Web interface
The complete analysis, including the alignments, is summarized on a web interface
that allows the user to navigate in a sequence-centric or motif-centric view. Upon
clicking on a specific PWM, all information concerning the PWM is displayed, including
the locations of the sequence instances used to build the PWM. The over-represented
motifs can also be visualized according to the sequence in which they occur. The
motifs are highlighted within the sequence of interest, or within a multiple sequence
alignment if homologs are provided. Alignments are displayed using the Jalview
program 5, and the PWM logos were drawn using the Weblogo program 6.
Trawler parameters
Trawler default parameters are the following: minimum motif instances in the sample
sequences (10); maximum number of mismatches (2); minimum motif length (6 base
pairs (bp)); maximum motif length (20 bp). For all the ChIP analyses the default
parameters were used unless otherwise stated. For the first assessment 7, due to the
small sample size, the minimum number of motif instances in the sample sequences was
lowered to 3; the other parameters remained the same.
Secondary motif assessment
In order to test whether an over-represented secondary motif also co-occurs with the
primary motif, the sequence around the primary motif was fetched (200 bp for yeast,
1000 bp for mammals) and over-representation was assessed. If the secondary motif was
found over-represented in the vicinity of the primary motif, then both motifs also
co-occur at the sequence level. In yeast, of the 54 data sets correctly analysed by
Trawler, 19 (35.2 %) have secondary motifs (Supplementary Fig. 4). For the mammalian
data sets, 4 out of 10 (40 %) have secondary motifs (Supplementary Fig. 5).
Algorithms used for speed comparison
The algorithms were tested on a GNU/Linux operating system with an i686 processor. In
all cases, the algorithms were run with parameters that were the closest to the Trawler
default parameters used for the mammalian analysis.
AlignACE: The program was run with the default parameters and with two different
values for the number of columns to align (n=5 and n=10).
Meme: The program was run with the following parameters: -mod anf -nmotifs 100 -minw 3 -maxw 20 -revcomp
Motifcut: The parameters were: motif size 6-mers / cluster number 3.
Weeder: The program was run with the “small” option (quick mode, searches for
motifs of length 6 bp and 8 bp) and the “large” option (searches for motifs of length 6
bp, 8 bp, 10 bp and 12 bp).
For all programs, the graphical output of the PWM was drawn using the SeqLogo
program 6.
Sequences retrieval and analysis
S. cerevisiae analysis:
Sequences corresponding to the different data sets as well as the background were
downloaded from the website (http://fraenkel.mit.edu/Harbison/release_v24/, FASTA-formatted
files) and masked for low-complexity repeats using RepeatMasker
(http://repeatmasker.org). The background sequences used correspond to sequences
for all probes on the 6k microarray
(http://fraenkel.mit.edu/Harbison/release_v24/yeast_Young_6k.fsa) and were also
masked for low-complexity repeats using RepeatMasker. All the experiments were
analysed by Trawler using the default parameters.
Mammalian analysis:
E2F1 – E2F4 analysis: The human EnsEMBL IDs of the genes in the previously published
data of Ren et al. 8 were retrieved, and the 700 bp upstream and 200 bp downstream
sequences of the annotated gene start sites (EnsEMBL version 38 9) were repeat and exon
masked. The sample sequences correspond to the promoter regions of genes in their
Table 3 and the background sequences correspond to the promoter regions of genes in
their supplemental research data
(http://www.genesdev.org/cgi/content/full/16/2/245/DC1).
For Fig. 1b, a randomized data set composed of the same number of sequences as in
the sample set was produced by randomly picking sequences from the background.
Trawler was run with the default parameters.
Myod/Myog: The mouse genes targeted by either Myod or Myog in both MDER and
C2C12 cells are listed at
(http://www.nature.com/emboj/journal/v25/n3/extref/7600958s3.pdf). The 750 bp
upstream and 250 bp downstream sequences of the annotated transcription start site
(TSS) (EnsEMBL version 38 9) were repeat and exon masked.
HNF4A and ONECUT1 (HNF6): All the data were downloaded from
(http://jura.wi.mit.edu/young_public/autoregulation/downloaddata.html). The 750 bp
upstream and 250 bp downstream sequences of the annotated TSS (EnsEMBL version
38 9) were repeat and exon masked.
POU5F1 (OCT4), SOX2 and NANOG analysis: Three data sets coming from the
same experiment 10 were analysed. First, the exact loci bound by the transcription
factor (TF) were extracted from MacIsaac et al. 11. Secondly, the
entire promoter regions (10 kb) that include the bound loci were also analysed
using the data from 10.
NOTCH1: The target genes were taken from the supporting Table 1 of 12 and the 3 kb
upstream human sequences were retrieved from EnsEMBL 41 (repeat masked
sequences). The background corresponds to 3 kb sequences upstream of 2000
randomly picked genes. Trawler was run with the default parameters, except that the
minimum number of motif occurrences was set to 20.
CREB1: The data were downloaded from 16. The first 200 genes recorded in their
Supplementary Table 6 were used. The 750 bp upstream and 250 bp downstream
sequences of the annotated TSS (EnsEMBL version 38 9) were repeat and exon
masked.
NF-kappa-B: The data were downloaded from (http://web.wi.mit.edu/young/nfkb/).
The 700 bp upstream and 200 bp downstream sequences of the annotated TSS
(EnsEMBL version 38 9) were repeat and exon masked.
The regions used in the Trawler analysis correspond to the regions described in the
corresponding ChIP study. The background corresponds to the upstream region of
appropriate length of 2000 randomly picked genes. Trawler was run with the default
parameters.
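The promoter windows quoted above (for example 750 bp upstream and 250 bp downstream of the annotated TSS) depend on the strand of the gene. The following Python sketch shows one way to derive such a window; the function name, the 0-based half-open coordinate convention and the clipping at the start of the chromosome are assumptions made for illustration and are not taken from the Trawler pipeline.

```python
def promoter_window(tss: int, strand: str,
                    upstream: int = 750, downstream: int = 250) -> tuple:
    """Return (start, end) of the promoter window around a TSS.

    Assumptions: 0-based, half-open genomic coordinates; "upstream" is taken
    relative to the direction of transcription.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return max(start, 0), end             # do not run off the chromosome start


print(promoter_window(10_000, "+"))            # 750 bp upstream, 250 bp downstream
print(promoter_window(10_000, "-"))            # mirrored for the reverse strand
print(promoter_window(10_000, "+", 700, 200))  # window used for E2F and NF-kappa-B
```

Sequences extracted from the reverse strand would additionally need to be reverse complemented before motif scanning.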
References:
1. Ettwiller, L. et al. The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biol 6, R104 (2005).
2. Kurtz, S. Reducing the space requirements of suffix trees. Software: Practice and Experience 29, 1149-71 (1999).
3. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).
4. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
5. Clamp, M., Cuff, J., Searle, S. M. & Barton, G. J. The Jalview Java alignment editor. Bioinformatics 20, 426-7 (2004).
6. Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res 14, 1188-90 (2004).
7. Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23, 137-44 (2005).
8. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).
9. Birney, E. et al. Ensembl 2006. Nucleic Acids Res 34, D556-61 (2006).
10. Boyer, L. A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).
11. Macisaac, K. D. et al. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 22, 423-9 (2006).
12. Palomero, T. et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci U S A 103, 18261-6 (2006).