
TRAWLER: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation

Laurence Ettwiller, Benedict Paten, Mirana Ramialison, Ewan Birney & Joachim Wittbrodt

Supplementary figures and text:

Supplementary Figure 1. Details of Trawler’s procedure.

Supplementary Figure 2. Details of the assessment of Trawler.

Supplementary Figure 3. Analysis of the secondary motifs.

Supplementary Figure 4. Analysis of the secondary motifs.

Supplementary Figure 5. Instances of a secondary motif over-representation.

Supplementary Table 1. ChIP experiments used in this study.

Supplementary Table 2. Comparison of motifs found by Trawler and other motif discovery algorithms on the mammalian data set.

Supplementary Notes

Supplementary Methods

[Figure panels. (a) Histogram of motif z-scores for the Random and Sample sets; x axis: standard deviation (under-representation to over-representation); y axis: Nb of motifs. (b) Over-represented motifs in the Myog pulled down loci clustered using the Myog pulled down loci; graph representation (70 % cut-off); family 1 resolves into cluster 1 and cluster 2. (c) Over-represented motifs in the Myog pulled down loci clustered using randomly picked sequences; graph representation (70 % cut-off); family 1 forms a single cluster 1.]

Supplementary Figure 1. Details of Trawler’s procedure.

(a) Example of the distribution of z-scores in the sample set (red) and the random set (black). The sequences used here are derived from the E2F1-E2F4 data sets. Only motifs with a score above the limit of the score distribution in the random sequences are considered significant. (b) Motif clustering procedure. For each pair of motifs, the overlap on the sample sequences is calculated (as a percentage), and motifs that overlap 70 percent of the time or more are linked to form an undirected graph. All motifs from a connected sub-graph are part of the same family and are further resolved into cluster(s). In the Myog example, one family is found that can be further resolved into two clusters. (c) Same as (b), but the sequences used are randomly picked human sequences (1 kb upstream of randomly picked genes).

Supplementary Figure 2. Details of the assessment of Trawler.

(a) Left graph: assessment of Trawler’s performance relative to the other available tools; correlation coefficient by species. Right graph: different measures of accuracy (sSn, site sensitivity; sPPV, site positive predictive value; sASP, overall average site performance). Trawler was run blindly, without prior knowledge of the true motifs. (b) Detailed table of Fig. 2 showing, for each pull-down experiment, the ability of individual programs to uncover the correct binding site (BS) in yeast. For each individual ChIP experiment, the success or failure of 7 different algorithms, including Trawler, is shown. The results from the 6 other algorithms come from Harbison et al.

[Figure: sequence logos of the motifs found by Trawler for each yeast ChIP data set. The notes accompanying each data set are listed below.]

The Abf1 motif as described by Harbison et al. is found over-represented together with the pA-pT motif and an unknown motif. The pA-pT motif also co-occurs with the Abf1 motif at the sequence level.
The Aft2 motif occurs multiple times per sequence.
The Bas1 motif occurs multiple times per sequence.
The Cad1 motif does not often occur multiple times per sequence.
The Cbf1 motif occurs multiple times per sequence.
The Cin5 motif occurs multiple times per sequence.
The Dal82 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is also found in the Pdr1 data set and seems to correspond to the Reb1 or Nrg1 motif.
The Dig1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Tec1 motif. Additionally, the Tec1 motif co-occurs with the Dig1 motif at the sequence level.
The Fhl1 motif occurs multiple times per sequence. The Fhl1 motif is the same as the Rap1 motif.
The Fkh1 motif occurs multiple times per sequence.
The Fkh2 motif occurs multiple times per sequence.
The Gat1 motif does not often occur multiple times per sequence.
The Gcn4 motif occurs multiple times per sequence.
The Gln3 motif occurs multiple times per sequence.
The Hap1 motif occurs multiple times per sequence.
The Hsf1 motif occurs multiple times per sequence.
The Ino2 motif is found over-represented with 2 unknown motifs, but they do not often co-occur with Ino2.
The Ino4 motif occurs multiple times per sequence.
The Leu3 motif does not often occur multiple times per sequence.
The Mbp1 motif occurs multiple times per sequence.
The Mcm1 motif is found over-represented with one other unknown motif, but it does not often co-occur with the Mcm1 motif.
The Met4 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Pho4/Cbf1 motif. Additionally, the Pho4/Cbf1 motif co-occurs with the Met4 motif at the sequence level.
The Msn2 motif is found over-represented with 3 other unknown motifs, but they do not often co-occur with the Msn2 motif.
The Nrg1 motif occurs multiple times per sequence.
The Pdr1 motif is found over-represented with 2 other unknown motifs and a motif that corresponds to the Nrg1/Reb1 motif, but they do not often co-occur with the Pdr1 motif.
The Phd1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is unknown. This unknown motif does not co-occur with the Met4 motif at the sequence level.
The Pho2 motif as described by Harbison et al. is found over-represented together with the Pho4/Cbf1 motif and an unknown motif. The Pho4/Cbf1 motif also co-occurs with the Pho2 motif at the sequence level.
The Pho4 motif occurs multiple times per sequence.
The Rap1 motif occurs multiple times per sequence. The Rap1 motif is the same as the Fhl1 motif.
The Rcs1 motif occurs multiple times per sequence.
The Rds1 motif occurs multiple times per sequence.
The Reb1 motif occurs multiple times per sequence.
The Rfx1 motif does not often occur multiple times per sequence.
The Rpn4 motif occurs multiple times per sequence.
The Sfp1 motif occurs multiple times per sequence.
The Sip4 motif is shown together with two unknown motifs (unknown motif 1 and unknown motif 2).
The Skn7 motif occurs multiple times per sequence.
The Snt2 motif occurs multiple times per sequence.
The Sok2 motif is found over-represented with 3 other unknown motifs, two of which seem to be variants of the Sok2 motif.
The Spt23 motif is found over-represented with 3 other unknown motifs, one of which (motif 1) seems to be a variant of the Spt23 motif.
The Stb1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is unknown and does not often co-occur with the Stb1 motif.
The Stb4 motif occurs multiple times per sequence.
The Stb5 motif as described by Harbison et al. is the fourth best motif found by Trawler. The other motifs are unknown and do not often co-occur with the Stb5 motif.
The Sum1 motif occurs multiple times per sequence.
The Ste12 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Tec1 motif. Additionally, the Tec1 motif co-occurs with the Ste12 motif at the sequence level (same scenario as Dig1).
The Sut1 motif occurs multiple times per sequence.
The Swi4 motif occurs multiple times per sequence.
The Swi6 motif occurs multiple times per sequence.
The Tec1 motif as described by Harbison et al. is found over-represented together with the Ste12/Dig1 motif and an unknown motif. The Ste12/Dig1 motif also co-occurs with the Tec1 motif at the sequence level.
The Tye7 motif is found over-represented with one other unknown motif, but it does not often co-occur with the Tye7 motif at the sequence level.
The Ume6 motif occurs multiple times per sequence.
The Yap1 motif does not often occur multiple times per sequence.
The Yap7 motif occurs multiple times per sequence.
The Ydr026C motif occurs multiple times per sequence.

Supplementary Figures 3–4. Analysis of the secondary motifs.

In yeast, of the 54 data sets that have been correctly analysed by Trawler, 19 have secondary motifs (35.2 %). The total number of secondary motifs is 35 for the yeast data sets.

[Figure: sequence logos of the motifs found by Trawler (first motif, second motif, ...) for each mammalian ChIP data set: Myod1, E2F1-E2F4, Myog, NF-kappa-B, HNF4, ONECUT1 (HNF6), NOTCH1, SOX2, POU5F1 and CREB1. The notes accompanying each data set are listed below.]

The Myod1 motif occurs multiple times.
E2F1-E2F4 data set (E2F1-E2F4 motif and NFYA motif): no co-occurrence of the NFYA motif with the E2F1-E2F4 motif.
The Myog motif occurs multiple times.
The NF-kappa-B motif does not occur multiple times.
The HNF4 motif occurs multiple times.
The ONECUT1 motif occurs multiple times.
The NOTCH1 motif occurs multiple times.
The POU5F1 (OCT4) and SOX2 motifs occur multiple times; the other over-represented motifs do not co-occur with either POU5F1 (OCT4) or SOX2.
The CREB1 motif occurs multiple times.

Supplementary Figure 4. Analysis of the secondary motifs.

For the mammalian data set, 4 data sets out of 9 (44 %) have secondary motifs. The total number of secondary motifs is 13.

[Figure panels (a) to (d). Logo labels: E2F known PWM (TRANSFAC ID: M00050); NFYA known PWM (TRANSFAC ID: M00185); over-represented motif family 4; over-represented motif family 8; Tec1 binding site as described by Harbison et al.; Tec1 binding site found by Trawler; Ste12 binding site as described by Harbison et al.; Ste12 binding site found by Trawler. Alignment labels: NFYA, E2F1; Ste12, Tec1.]

Supplementary Figure 5. Instances of a secondary motif over-representation.

(a) In yeast, the Ste12 BS is found over-represented together with the Tec1 BS in the Tec1 ChIP experiments. Both PWMs found by Trawler are compared with the PWMs previously described. (b) Example of co-occurrence at the sequence level of the Tec1 and Ste12 BSs in a region of the promoter of the glutamine-fructose-6-phosphate amidotransferase gene (YKL104C) that is conserved between S. paradoxus, S. mikatae and S. bayanus. (c) In human, the NFYA binding motif is found over-represented together with the E2F1-E2F4 binding motif in the E2F1-E2F4 ChIP experiment. The PWMs found by Trawler are compared with the PWM previously described. (d) Example of co-occurrence of E2F1-E2F4 and NFYA BSs in the intergenic region upstream of the human CDC25A gene. Both sites are conserved from human to opossum. See also Supplementary Notes for more details.

Supplementary Table 1. ChIP experiments used in this study.

Transcription factor(s) | Species | Platform | Reference in suppl. notes | Tissues/growth
52 datasets | D. melanogaster, S. cerevisiae, M. musculus, H. sapiens | - | Tompa et al. 1 | -
203 yeast transcriptional regulators | S. cerevisiae | PCR product array | Harbison et al. 4 | Different conditions and media
E2F1 and E2F4 | H. sapiens | 1.5K and Affymetrix DNA microarrays | Ren et al. 3 | Cell culture (WI-38 cells)
SOX2 | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
POU5F1 (OCT4) | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
NANOG | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
Myod1 | M. musculus | In house primer arrays | Cao et al. 1 | Mouse embryo fibroblast 12
Myog | M. musculus | In house primer arrays | Cao et al. 1 | Mouse embryo fibroblast 12
NF-kappa-B | H. sapiens | PCR product array | Schreiber et al. 13 | Human U937 cells
CREB1 | H. sapiens | PCR product array | Zhang et al. 14 | HEK293T cells and hepatocytes
HNF4A and ONECUT1 | H. sapiens | PCR product array | Odom et al. 15 | Hepatocytes and pancreatic islets
NOTCH1 | H. sapiens | PCR product array | Palomero et al. 16 | T-ALL cell

Supplementary Table 2. Comparison of motifs found by Trawler and other motif discovery algorithms on the mammalian data set.

Columns: dataset | literature | Trawler | Weeder large | Weeder small | AlignACE n=10 | AlignACE n=5 | Meme | Motifcut

[Table cells shown as sequence logos are not reproduced; only the text entries of each row are listed.]

Creb: program stopped; program stopped; program stopped
E2f
Hnf4
Hnf6: ATCGAT; NO MOTIF FOUND
Myod: NO MOTIF FOUND; program stopped
Myog: CAGCTG; NO MOTIF FOUND
Nfkb: -
Notch: TGGGAA; NO MOTIF FOUND
Oct4: ATTTGCAT; program stopped; program stopped; program stopped; program stopped
Sox2: C[AT]TTGTT; program stopped; program stopped; program stopped; program stopped

The heights of the sequence logos should not be compared from one motif to another, since the scale was reduced in cases of multiple motifs. « Program stopped » indicates experiments cancelled after more than 4 days of run time without producing an output.

Supplementary Notes

Example of clustering of discrete motifs

The clustering of over-represented motifs in the Myog data set illustrates the clustering procedure (Supplementary Fig. 1). One family was found that can be further resolved into two clusters. The resulting matrices differ in the definition of the flanking sequences, while the core motif remains essentially the same. Interestingly, if the same set of motifs is clustered using randomly picked genomic sequences instead of the Myog pulled-down loci, no clear cluster can be seen. Thus, only in the case of the Myog pulled-down loci is the family resolved into two distinct position weight matrices (PWMs); with the randomly picked sequences, the family is resolved into only one PWM. This result suggests that a clear distinction at the sequence level between the two PWMs is specific to the Myog pulled-down loci, hinting at a biological significance. Indeed, it has been shown that the related TF Myod1 also binds a large fraction of Myog-bound loci 1. The two PWMs found may represent distinct binding sites for the respective TFs. The different clusters uncovered by Trawler provide directions that can be readily taken up and addressed by the experimental biologist.
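The family-building step of this procedure (Supplementary Fig. 1b) can be sketched in Python as follows. This is only an illustration under stated assumptions, not Trawler's code: sites are taken to be (sequence id, start, end) tuples, the overlap between two motifs is assumed to be the share of the smaller site list that overlaps a site of the other motif, and the names `overlap_fraction` and `motif_families` are hypothetical. The further resolution of a family into clusters is not shown.

```python
from itertools import combinations


def sites_overlap(a, b):
    """True if two (seq_id, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_fraction(sites_a, sites_b):
    """Assumed measure: share of the smaller site list overlapping the other."""
    small, large = sorted((sites_a, sites_b), key=len)
    if not small:
        return 0.0
    hits = sum(any(sites_overlap(s, t) for t in large) for s in small)
    return hits / len(small)


def motif_families(sites_by_motif, cutoff=0.70):
    """Link motifs whose overlap is >= cutoff; return the connected components."""
    graph = {motif: set() for motif in sites_by_motif}
    for m1, m2 in combinations(sites_by_motif, 2):
        if overlap_fraction(sites_by_motif[m1], sites_by_motif[m2]) >= cutoff:
            graph[m1].add(m2)
            graph[m2].add(m1)

    families, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        stack, family = [start], set()
        while stack:  # iterative depth-first walk of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            family.add(node)
            stack.extend(graph[node] - seen)
        families.append(family)
    return families


# Toy occurrences (sequence id, start, end); invented for illustration only.
example = {
    "CAGCTG": [("s1", 10, 16), ("s2", 40, 46), ("s3", 5, 11)],
    "RCAGCTG": [("s1", 9, 16), ("s2", 39, 46), ("s3", 4, 11)],
    "TTTGCAT": [("s1", 100, 107)],
}
families = motif_families(example)
```

With the toy data, the two CAGCTG variants fall into one family and the unrelated word forms a family of its own. A different definition of the overlap percentage would change which motifs are linked, so the measure used here should be read as a placeholder.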

Examples of alignments

Alignments were performed for all the promoter-based chromatin immunoprecipitation (ChIP) experiments studied, and the motifs found were mapped to these alignments. The results can be found at http://ani.embl.de/trawler/result_paper/ . To systematically investigate whether potential binding sites are conserved in the immunoprecipitated sequences, pairwise alignments between the reference species and other species at appropriate evolutionary distances were carried out using Blastz 2. Taking the human E2F1-E2F4 ChIP data set 3 as an example, the E2F1-E2F4 and NFYA motifs found by Trawler were mapped back to the resulting multiple alignments and a conservation score was assigned to each motif (see Supplementary Methods for details). The distribution of conservation scores for each motif in the sample was plotted and compared with the distribution of conservation scores for the same motifs in upstream sequences of randomly picked genes. To rule out systematic biases due to a high overall conservation of the sequences in the sample, random motifs were created and the distribution of their conservation scores was plotted in the same way (see Fig. A below).


Figure A. Distribution of absolute conservation scores for 5 different motif families in (a) random and (b) sample sequences (from the E2F1-E2F4 chromatin IP). Two families correspond to E2F1-E2F4 and NFYA binding sites; the 3 other families are random motifs with similar occurrences to the E2F1-E2F4 and NFYA motifs.

Sites corresponding to the E2F1-E2F4 and NFYA motifs in the sample sequences show a pattern of conservation scores very different from that in the randomly picked sequences, with a bimodal distribution of conservation scores present only in the sample sequences. Indeed, a higher number of sites (10.1 % of the E2F1-E2F4 sites and 38.6 % of the NFYA sites) have an absolute conservation score of 4 or above in the sample sequences. Conversely, the random motifs, despite having a slight tendency to be more conserved in the sample sequences, do not show such bimodal distributions. This result shows that a substantial number of E2F1-E2F4 and NFYA like sites are conserved and are likely under purifying selection in the sample sequences, providing an independent validation of Trawler’s prediction. Similar results were obtained with the other data sets tested (data not shown). Conversely, immunoprecipitated sequences also contain a substantial proportion of sites that are not conserved and may either not represent functional sites or be fast-evolving sites. In the example of the E2F1-E2F4 ChIP sequences, non-conserved motifs matching the E2F1-E2F4 PWM account for 13 percent of all the sites, highlighting the importance of distinguishing the most likely functional sites. Considering these findings, we included a last step in the pipeline that consists of sorting the sites according to the conservation score to retain sites likely to be functional. Thus, both at the level of the abstract motif description and at the level of sites, conservation across evolutionary time allows the assessment of functionality.

Details on Trawler’s performance on the yeast data set from Harbison et al.

We compared the results from Trawler to the high-confidence motifs found in other studies 4 5. Sixty-five motifs were previously found by combining the results of six different motif discovery methods 5 with conservation across yeast species. Fifty-four of these motifs (83%) were found by Trawler alone (Fig. 2b, Fig. S3 and Fig. S4). The total number of families found by Trawler is 112 (corresponding to 263 PWMs), representing at least 48.2 % of true positives (20.5 % in terms of PWMs). The remaining 58 families (209 PWMs) that did not match the previously found PWMs can thus represent additional binding sites. On average, Trawler found 1.7 families per data set (112/65) or 4 PWMs per data set (263/65). For comparison, the average number of PWMs per data set is 40 for Converge, 2.479 for Kellis, 35 for Mdscan, 6 for MEME and MEME_c, and 170.44 for AlignACE (from the online supporting files of Harbison et al. v24, http://fraenkel.mit.edu/Harbison/release_v24/); see Table A (below) for details. Weeder was not included in the comparison by Harbison et al.

Algorithm | Number of real motifs (PWM) found | Total experiments | Percentage of positive motifs (PWM) | Average motif (PWM) number | Total motifs (PWM)
Converge | 38 | 65 | 58.4 | 40 | 2600
AlignACE | 38 | 65 | 58.4 | 170.4 | 11076
Kellis | 27 | 65 | 41.5 | 2.4 | 156
Mdscan | 41 | 65 | 63.0 | 35 | 2275
Meme | 36 | 65 | 55.4 | 6 | 390
Meme_c | 38 | 65 | 58.5 | 6 | 390
Trawler | 54 | 65 | 83.1 | 4 | 263

Table A. Summary of the performance of Trawler and other algorithms on the yeast data sets. Only experiments where a known binding site has been found by at least one algorithm (Harbison et al. v24) are used (65 experiments in total).

Example of co-occurrences


The loci bound by Abf1 in yeast contain, in addition to the canonical Abf1 motif, the polyA-polyT motif, which is over-represented and co-occurring. Abf1 is a multifunctional global regulator with possible chromatin-reorganizing and DNA-bending activities, whereas the polyA-polyT motif has been shown to induce an intrinsic curvature of the DNA 6. One possible function of this motif is to bring other regulatory elements into proximity of the Abf1 binding site 7. We also found additional over-represented PWMs specific to given cell states in yeast. For example, the loci bound by Dig1 and Ste12 contain, in addition to the canonical Dig1 and Ste12 sites, over-represented and co-occurring Tec1 binding sites (Supplementary Fig. 5a) under conditions of filamentous growth in haploid cells 8. This result is in accordance with previous studies showing that genes involved in filamentous growth are bound by the Tec1/Ste12/Dig1 complex. The co-occurrence of Ste12 and Tec1 binding sites can also be detected at the sequence level in many loci involved in filamentous growth, including the upstream region of the gfa1 gene, which encodes an enzyme involved in the first step of the chitin biosynthesis pathway (Supplementary Fig. 5b). Over-representation of additional motifs can also be detected in some ChIP experiments from vertebrates. For example, in the E2F1-E2F4 data set 3 analysed, we additionally found the binding site of NFYA (Atf6) over-represented and co-occurring (Supplementary Fig. 5c). Co-occurrences of E2F and NFYA have been previously noted for a limited number of cases in the promoters of cell cycle genes 9. Here we uncover the co-occurrence throughout a large number (95%) of E2F1-E2F4 target genes (Table B below). For instance, it is found upstream of the gene coding for the M-phase inducer phosphatase 1 (Supplementary Fig. 5d). Sequences pulled down in the SOX2 ChIP experiment in ES cells also show over-representation of the POU5F1 (OCT4) binding site. This result is in accordance with a previous study 10 that shows a notable overlap between the target genes of POU5F1 (OCT4) and SOX2. Furthermore, our data suggest that the POU5F1 immunoprecipitated sequences are also enriched for a PWM that corresponds to Ap1 and Np-y binding sites.

Ensembl Gene ID External Gene ID Description

ENSG00000006634 DBF4 Protein DBF4 homolog

ENSG00000007968 E2F2 Transcription factor E2F2 (E2F-2).

ENSG00000014138 POLA2 DNA polymerase subunit alpha B

ENSG00000029993 HMGB3 High mobility group protein B3

ENSG00000049541 RFC2 Replication factor C subunit 2

ENSG00000051180 RAD51 DNA repair protein RAD51 homolog 1

ENSG00000065243 PKN2 Serine/threonine-protein kinase N2

ENSG00000070761 C16orf80 transcription factor IIB

ENSG00000071794 SMARCA3 SWI/SNF-related matrix-associated

ENSG00000072571 HMMR Hyaluronan mediated motility receptor

ENSG00000076003 MCM6 DNA replication licensing factor MCM6

ENSG00000076242 MLH1 DNA mismatch repair protein Mlh1

ENSG00000076248 UNG Uracil-DNA glycosylase

ENSG00000077152 UBE2T Ubiquitin-conjugating enzyme E2 T

ENSG00000079459 FDFT1 Squalene synthetase

ENSG00000079616 KIF22 Kinesin-like protein KIF22

ENSG00000080986 KNTC2 kinetochore associated 2

ENSG00000082641 NFE2L1 Nuclear factor erythroid 2-related factor 1

ENSG00000085840 ORC1L Origin recognition complex subunit 1

ENSG00000085999 RAD54L DNA repair and recombination protein RAD54-like

ENSG00000090889 KIF4A Chromosome-associated kinesin KIF4A

ENSG00000094804 CDC6 Cell division control protein 6 homolog

ENSG00000094916 CBX5 Chromobox protein homolog 5

ENSG00000095002 MSH2 DNA mismatch repair protein Msh2


ENSG00000100714 MTHFD1 C-1-tetrahydrofolate synthase, cytoplasmic

ENSG00000101138 CSTF1 Cleavage stimulation factor 50 kDa subunit

ENSG00000105011 ASF1B anti-silencing function 1B


ENSG00000112242 E2F3 Transcription factor E2F3

ENSG00000112526 BRD2_HUMAN Bromodomain-containing protein 2

ENSG00000112742 TTK Dual specificity protein kinase TTK

ENSG00000113810 SMC4_HUMAN Structural maintenance of chromosomes 4

ENSG00000114491 UMPS Uridine 5'-monophosphate synthase

ENSG00000115053 NCL Nucleolin (Protein C23)

ENSG00000117318 ID3 DNA-binding protein inhibitor ID-3

ENSG00000117650 NEK2 Serine/threonine-protein kinase Nek2

ENSG00000119718 EIF2B2 Translation initiation factor eIF-2B subunit beta

ENSG00000120802 TMPO Lamina-associated polypeptide 2

ENSG00000121031 PRKDC DNA-dependent protein kinase catalytic subunit

ENSG00000121774 KHDRBS1 KH domain-containing protein 1

ENSG00000123975 CKS2 Cyclin-dependent kinases regulatory subunit 2

ENSG00000127184 COX7C Cytochrome c oxidase polypeptide

ENSG00000128951 DUT Deoxyuridine 5'-triphosphate nucleotidohydrolase

ENSG00000130175 PRKCSH Glucosidase 2 subunit beta precursor

ENSG00000132646 PCNA Proliferating cell nuclear antigen

ENSG00000133119 RFC3 Replication factor C subunit 3

ENSG00000135069 PSAT1 Phosphoserine aminotransferase

ENSG00000135341 MAP3K7 Mitogen-activated protein kinase kinase kinase 7

ENSG00000136560 TANK TRAF family member

ENSG00000136738 STAM Signal transducing adapter molecule 1

ENSG00000136997 MYC Myc proto-oncogene protein

ENSG00000139146 FAM60A Protein FAM60A

ENSG00000139687 RB1 Retinoblastoma-associated protein

ENSG00000143933 CALM2 Calmodulin

ENSG00000145386 CCNA2 Cyclin-A2

ENSG00000146143 PRIM2A DNA primase large subunit

ENSG00000148773 MKI67 Antigen KI-67

ENSG00000149554 CHEK1 Serine/threonine-protein kinase Chk1

ENSG00000154473 BUB3 Mitotic checkpoint protein BUB3

ENSG00000156475 PPP2R2B Serine/threonine-protein phosphatase 2A

ENSG00000158691 ZNF96 Zinc finger protein 96

ENSG00000161547 SFRS2 Splicing factor, arginine/serine-rich 2

ENSG00000164032 H2AFZ Histone H2A.Z (H2A/z)

ENSG00000164045 CDC25A M-phase inducer phosphatase 1 see fig S5d

ENSG00000166037 CEP57 Centrosomal protein of 57 kDa

ENSG00000166803 KIAA0101 PCNA-associated factor

ENSG00000166851 PLK1 Serine/threonine-protein kinase PLK1

ENSG00000167900 TK1 Thymidine kinase, cytosolic

ENSG00000167978 SRRM2 Serine/arginine repetitive matrix protein 2

ENSG00000168003 SLC3A2 4F2 cell-surface antigen heavy chain (4F2hc)

ENSG00000170312 CDC2 Cell division control protein 2 homolog

ENSG00000172939 OXSR1 Serine/threonine-protein kinase OSR1

ENSG00000180573 HIST1H2AC Histone H2A type 1-C.

ENSG00000183155 RABIF Guanine nucleotide exchange factor MSS4

ENSG00000183558 HIST2H2AA4 Histone H2A type 2-A (H2A.2).

ENSG00000189403 HMGB1 High mobility group protein B1

ENSG00000196747 HIST1H2AI Histone H3.1

ENSG00000197302 Q7Z2F6_HUMAN Zinc finger protein 720

ENSG00000197905 TEAD4 Transcriptional enhancer factor TEF-3

ENSG00000198901 PRC1 Protein regulator of cytokinesis 1

Table B. E2F1-E2F4 bound genes with both the E2F1-E2F4 and NFYA binding sites within 1 kb upstream of the transcription start site in Homo sapiens.

CpG enrichment

When analysing the entire promoter (8 kb upstream, 2 kb downstream of the TSS) of the target genes of SOX2, POU5F1 and NANOG, we found that these regions are strongly enriched in CpG dinucleotides, a very low-complexity signal linked to accessible chromatin. Indeed, it has been shown that CpG dinucleotides can be methylated and consequently tend to undergo deamination, resulting in a cytosine to thymine transition 11. The CpG status in different parts of the genome therefore reflects the landscape of methylation in the germ line cells. CpG methylation in promoter regions has been shown to be associated with the repression of transcription 11. Hypomethylation of CpG in the promoters of POU5F1, SOX2 and NANOG target genes thus suggests that their promoters are active in the germ line. This finding needs further investigation to address the causality of the binding of SOX2, POU5F1 and NANOG to hypomethylated promoters.

References:
1. Cao, Y. et al. Global and gene-specific analyses show distinct roles for Myod and Myog at a common set of promoters. Embo J 25, 502-11 (2006).
2. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
3. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).
4. Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).
5. MacIsaac, K. D. et al. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7, 113 (2006).
6. Hagerman, P. J. Sequence-directed curvature of DNA. Annu Rev Biochem 59, 755-81 (1990).
7. Jensen, L. J. & Knudsen, S. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics 16, 326-33 (2000).
8. Lorenz, M. C., Cutler, N. S. & Heitman, J. Characterization of alcohol-induced filamentous growth in Saccharomyces cerevisiae. Mol Biol Cell 11, 183-99 (2000).
9. Matuoka, K. & Yu Chen, K. Nuclear factor Y (NF-Y) and cellular senescence. Exp Cell Res 253, 365-71 (1999).
10. Boyer, L. A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).
11. Kass, S. U., Pruss, D. & Wolffe, A. P. How does DNA methylation repress transcription? Trends Genet 13, 444-9 (1997).
12. Bergstrom, D. A. et al. Promoter-specific regulation of MyoD binding and signal transduction cooperate to pattern gene expression. Mol Cell 9, 587-600 (2002).
13. Schreiber, J. et al. Coordinated binding of NF-kappaB family members in the response of human cells to lipopolysaccharide. Proc Natl Acad Sci U S A 103, 5899-904 (2006).
14. Zhang, X. et al. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc Natl Acad Sci U S A 102, 4459-64 (2005).
15. Odom, D. T. et al. Control of pancreas and liver gene expression by HNF transcription factors. Science 303, 1378-81 (2004).
16. Palomero, T. et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci U S A 103, 18261-6 (2006).

Supplementary Methods

Motif description

To perform a thorough search for motifs judged significant by our objective function (see below for definition), the previously published algorithm 1 was used with minor changes. The input to the initial deterministic pattern scan is a set of foreground sequences, containing the potential signals of interest, and a set of random background sequences. As the composition of the background sequence set is of great importance, the user may supply it; otherwise, pre-created sets of randomly sampled genomic sequences are available. The initial pattern scan explores all words containing up to a maximum number of degenerate IUPAC nucleotide characters, excluding the three-fold degenerate characters (HBVD). To approximate the difference in information content between the four-fold degenerate character (N) and the two-fold degenerate characters (RYWSMK), each two-fold degenerate character is scored as one and N as two; the resulting degeneracy sum is the limiting factor in pattern enumeration. The pattern search is not limited by word length, but enumeration stops when the total number of occurrences (including overlaps) falls below a prescribed threshold.
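As an illustration of the degeneracy sum described above, the following Python sketch scores a pattern by counting one for each two-fold degenerate character and two for each N. The function names and the `max_degeneracy` argument are hypothetical, the text does not state the limit actually used, and this is not Trawler's own code.

```python
# Hypothetical helper names; only the scoring rule itself comes from the text.
EXACT = set("ACGT")        # non-degenerate bases, cost 0
TWO_FOLD = set("RYWSMK")   # two-fold degenerate IUPAC codes, cost 1 each
FOUR_FOLD = set("N")       # four-fold degenerate code, cost 2
EXCLUDED = set("HBVD")     # three-fold degenerate codes, not enumerated


def degeneracy_sum(pattern: str) -> int:
    """Return the degeneracy sum of an IUPAC pattern."""
    total = 0
    for ch in pattern.upper():
        if ch in EXACT:
            continue
        if ch in TWO_FOLD:
            total += 1
        elif ch in FOUR_FOLD:
            total += 2
        elif ch in EXCLUDED:
            raise ValueError(f"three-fold degenerate character not allowed: {ch}")
        else:
            raise ValueError(f"not an IUPAC nucleotide character: {ch}")
    return total


def is_enumerable(pattern: str, max_degeneracy: int) -> bool:
    """True if the pattern stays within the (assumed) degeneracy limit."""
    return degeneracy_sum(pattern) <= max_degeneracy
```

For example, with an assumed limit of 2, `GNCAGCTG` and `RCRGCTG` would both be enumerated, whereas a pattern containing N plus any two-fold degenerate character would not.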

Over-representation assessment

The over-representation of these discrete pattern motifs (discrete-motifs) is assessed by comparing the number of occurrences of each motif in the sample sequences with that in a background sequence set composed of a much larger number of randomly picked sequences. The z-score is calculated using the standard normal approximation to the binomial distribution. The function does not correct for the effects of overlapping instances; experiments showed that omitting this correction has no substantial effect on the results while making the score much faster to compute (data not shown). To ascertain a sensible threshold for z-score significance, a randomisation procedure can be run repeatedly (default 20 runs) and plotted, and the mean score plus two standard deviations is taken as the default threshold. In each run, random sequences are sampled with replacement and used in place of the foreground sequences. The given pattern set is then evaluated and the highest-scoring pattern is taken as the result of the run. Although this scheme is heuristic, it behaves robustly as an approximation.
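The exact form of the statistic is not spelled out in the text. A minimal Python sketch is given here, assuming that the per-position occurrence probability is estimated from the background set and that the sample count is compared with its binomial expectation. The function and argument names are hypothetical, and the use of the sample standard deviation in `default_threshold` is also an assumption.

```python
import math
import statistics


def z_score(sample_hits, sample_positions, background_hits, background_positions):
    """Normal approximation to the binomial (overlaps are not corrected for).

    p is the background frequency of the motif per scanned position
    (an assumed estimator, not taken from the text).
    """
    p = background_hits / background_positions
    if not 0.0 < p < 1.0:
        raise ValueError("background frequency must lie strictly between 0 and 1")
    expected = sample_positions * p
    sd = math.sqrt(sample_positions * p * (1.0 - p))
    return (sample_hits - expected) / sd


def default_threshold(best_scores):
    """Mean plus two standard deviations of the best z-score of each random run."""
    return statistics.mean(best_scores) + 2.0 * statistics.stdev(best_scores)
```

Here `best_scores` would hold one value per randomisation run (20 by default), each being the highest z-score obtained when random sequences replace the foreground.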

Pattern enumeration is performed depth-first using a suffix tree. The suffix tree implementation is based on the memory-efficient scheme of Kurtz 2. This scheme has the advantage that large volumes of sequence data (more than 100 megabases) can be stored on current desktop machines with only 1.5 gigabytes of memory. The output of this deterministic pattern scan is the complete set of significant patterns.
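The depth-first enumeration with an occurrence cut-off can be sketched as follows. This Python example is a simplification under stated assumptions: it counts occurrences by plain substring search instead of a suffix tree, it handles only exact (non-degenerate) words, and all names are hypothetical.

```python
def count_overlapping(word, sequences):
    """Count occurrences of word in all sequences, overlaps included."""
    total = 0
    for seq in sequences:
        start = seq.find(word)
        while start != -1:
            total += 1
            start = seq.find(word, start + 1)
    return total


def enumerate_words(sequences, min_occurrences, prefix=""):
    """Depth-first extension of words, pruned when counts drop too low.

    A word is extended only if it still occurs at least min_occurrences
    times, so word length itself never has to be bounded.
    """
    if min_occurrences < 1:
        raise ValueError("min_occurrences must be at least 1")
    found = {}
    for base in "ACGT":
        word = prefix + base
        n = count_overlapping(word, sequences)
        if n < min_occurrences:
            continue  # prune: no extension of this word can occur more often
        found[word] = n
        found.update(enumerate_words(sequences, min_occurrences, word))
    return found
```

The pruning is safe because an extended word can never occur more often than the word it extends. A suffix tree gives the same counts without rescanning the sequences for every candidate.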

Clustering step

To decompose the set of overlapping and redundant putative motifs into a set of position weight matrices (PWMs), a greedy cluster-finding algorithm was applied to an undirected graph of motifs (the vertices) and their similarities (the edges). Edges are created in an all-against-all procedure according to the degree of sequence overlap (which must be more than 70 percent) of motif instances in the sample sequences. Hits on both strands are analysed to take into account reverse-complemented motifs. Each connected subgraph forms a family, and the families are further resolved into clusters using essentially two methods: [1] The most connected node and its directly connected nodes are removed from the graph to form one cluster. The operation is iterated until no cluster can be found. Clusters from the same initial subgraph constitute a family. [2] The percentage overlap for all the motifs constituting a family is calculated as described above. The resulting 2D matrix is clustered using one of the following distance functions: correlation, absolute value of the correlation, uncentered correlation, absolute uncentered correlation, Spearman’s rank correlation, Kendall’s tau, Euclidean distance or city-block distance. The self-organizing map (SOM) algorithm is then performed 3. The operation is repeated until the initial connected graph contains no connected nodes anymore.
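Method [1] can be sketched in Python as shown here. This is an illustration under stated assumptions rather than Trawler's implementation: pairwise overlaps are supplied as a precomputed dictionary, the 70 percent cut-off is treated as inclusive, ties between equally connected nodes are broken alphabetically, and motifs left without neighbours become single-member clusters. All function names are hypothetical.

```python
def build_graph(motifs, overlap, cutoff=0.70):
    """Link two motifs when their instances overlap at least `cutoff` of the time."""
    graph = {m: set() for m in motifs}
    for (a, b), fraction in overlap.items():
        if fraction >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def families(graph):
    """Connected subgraphs of the motif graph, one set of motifs per family."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def greedy_clusters(graph, family):
    """Repeatedly remove the most connected node together with its neighbours."""
    remaining = set(family)
    clusters = []
    while remaining:
        # sorted() makes the tie-break deterministic (alphabetical order)
        hub = max(sorted(remaining), key=lambda n: len(graph[n] & remaining))
        cluster = {hub} | (graph[hub] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```

With overlaps a-b 0.9, a-c 0.8 and c-d 0.75, the motifs a, b, c and d form one family that resolves into the clusters {a, b, c} and {d}.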

Conservation analysis

In order to locate the functional instances of the resulting PWM(s), pairwise

alignments are done using orthologous sequences of the sample sequences using

Blastz 4 with relaxed parameters (K=1000 C=2 P=1 W=6 T=0). Motifs that match the

PWM(s) found are located in the resulting multiple alignments and potential positions

are ranked according to the conservation score. The conservation score is calculated

as the number of cases the motif is found at a particular position in the alignment. For

example if the motif is conserved within 4 species including the reference sequence

the conservation score will be 3 (4 minus the reference sequence).

For the relative conservation score: let $m_{r,k}$ be the motif at position $k$ in the reference species $r$. Let $l$ be the length of the motif and $L$ the total length of the alignment. $P_{m,s,k} = 1$ if the motif $m$ is present in species $s$ at position $k$ in the alignment, and $P_{m,s,k} = 0$ otherwise. The absolute score for motif $m_{r,k}$ is:

$$S_{m_{r,k}} = \sum_{s \neq r} P_{m,s,k}$$

The average score for the alignment is:

$$A_r = \frac{\sum_{i=1}^{L-l+1} S_{m_{r,i}}}{L - l + 1}$$

The relative score for the motif $m_{r,k}$ is:

$$RS_{m_{r,k}} = \frac{S_{m_{r,k}}}{A_r}$$
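The three scores defined above can be illustrated with a short Python sketch. It assumes that motif presence has already been determined and is stored as one list of 0/1 values per species, with one entry for each of the L − l + 1 possible start positions. The data layout and function names are hypothetical, and positions are 0-based in the code.

```python
def absolute_score(presence, reference, k):
    """Number of non-reference species carrying the motif at position k."""
    return sum(row[k] for species, row in presence.items() if species != reference)


def average_score(presence, reference):
    """Mean absolute score over all L - l + 1 start positions."""
    n_positions = len(presence[reference])
    total = sum(absolute_score(presence, reference, k) for k in range(n_positions))
    return total / n_positions


def relative_score(presence, reference, k):
    """Absolute score at position k divided by the alignment-wide average."""
    average = average_score(presence, reference)
    if average == 0:
        raise ValueError("the motif is not conserved anywhere in the alignment")
    return absolute_score(presence, reference, k) / average
```

A motif present in the reference and in three further species at one position has an absolute score of 3 there, in line with the example given in the text.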

Web interface

The complete analysis, including the alignments is summarized on a web interface

that allows the user to navigate in a sequence or motif centric view. Upon clicking on

a specific PWM, all information concerning the PWM is displayed, including the

locations of the sequence instances used to build the PWM. The over-represented

motifs can also be visualized according to the sequence in which they occur. The

motifs are highlighted within the sequence of interest, or within a multiple sequence

alignment if homologs are provided. Alignments are displayed using the Jalview program 5, and the PWM logos are drawn using the WebLogo program 6.

Trawler parameters

Trawler default parameters are the following: minimum number of motif instances in the sample sequences (10); maximum number of mismatches (2); minimum motif length (6 base pairs (bp)); maximum motif length (20 bp). For all the ChIP analyses, the default parameters were used unless otherwise stated. For the first assessment 7, owing to the small sample size, the minimum number of motif instances in the sample sequences was lowered to 3; the other parameters remained the same.

Secondary motif assessment

In order to test if the over-represented motif also co-occurs with the primary motif,

the sequence around the primary motif was fetched (200 bp for yeast, 1000 bp for

mammals) and over-representation was assessed. If the secondary motif was found

over-represented in the vicinity of the primary motif, then both motifs co-occur also at

a sequence level. In yeast, from the 54 data sets correctly analysed by Trawler, 19

(35.2 %) have secondary motifs (Supplementary Fig. 4). For the mammalian data

set, 4 data sets out of 10 (40%) have secondary motifs (Supplementary Fig. 5).
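Fetching the sequence around each primary motif instance can be sketched as below. The text does not say whether the 200 bp and 1000 bp figures are per side or in total. This Python example assumes a per-side flank and clips the window at the sequence ends; the function name and arguments are hypothetical.

```python
def flanking_window(sequence, motif_start, motif_length, flank):
    """Return the motif instance plus up to `flank` bases on each side.

    motif_start is the 0-based start of the primary motif instance;
    the window is clipped at both ends of the sequence.
    """
    left = max(0, motif_start - flank)
    right = min(len(sequence), motif_start + motif_length + flank)
    return sequence[left:right]
```

The windows collected this way would then serve as the sample set for a second round of over-representation assessment.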

Algorithms used for speed comparison

The algorithms were tested on a GNU/Linux operating system with i686 processor. In

all cases, the algorithms were run with parameters that were the closest to the Trawler

default parameters used for the mammalian analysis.

AlignACE: The program was run with the default parameters and with two different values for the number of columns to align (n=5 and n=10).

Meme: The program was run with the following parameters: -mod anf -nmotifs 100 -minw 3 -maxw 20 -revcomp

Motifcut: The parameters were: motif size 6 mers / cluster number 3.

Weeder: The program was run with the « small » option (quick mode, searches for

motif of length 6 bp and 8 bp) and the « large » option (searches for motifs of length 6

bp, 8 bp, 10 bp and 12 bp).

For all programs, the graphical output of the PWM was drawn using the SeqLogo program 6.

Sequences retrieval and analysis

S.cerevisiae analysis:

Sequences corresponding to the different data sets, as well as the background, were downloaded as FASTA-formatted files from http://fraenkel.mit.edu/Harbison/release_v24/ and masked for low-complexity repeats using RepeatMasker (http://repeatmasker.org). The background sequences correspond to the sequences of all probes on the 6k microarray (http://fraenkel.mit.edu/Harbison/release_v24/yeast_Young_6k.fsa), likewise masked for low-complexity repeats using RepeatMasker. All the experiments were analysed by Trawler using the default parameters.

Mammalian analysis:

E2F1 – E2F4 analysis: The human EnsEMBL IDs of the previously published data of Ren et al. 8 were retrieved, and the sequences 700 bp upstream and 200 bp downstream of the annotated gene start sites (EnsEMBL version 38 9) were repeat and exon masked. The sample sequences correspond to the promoter regions of the genes in their Table 3 and the background sequences correspond to the promoter regions of the genes in their supplemental research data (http://www.genesdev.org/cgi/content/full/16/2/245/DC1).

For Fig. 1b, a randomized data set composed of the same number of sequences as in

the sample set was produced by randomly picking sequences from the background.

Trawler was run with the default parameters.

Myod/Myog: The mouse genes targeted by either Myod or Myog in both MDER and C2C12 cells are listed at http://www.nature.com/emboj/journal/v25/n3/extref/7600958s3.pdf. The 750 bp upstream and 250 bp downstream sequences of the annotated transcription start site (TSS) (EnsEMBL version 38 9) were repeat and exon masked.
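The promoter windows used for the mammalian data sets depend on gene orientation, which the following Python sketch makes explicit. It assumes 0-based, half-open coordinates and a TSS given as the position of the first transcribed base. The function name is hypothetical, and masking and sequence retrieval from EnsEMBL are not shown.

```python
def promoter_window(tss, strand, upstream=750, downstream=250):
    """Return (start, end) of the promoter region as a 0-based half-open interval.

    On the '+' strand, upstream means lower coordinates; on the '-' strand,
    upstream means higher coordinates. The start is clipped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), end
```

Calling `promoter_window(tss, strand, upstream=700, downstream=200)` would give the shorter window used for the data sets analysed with 700 bp upstream and 200 bp downstream.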

HNF4A and ONECUT1 (HNF6): All the data were downloaded from

(http://jura.wi.mit.edu/young_public/autoregulation/downloaddata.html). The 750 bp

upstream and 250 bp downstream sequences of the annotated TSS (EnsEMBL version

38 9) were repeat and exon masked.

POU5F1 (OCT4), SOX2 and NANOG analysis: Three data sets coming from the same experiment 10 were analysed. First, the exact loci bound by the transcription factor (TF) were extracted from MacIsaac et al. 11. Second, the entire promoter regions (10 kb) that include the bound loci were also analysed using the data from 10.

NOTCH1: The target genes were taken from the supporting Table 1 of 12 and the 3 kb upstream human sequences were retrieved from EnsEMBL 41 (repeat masked sequences). The background corresponds to 3 kb sequences upstream of 2000 randomly picked genes. Trawler was run with the default parameters, except that the minimum number of motif occurrences was set to 20.

CREB1: The data were downloaded from 16. The first 200 genes recorded in their

Supplementary Table 6 were used. The 750 bp upstream and 250 bp downstream

sequences of the annotated TSS (EnsEMBL version 38 9) were repeat and exon

masked.

NF-kappa-B: The data were downloaded from (http://web.wi.mit.edu/young/nfkb/).

The 700 bp upstream and 200 bp downstream sequences of the annotated TSS

(EnsEMBL version 38 9) were repeat and exon masked.

The regions used in the Trawler analysis correspond to regions described in the ChIP

of the corresponding analysis. The background corresponds to the upstream region of

appropriate length of 2000 randomly picked genes. Trawler was run with the default

parameters.

References:

1. Ettwiller, L. et al. The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biol 6, R104 (2005).

2. Kurtz, S. Reducing the space requirements of suffix trees. Software-Practice and Experience 29, 1149-1171 (1999).

3. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).

4. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).

5. Clamp, M., Cuff, J., Searle, S. M. & Barton, G. J. The Jalview Java alignment editor. Bioinformatics 20, 426-7 (2004).

6. Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res 14, 1188-90 (2004).

7. Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23, 137-44 (2005).

8. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).

9. Birney, E. et al. Ensembl 2006. Nucleic Acids Res 34, D556-61 (2006).

10. Boyer, L. A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).

11. MacIsaac, K. D. et al. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 22, 423-9 (2006).

12. Palomero, T. et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci U S A 103, 18261-6 (2006).
