systematic discovery and characterization of regulatory...

30
Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments Pouya Kheradpour 1,2 and Manolis Kellis 1,2, * 1 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, USA and 2 Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02139, USA Received August 7, 2013; Revised November 6, 2013; Accepted November 7, 2013 ABSTRACT Recent advances in technology have led to a dramatic increase in the number of available tran- scription factor ChIP-seq and ChIP-chip data sets. Understanding the motif content of these data sets is an important step in understanding the underlying mechanisms of regulation. Here we provide a sys- tematic motif analysis for 427 human ChIP-seq data sets using motifs curated from the literature and also discovered de novo using five established motif discovery tools. We use a systematic pipeline for calculating motif enrichment in each data set, providing a principled way for choosing between motif variants found in the literature and for flagging potentially problematic data sets. Our analysis confirms the known specificity of 41 of the 56 analyzed factor groups and reveals motifs of potential cofactors. We also use cell type- specific binding to find factors active in specific conditions. The resource we provide is accessible both for browsing a small number of factors and for performing large-scale systematic analyses. We provide motif matrices, instances and enrichments in each of the ENCODE data sets. The motifs dis- covered here have been used in parallel studies to validate the specificity of antibodies, understand cooperativity between data sets and measure the variation of motif binding across individuals and species. INTRODUCTION Chromatin immunoprecipitation (ChIP) (1) followed by hybridization to an array (ChIP-chip) (2,3) or sequencing (ChIP-seq) (4) enables the genome-wide identification of the binding locations of transcription factors (TFs) present in a given condition and cell type or tissue. As these technologies have matured, their use has become increasingly widespread. The resolution of these experi- mental techniques can be as low as 300 bp for ChIP-chip (5) and 50 bp for ChIP-seq (6), depending on the experi- mental design (e.g. fragment size, paired-end sequencing) and algorithmic processing of the raw data. The use of these technologies on a variety of factors across many cell types has increasingly highlighted the complex nature of TF activity, often violating the simple model of a factor binding to its recognition pattern (motif) in isolation: binding has been shown to be dynamic across cell types, requiring the coordinated binding of cofactors or specific configurations of the underlying chromatin. Moreover, TF binding frequently occurs in the absence of any discernible motif instance (7,8) or to ‘hot-spots’ where several factors are simultaneously found (9). Understanding this complex binding necessitates identify- ing the underlying sequence features responsible. To address this need, we have performed a systematic, motif-centric analysis of hundreds of TF binding experi- ments made available as part of the human ENCODE project (8,10). As part of this, we provide a collection of motifs for each assayed factor, both taken from the litera- ture and through de novo discovery, and also an annota- tion of motif instances genome-wide, which may be used to pinpoint the specific regulatory bases in regions bound by TFs. We found that no single algorithm or database compre- hensively assays the motifs relevant to the binding diver- sity surveyed by ENCODE. Therefore, our approach was to collect motifs from several literature sources (11–16) and supplement them with motifs discovered de novo on the data sets themselves using five established tools (17–21). Although this general approach of using multiple motif discovery tools is popular [e.g. (22–24)], its application to this number of data sets is unprecedented and permits the identification of TFs that are likely to be interacting or participating in common pathways. *To whom correspondence should be addressed. Tel: +1 617 253 2419; Fax:+1 617 452 5034; Email: [email protected] Nucleic Acids Research, 2013, 1–12 doi:10.1093/nar/gkt1249 ß The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Nucleic Acids Research Advance Access published December 13, 2013 at MIT Libraries on December 14, 2013 http://nar.oxfordjournals.org/ Downloaded from

Upload: others

Post on 27-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Systematic discovery and characterization ofregulatory motifs in ENCODE TF bindingexperimentsPouya Kheradpour12 and Manolis Kellis12

1Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 32 Vassar StCambridge MA 02139 USA and 2Broad Institute of MIT and Harvard 7 Cambridge Center Cambridge MA02139 USA

Received August 7 2013 Revised November 6 2013 Accepted November 7 2013

ABSTRACT

Recent advances in technology have led to adramatic increase in the number of available tran-scription factor ChIP-seq and ChIP-chip data setsUnderstanding the motif content of these data setsis an important step in understanding the underlyingmechanisms of regulation Here we provide a sys-tematic motif analysis for 427 human ChIP-seq datasets using motifs curated from the literature andalso discovered de novo using five establishedmotif discovery tools We use a systematicpipeline for calculating motif enrichment in eachdata set providing a principled way for choosingbetween motif variants found in the literature andfor flagging potentially problematic data sets Ouranalysis confirms the known specificity of 41 ofthe 56 analyzed factor groups and reveals motifsof potential cofactors We also use cell type-specific binding to find factors active in specificconditions The resource we provide is accessibleboth for browsing a small number of factors andfor performing large-scale systematic analyses Weprovide motif matrices instances and enrichmentsin each of the ENCODE data sets The motifs dis-covered here have been used in parallel studies tovalidate the specificity of antibodies understandcooperativity between data sets and measure thevariation of motif binding across individuals andspecies

INTRODUCTION

Chromatin immunoprecipitation (ChIP) (1) followed byhybridization to an array (ChIP-chip) (23) or sequencing(ChIP-seq) (4) enables the genome-wide identification ofthe binding locations of transcription factors (TFs)

present in a given condition and cell type or tissue Asthese technologies have matured their use has becomeincreasingly widespread The resolution of these experi-mental techniques can be as low as 300 bp for ChIP-chip(5) and 50 bp for ChIP-seq (6) depending on the experi-mental design (eg fragment size paired-end sequencing)and algorithmic processing of the raw dataThe use of these technologies on a variety of factors

across many cell types has increasingly highlighted thecomplex nature of TF activity often violating the simplemodel of a factor binding to its recognition pattern (motif)in isolation binding has been shown to be dynamic acrosscell types requiring the coordinated binding of cofactorsor specific configurations of the underlying chromatinMoreover TF binding frequently occurs in the absenceof any discernible motif instance (78) or to lsquohot-spotsrsquowhere several factors are simultaneously found (9)Understanding this complex binding necessitates identify-ing the underlying sequence features responsible Toaddress this need we have performed a systematicmotif-centric analysis of hundreds of TF binding experi-ments made available as part of the human ENCODEproject (810) As part of this we provide a collection ofmotifs for each assayed factor both taken from the litera-ture and through de novo discovery and also an annota-tion of motif instances genome-wide which may beused to pinpoint the specific regulatory bases in regionsbound by TFsWe found that no single algorithm or database compre-

hensively assays the motifs relevant to the binding diver-sity surveyed by ENCODE Therefore our approach wasto collect motifs from several literature sources (11ndash16)and supplement them with motifs discovered de novo onthe data sets themselves using five established tools(17ndash21) Although this general approach of usingmultiple motif discovery tools is popular [eg (22ndash24)]its application to this number of data sets is unprecedentedand permits the identification of TFs that are likely to beinteracting or participating in common pathways

To whom correspondence should be addressed Tel +1 617 253 2419 Fax +1 617 452 5034 Email manolimitedu

Nucleic Acids Research 2013 1ndash12doi101093nargkt1249

The Author(s) 2013 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc30) which permits non-commercial re-use distribution and reproduction in any medium provided the original work is properly cited For commercialre-use please contact journalspermissionsoupcom

Nucleic Acids Research Advance Access published December 13 2013 at M

IT L

ibraries on Decem

ber 14 2013httpnaroxfordjournalsorg

Dow

nloaded from

This work is accompanied by a web interface forbrowsing the discovered and literature motifs along withtheir enrichments (Figure 1 httpcompbiomiteduencode-motifs) In addition to the browsing interface weprovide several data files including all motif matrices andtheir matches to the genome as well as software tocompute enrichments and perform unified motif discoverywith the five tools we use Together these permit bothanalyses of individual factors (eg to identify cooperatingTFs) in addition to systematic analysis (eg to examinedifferences between TFs) Moreover the breadth of datasets available enables systematic comparisons andanalyses that are not possible when only one or a fewfactors are studied in isolationLater in the text we describe the details of how the

resource was generated and conduct an initial analysis toprovide examples of its usage and to highlight potentiallyinteresting results

MATERIALS AND METHODS

Our goals were to produce a resource that (i) contains acomprehensive collection of relevant motifs for eachfactor (ii) avoids repetitive weakly enriched motifs thatdo not contribute to the in vivo specificity of the factor orits partners and (iii) excludes variants of the same motifparticularly among the discovered motifs With this inmind we conducted motif discovery separately on eachdata set using five motif discovery tools and manuallyplaced all its data sets into lsquofactor groupsrsquo on the basisof known motifs and homology (Figure 2) Knownmotifs from the literature and the top 10 most enricheddiscovered motifs (excluding duplicates) were collected foreach factor group (see Supplementary Methods) andnamed as TF_known for known motifs and TF_discfor discovered motifs where TF denotes the factorgroup (eg FOXA CTCF etc) Known motifs wereordered arbitrarily whereas the discovered motifs wereordered in descending order of the enrichment value thatwas used for their selectionThe 427 ENCODE experiments analyzed correspond to

123 TFs which we place into 84 factor groups (Figure 3a)We failed to discover an enriched motif for only 12 of the84 factor groups of which 9 lack DNA binding domains(BRF CTBP2 HDAC8 KAT2A NELFE SUPT20HSUZ12 WRNIP1 and XRCC4) as identified by UniProt(27) and 6 have all their data sets flagged as unreliablebased on various quality metrics [BRF KAT2A NELFENR4A SUPT20H and ZZZ3 see (A Kundaje LY JungPV Kharchenko B Wold A Sidow S Batzoglou andPJ Park in preparation)] Of these factor groups onlyNR4A has a previously identified known motifWe exclude from the discussion below motifs that we

consider unlikely to be relevant to our analysis whilemaintaining them as part of the overall resource wherethey may be useful These include 46 discovered motifsthat are either low-complexity (eg dinucleotide repeats)or consistently have weak enrichment (lt2) and do notmatch known motifs (Supplementary Table S1) Theseare likely a consequence of slight biases in the discovery

pipeline or are due to real but relatively weak specificityfor the factor We also exclude an additional 36 motifsthat have a weak similarity to the known motif for thefactor but for which a better matching and enriched motifis also found (Supplementary Table S2) These are mostfrequently seen for longer motifs that can be broken upinto recognizable but globally dissimilar patterns that arenot captured by our automatic exclusion criteria (seeSupplementary Methods) Together these represent 28of the 293 discovered motifs

RESULTS

Using motif similarity metrics we are able to link the dis-covered motifs directly to the TFs that recognize themthrough their known motifs Here we use these inferredrelationships between TFs to make specific biologicalinsights illustrating the types of analyses that ourresource enables In the interest of clarity most descrip-tions of TFs will be omitted but may be found along withfurther references at RefSeq (28) and Entrez (29)

Recovery of known specificity for TFs

Most of the known literature motifs we collect are derivedfrom biochemical in vitro assays Thus they provide alargely independent although somewhat imperfect wayto evaluate the performance of our discovered motifsRecovery of known motifs varies significantly bymethod but taking the most enriched motif (ourpipeline) is competitive with the best single method(Figure 3b) Overall our pipeline found a motifmatching a previously characterized literature motif for41 of the 56 factor groups with a known motif

One of the most striking observations of this analysis ishow frequently other distinct motifs were also found For29 of these 41 factor groups other motifs are found evenafter manually excluding redundant or repetitive motifsand for 9 factor groups one or more of these discoveredmotifs is ranked higher than the motif matching a knownmotif (see Supplementary Table S3) In the next sectionwe will analyze the additional motifs we found for thesefactors which in many cases identify factors known tointeract either cooperatively or competitively

For the remaining 15 of 56 factor groups with a knownmotif (eg HSF NANOG PBX3 SREBP and TAL1) theknown motif is not found at all including NR4A whereno enriched motif is discovered Frequently this is becausethe known motif itself is not enriched and may not accur-ately capture the specificity of the factor in vivo Forexample the lsquoknownrsquo EP300 motif from Transfac waslikely built on a specific bound region of EP300 andwould not accurately capture its binding in all cell typeswhere it interacts with a variety of factors and has noDNA binding domain of its own (we avoided removingsuch motifs to prevent bias in the database) Likewise wedo not discover a motif that matches the known ZBTB33specificity and moreover the known motif itself is notenriched at all in the bound regions

Although some known motifs were of apparently lowquality we largely found our database of known motifs to

2 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

56

All Dist

alN

ear

TSS

FOXA

_disc

3FO

XA_k

now

n6FO

XA_d

isc1

FOXA

_kno

wn5

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_kno

wn3

FOXA

_kno

wn2

FOXA

_kno

wn1

FOXA

_disc

2FO

XA_d

isc5

FOXA

_disc

4

FOXA_disc3FOXA_known6FOXA_disc1FOXA_known5FOXA_known7FOXA_known4FOXA_known3FOXA_known2FOXA_known1FOXA_disc2FOXA_disc5FOXA_disc4

10

05

04

04

04

04

03

03

03

03

03

04

05

10

10

10

09

09

08

08

08

07

04

03

04

10

10

10

09

09

08

09

08

07

04

03

04

10

10

10

09

09

09

09

08

06

04

03

04

09

09

09

10

08

08

08

08

07

04

03

04

09

09

09

08

10

09

09

09

06

04

03

03

08

08

09

08

09

10

10

09

05

04

03

03

08

09

09

08

09

10

10

09

06

04

04

03

08

08

08

08

09

09

09

10

07

04

03

03

07

07

06

07

06

05

06

07

10

04

04

03

04

04

04

04

04

04

04

04

04

10

05

04

03

03

03

03

03

03

04

03

04

05

10

HNF4_disc4FOXJ1_1FOXI1_2FOXI1_1FOXJ2_1FOX_1FOXD3_1FOXD3_2FOXC2_2FOXC1_6FOXL1_5FOXC1_3FOXL1_1FOXF1_1FOXQ1_1FOXQ1_2SRY_3FOXJ2_5FOXK1_1FOXL1_3FOXJ3_1FOXF2_2FOXF2_1FOXD1_1FOXC2_3FOXJ3_2FOXO3_2FOXJ1_2FOXO3_1FOXO4_2FOXO1_2FOXO3_3FOXG1_5FOXO1_3FOXO4_4FOXO6_2FOXI1_3FOXD1_2FOXO3_5FOXB1_4FOXD3_4TCF12_disc2FOXD2_2FOXC1_7FOXL1_4FOXK1_4FOXP3_2FOXJ2_4FOXJ3_7HDAC2_disc2EP300_disc3FOXO1_1FOXO4_1FOXC1_1FOXC1_5FOXB1_3FOXB1_2FOXD3_3FOXD2_1FOXC2_1FOXC1_4ARID5A_1STAT_known8FEV_1SPI1_known2STAT_disc5TCF7L1_2HNF4_known1HNF4_known9HNF4_known6HNF4_known13HNF4_known2 RXRA_disc1NR4A_known4RARA_1NR2F2_2NR4A_known1HNF4_known26NR3C1_disc5TCF12_known1

04

03

03

03

03

03

03

03

04

04

03

03

04

04

04

04

03

04

04

04

04

05

05

04

04

04

04

04

04

04

04

05

05

05

06

06

05

05

05

05

05

04

05

05

05

05

05

05

05

04

04

05

05

03

04

04

03

03

02

02

02

02

03

05

04

04

03

02

03

03

03

03

04

04

04

04

04

05

10

08

07

06

07

08

08

08

07

07

07

07

07

07

07

08

08

07

07

07

09

08

09

09

09

09

09

08

08

09

09

08

08

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

08

08

08

09

09

07

07

06

06

06

05

05

04

04

03

04

03

04

04

04

04

04

04

04

04

04

05

04

02

08

06

07

08

08

08

07

07

07

07

07

07

07

08

07

07

07

07

09

09

09

09

09

09

09

08

08

08

09

08

08

09

09

09

09

09

09

09

09

09

09

09

09

08

09

09

09

09

08

10

10

08

08

08

09

09

07

06

06

06

06

04

05

05

04

03

04

03

04

03

03

03

04

04

04

04

04

04

03

02

08

07

08

08

08

08

07

07

07

07

07

07

07

08

08

07

07

07

08

08

08

08

09

09

09

08

08

08

08

08

08

09

09

09

09

09

09

09

09

09

09

09

09

08

09

09

09

09

09

10

10

08

08

07

09

08

06

06

06

06

06

04

05

04

04

03

04

04

04

03

03

03

03

03

03

04

04

04

03

03

06

08

07

08

08

08

07

07

08

08

08

08

08

08

08

08

08

08

10

10

10

09

09

09

10

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

08

09

08

09

09

09

09

09

09

09

08

08

09

09

09

07

06

06

06

06

04

05

04

04

03

03

03

03

03

03

03

04

04

04

04

04

04

03

01

08

08

08

08

09

09

08

08

08

08

08

08

07

08

08

07

07

07

08

08

08

08

08

08

09

08

08

08

08

07

07

07

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

09

09

08

08

08

09

08

07

07

07

07

07

06

05

04

04

03

05

04

03

03

03

03

03

03

03

03

03

03

03

02

08

09

09

09

09

09

09

09

09

09

09

08

08

09

08

08

06

06

07

08

08

08

08

08

09

08

08

08

07

07

07

07

08

08

08

08

08

07

07

07

08

08

08

08

08

08

08

08

08

09

09

08

08

07

08

07

06

06

06

06

06

05

05

04

04

04

05

04

04

03

03

03

03

03

03

02

03

02

02

02

08

08

09

09

09

09

09

08

09

08

08

08

08

08

08

07

06

06

07

08

08

07

07

08

09

08

07

08

07

07

07

07

08

08

08

07

07

07

07

07

08

08

08

08

08

08

08

08

08

09

09

08

08

07

08

08

06

06

06

06

06

05

05

04

04

04

05

04

04

04

03

03

03

03

03

03

03

02

02

02

07

08

08

09

09

09

09

09

08

08

08

08

08

08

08

08

06

06

07

07

07

07

07

08

09

07

07

07

07

06

07

07

08

07

07

07

07

07

07

07

08

08

08

08

08

08

08

08

08

09

08

07

07

07

09

08

07

07

07

07

07

06

05

04

04

03

04

03

03

03

03

03

03

03

03

03

03

02

02

01

05

03

05

05

06

05

06

06

05

05

05

05

05

05

05

05

04

06

06

06

06

06

06

06

07

06

05

05

05

05

05

06

07

06

07

06

07

07

06

07

07

07

07

06

07

07

06

06

06

07

06

05

05

06

08

08

08

09

09

09

09

08

04

04

04

04

03

04

03

04

04

04

04

04

04

04

04

04

02

02

03

03

04

04

04

04

04

04

04

04

03

04

04

03

04

04

04

05

05

04

04

05

05

04

05

04

04

04

04

04

04

04

04

04

04

04

04

04

04

03

04

03

04

04

04

04

04

04

04

04

04

04

03

03

05

04

04

04

04

03

03

03

04

03

04

05

08

08

08

08

08

08

08

08

08

08

08

08

03

02

04

03

04

04

03

03

04

04

03

03

03

03

04

03

03

04

04

03

03

02

03

03

02

02

03

04

04

02

03

03

03

03

03

03

03

03

03

02

02

03

04

02

04

03

03

03

03

03

03

03

03

03

03

03

03

03

03

04

04

04

04

05

08

08

08

10

04

05

05

05

06

06

06

05

05

05

04

04

04

02

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-C

-20

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-10

1058

FOXA

2_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-65

54FO

XA1_

T-47

D_en

code

-Mye

rs_s

eq_h

sa_D

MSO

-00

2pct

-v04

1610

2-C

-20

FOXA

1_EC

C-1_

enco

de-M

yers

_seq

_hsa

_DM

SO-0

02p

ct-v

0416

102

-C-2

0

10

1412

90

64

48

43

46

40

57

16

11

11

1412

96

66

49

45

47

42

56

17

11

11

1613

117

25

65

15

64

76

61

71

1

10

1412

107

05

24

34

44

05

50

91

4

18

2020

139

36

64

65

14

62

51

01

5

24

33

22

39

32

37

18

20

37

36

34

29

17

29

28

19

11

18

55

48

49

60

70

106

58

72

34

74

43

63

58

87

44

77

17

511

78

84

68

37

23

28

81

74

85

59

44

33

55

52

15

13

11

26

46

57

36

44

44

36

19

27

28

22

29

10

13

25

33

22

41

32

39

18

21

38

37

36

31

17

29

29

19

12

18

58

50

52

60

73

106

88

72

44

84

63

73

79

47

74

67

47

911

79

88

70

37

23

29

83

76

88

61

45

33

57

52

15

13

11

29

49

57

37

46

45

37

19

26

28

25

28

10

12

29

36

23

45

35

42

19

21

41

39

37

33

16

28

30

20

12

19

63

53

55

66

76

117

59

32

55

34

83

83

69

77

85

07

58

413

84

91

79

39

23

30

90

83

98

71

54

40

69

62

17

11

11

29

48

58

38

47

47

38

19

29

30

24

30

10

13

21

34

22

46

34

35

19

22

36

39

37

32

18

32

27

18

11

20

60

51

53

60

72

95

67

93

27

51

52

40

42

108

94

58

68

211

96

106

93

82

53

18

37

28

75

84

23

35

35

01

51

21

51

01

01

01

01

01

01

01

01

01

01

01

01

01

2

19

28

21

48

30

45

14

17

39

39

35

31

13

26

28

15

12

16

76

60

70

70

96

158

412

29

60

68

51

43

1113

33

139

115

1614

82

40

28

36

129

19

93

11

81

32

12

01

01

11

91

01

01

01

01

01

01

01

01

01

01

01

01

51

4

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-C

-20

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-10

1058

FOXA

2_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-65

54FO

XA1_

T-47

D_en

code

-Mye

rs_s

eq_h

sa_D

MSO

-00

2pct

-v04

1610

2-C

-20

FOXA

1_EC

C-1_

enco

de-M

yers

_seq

_hsa

_DM

SO-0

02p

ct-v

0416

102

-C-2

0

10

1412

90

64

48

43

46

40

57

16

11

11

1412

96

66

49

45

47

42

56

17

11

11

1613

117

25

65

15

64

76

61

71

1

10

1412

107

05

24

34

44

05

50

91

4

18

2020

139

36

64

65

14

62

51

01

5

24

33

22

39

32

25

33

22

41

32

29

36

23

45

35

21

34

22

46

34

19

28

21

48

30

44

33

55

52

15

13

11

26

46

45

33

57

52

15

13

11

29

49

54

40

69

62

17

11

11

29

48

42

33

53

50

15

12

15

10

10

18

13

21

20

10

11

19

10

10

27

28

22

29

10

13

26

28

25

28

10

12

29

30

24

30

10

13

10

10

10

10

10

12

10

10

10

10

15

14

FOXA

_kno

wn2

FOXA

_kno

wn3

FOXA

_kno

wn5

FOXA

_kno

wn6

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_disc

2

FOXA

_disc

3

FOXA

_disc

4

FOXA

_disc

5

FOXA

_disc

1

FOXA

_kno

wn1

HNF4_disc4FOXJ1_1FOXI1_2FOXI1_1FOXJ2_1

04

03

03

03

03

07

06

07

08

08

08

06

07

08

08

08

07

08

08

08

06

08

07

08

08

08

08

08

08

09

08

09

09

09

09

08

08

09

09

09

07

08

08

09

09

05

03

05

05

06

03

03

04

04

04

04

03

04

04

03

FOXD3_3FOXD2_1FOXC2_1FOXC1_4ARID5A_1STAT_known8FEV_1SPI1_known2STAT_disc5TCF7L1_2HNF4_known1

03

02

02

02

02

03

05

04

04

03

02

07

06

06

06

05

05

04

04

03

04

03

06

06

06

06

04

05

05

04

03

04

03

06

06

06

06

04

05

04

04

03

04

04

06

06

06

06

04

05

04

04

03

03

03

07

07

07

07

06

05

04

04

03

05

04

06

06

06

06

05

05

04

04

04

05

04

06

06

06

06

05

05

04

04

04

05

04

07

07

07

07

06

05

04

04

03

04

03

09

09

09

09

08

04

04

04

04

03

04

04

04

03

03

03

04

03

04

05

08

08

04

04

04

04

05

08

08

08

10

04

05

RARA_1NR2F2_2NR4A_known1HNF4_known26NR3C1_disc5TCF12_known1

04

04

04

05

10

08

04

04

04

05

04

02

04

04

04

04

03

02

03

04

04

04

03

03

04

04

04

04

03

01

03

03

03

03

03

02

03

02

03

02

02

02

03

03

03

02

02

02

03

03

03

02

02

01

04

04

04

04

02

02

08

08

08

08

03

02

05

05

04

04

04

02

FOXA

_disc

3FO

XA_k

now

n6FO

XA_d

isc1

FOXA

_kno

wn5

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_kno

wn3

FOXA

_kno

wn2

FOXA

_kno

wn1

FOXA

_disc

2FO

XA_d

isc5

FOXA

_disc

4

FOXA_disc3FOXA_known6FOXA_disc1FOXA_known5FOXA_known7FOXA_known4FOXA_known3FOXA_known2FOXA_known1FOXA_disc2FOXA_disc5FOXA_disc4

10

05

04

04

04

04

03

03

03

03

03

04

05

10

10

10

09

09

08

08

08

07

04

03

04

10

10

10

09

09

08

09

08

07

04

03

04

10

10

10

09

09

09

09

08

06

04

03

04

09

09

09

10

08

08

08

08

07

04

03

04

09

09

09

08

10

09

09

09

06

04

03

03

08

08

09

08

09

10

10

09

05

04

03

03

08

09

09

08

09

10

10

09

06

04

04

03

08

08

08

08

09

09

09

10

07

04

03

03

07

07

06

07

06

05

06

07

10

04

04

03

04

04

04

04

04

04

04

04

04

10

05

04

03

03

03

03

03

03

04

03

04

05

10

(a)

(d)

(e)

(b)

(c)

Figure

1PipelineoutputforFOXA

factorgroupanexample

thathighlights

differentaspects

oftheresource(intheinterest

ofspaceonly

selected

columnsare

shownin

theenlargem

ents)(a)

Theknownanddiscovered

motifs

forthefactorgroupdrawnwithWebLogo(25)Because

theoriginalorientationis

arbitrarymotifs

are

flipped

asnecessary

toincrease

thesimilarity

ofthe

displayed

orientations

(bandc)

Sim

ilarity

betweenthemotifs

forthis

factorgroupandthemotifs(b)forthis

factorgroup(c)forallother

motifs

thatmatchatleast

onemotifforthis

factor

groupwithsimilarity

075Motifsimilarities

are

shownin

blackw

hitescale

withgraystartingat065(d)Enrichments

ofmotifs

inthedata

sets

forthis

factorgroupData

sets

are

named

indicatingthefactor

celltypestage

labcode

followed

byvalues

todifferentiate

data

sets

andmakethem

unique

Red

isusedforenrichmentblueis

usedfordepletion(w

hichis

rare

inthis

study)Enrichments

are

notshownformotifs

when

controlmotifs

could

notbegeneratedortherewasnotsufficientinform

ationcontentto

matchata48P-value

(e)Magnified

enrichment

heatm

apvalue

Thethreetrianglesrepresentthebackgroundregionsusedforenrichmentthetoptriangle

isallregions

theleft

triangle

isthose

within

2kbofanannotatedTSSandtheright

triangle

isthose

outsidethe2kbwindow

(theregionsusedin

theleftandrighttrianglespartitionthatofthetoptriangle)Thenumber

shownistheenrichmentin

theinclusivebackgroundHere

weseeanapparentcontradictionahigher

enrichmentfortheunionthanthepartsThisoccurs

because

thehigher

counts

permitasm

aller

confidence

intervalaroundtheratiosusedto

compute

theenrichmentHeatm

apsare

ordered

usinghierarchicalclusteringfollowed

byoptimalleafordering(26)TheSupplementary

Resultsincludeananalysisoftherobustnessofthemotifsimilarity

metricandtheenrichmentvalues

Nucleic Acids Research 2013 3

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

be relatively comprehensive and had difficulty findingmatches to novel motifs outside it An exception isZNF263_disc1 which does not match a motif in ourdatabase but does roughly match the specificity forZNF263 indicated in (30) despite only having weak en-richment (18-fold)Although the motifs that match each other (either

known or discovered) generally have similar enrichmentsin some cases we find substantially higher enrichment forsome motif variants over others (Figure 4 andSupplementary Table S3) For example NFE2_disc1matches the known NFE2 motif but has a 76-foldmaximal enrichment across NFE2 data sets comparedwith 56-fold enrichment for the most enriched knownNFE2 motif Different known motifs for the same factoroften show a broad range in enrichment MEF2 has sixmotifs described in Transfac with an enrichment differen-tial of as much as 4-fold consistently across data sets Thisenrichment analysis provides a systematic way to chooseamong variants of a motifWe also saw varying enrichment of the known motif

depending on the specific data set for a factor group Forexample CTCF_known2 is enriched in CTCF data sets ina range from 30- to 78-fold on identically processed dataThis may be a result of varying quality of the samplesacross data sets or may be a consequence of true biologicaldifferencesIdentifying the sequence specificity for factors that were

previously uncharacterized is of particular interest In all17 factor groups had no known motif but now have dis-covered enriched motifs (BCL BDP1 CCNT2 CHD2CTCFL HDAC2 HMGN3 RAD21 SETDB1 SIRT6SMARC SMC3 SP2 SIN3A THAP1 TRIM28 andZNF263) These discovered motifs may represent thedirect or indirect (eg through cofactors) DNA bindingspecificity

Shared motifs suggest interacting relationships

We find that most factors have motifs for other factorsenriched in their binding sites (summarized inSupplementary Table S4) This may occur due to (i) co-operative binding of the two factors to the same locations(ii) interfering binding between factors where one bindsnear the other to prevent binding (iii) some similarity inmotif specificity (iv) the two factors functioning on asimilar set of genes (eg ones specific to one tissue)without directly interacting or (v) the factors binding tosimilar genomic regions (eg near genes) Our analysisdoes not directly rule out any of these possibilitieshowever (iii) is generally verifiable using our motif simi-larity metrics and (v) can be examined by inspecting onlythe TSS-proximal enrichment

The motif most enriched in multiple data sets was theTPA DNA response element (TRE TGA[CG]TCA)which is recognized by the AP1 TF when it is formed byFOSJUN dimers (31) and other factors including MAFand NFE2 The enrichment of the TRE in a data set isoften stronger than that of even the known in vitrosequence specificity and may arise from a number of phe-nomena including (i) a cooperatively interaction withAP1 (ii) competition with AP1 for the same bindingsites leading to a potentially repressive role for the TFor (iii) reuse of binding sites due to for example accessi-bility of chromatin We find a motif matching the TREmotif for 20 factor groups (AP1_disc3 AP2_disc1BATF_disc1 BCL_disc2 CTCF_disc8-9 EP300_disc1GATA_disc2 HMGN3_disc1 IRF_disc2 MAF_disc1MEF2_disc3 MYC_disc3 NFE2_disc1 NR3C1_disc2PRDM1_disc2 RXRA_disc3 SMARC_disc1 STAT_disc2 TCF7L2_disc1 and TRIM28_disc1)

We found that the enrichment of the TRE to be par-ticularly notable for a few factors GATA and AP1 have

Figure 2 Outline of motif discovery pipeline Input regions for each data set are randomly partitioned into two groups The top 250 regions of oneof the partitions are scanned for motifs using five de novo motif discovery tools These motifs are evaluated using the peaks from the otherpartitioned and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group

4 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

known cooperative binding (32) TFs in the SMARCfactor group are members of the SWISNF chromatinremodeling complex (33) which is necessary for properregulation by FOSJUN dimers (34) and TCF7L2_

disc1 which matches the TRE is more enriched thanthe known TCF7L2 motif (TCF7L2_disc2) in only theTCF7L2 colorectal cancer cell line HCF-116 data set con-sistent with the known interaction of JUN and TCF7L2during intestinal cancer development (35)AP1 also binds to the cAMP response element (CRE T

GACGTCA) when the dimer is formed by ATF3JUN(31) and this is the motif we find as AP1_disc1However AP1_disc3 (which matches the TRE) is themost enriched motif in FOS data sets InterestinglyATF3_disc1 is not the CRE but rather the E-box (seelater in text) We do however find a variant of theCRE (with additional specificity) as ATF3_disc2 Themost enriched discovered motif for E2F E2F_disc1 alsomatches the CRE and is highly enriched in all data setsMYC is a critical regulator which recognizes the E-box

sequence To aid in comparisons we include MAX whichforms complexes with MYC and USF12 which also rec-ognizes the E-box sequence in the MYC factor group

(a)

(b)

Figure 3 (a) Summary of input data used The outside ring indicatesthe experimental data sets (one tick for each of 427) which areseparated into 123 transcription factors (second ring) The TFs arefurther grouped into 84 factor groups (third ring) We are able tofind a matching discovered motif for 41 of the 56 factor groups witha known motif 29 of these 41 factor groups have additional discoveredmotifs that may be associated with cofactors For all but 1 of the 15factor groups where the known motif is not recovered we still findenriched discovered motifs We also discovered enriched motifs for 17of the 28 factor groups without a known motif (b) Recovery of knownmotifs by each of the discovery tools Performance of discovery interms of number of factor groups for which the known motif was re-covered A motif is considered a match if it matches any of the knownmotifs for a factor group (see Supplementary Methods for details onhow matches are computed) The number of additional factors thathave a match is shown with each additional motif (only three motifsare taken from each individual method whereas we have up to 10 forthe pipeline) The number of factor groups with no motif match isshown in parenthesis When multiple data sets exist for a factorgroup the fraction that matches is used in computing its contributionfor computing the performance of the individual tools

Figure 4 Comparison of known versus discovered motifs (selectedwhere discovered better enriched than known all factor groups witha discovered motif matching a known motif in Supplementary TableS3) Displayed is the known and discovered motif with the maximumenrichment across all data sets for a factor group Only the discoveredmotifs that match a known motif for a factor group are consideredThe maximum enrichment is indicated for each factor and in paren-thesis the lsquorawrsquo enrichment for the same data set without the use of theshuffle motifs for correction

Nucleic Acids Research 2013 5

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

We find multiple motifs enriched in MYC binding siteshighlighting the multifunctional role MYC and the otherE-box recognizing proteins play We found a version ofthe E-box with additional specificity (MYC_disc1) thatwas highly enriched in USF12 bound regions (max 98-fold for USF2 versus lt9-fold enrichment for MYCMAX) This motif was more enriched than the knownE-box motifs including known USF motifs in manyUSF data sets We find a second less specific E-boxmotif (MYC_disc2) which shows more even enrichmentacross factors We also find discovered motifs of otherfactors matching the E-box including SIN3A_disc2 (dis-cussed later in text) NFE2_disc2-3 and SIRT6_disc1 It isnotable that although SIRT6 is a chromatin-associatedprotein without a known DNA binding domain (36) theonly discovered motif matches the E-box (with 16-foldenrichment in SIRT6 bound regions) suggesting thatMYC or another E-box recognizing factor may play animportant but indirect chromatin-related roleMotif enrichment is able to identify both positive and

negative interactions for the same factor For exampleSIN3A a corepressor known to interact with a numberof proteins has discovered motifs matching REST(SIN3A_disc1 and more weakly disc3ndash4) and MYC(SIN3A_disc2) These are consistent with SIN3Arsquosknown involvement in repression by REST (37) andSIN3A being a known antagonist for MYC (38)Morever MYC_disc4 matches RFX5 and is enriched

particularly for MAX-bound regions in H1-hESC andGM12878 and MYC_disc5 matches the CEBPB knownmotif and is enriched in MYC regions bound in unstimu-lated K562 cells MXI1 which was not included in theMYC factor group although it does interact with MAXto bind to MYC-MAX sites (39) has MXI1_disc1 thatmatches RFX5 in both the K562 and HeLa-S3 cell linesWe analyzed six IRF family data sets IRF1 binding in

K562 cells stimulated by IFNa (viral innate response) orIFNg (viral bacterial and tumor control) IRF3 inHepG2 GM12878 and HeLa-S3 and IRF4 inGM12878 The most strongly enriched motif(IRF_disc1 matching NFY) is highly enriched (gt20-fold) for all three IRF3 data sets and IRF1 in K562under IFNg stimulation This suggests that binding ofIRF to NFY sites occurs only under specific conditionsand by only some IRF members and potentially expandson the previously documented interaction of NFY andIRF2 at a single promoter (40) IRF_disc4 whichmatches SP1 is enriched in the same cell types albeit atmuch lower levels IRF_disc3 which matches the knownIRF consensus shows weak-to-no enrichment in thesedata sets but shows an enrichment of 88-fold for IRF1bound regions in K562 cells under IFNa stimulation and31-fold enrichment for IRF4 bound regions in GM12878IRF_disc2 which matches the TRE is enriched primarilyin GM12878 regions bound by IRF4 The known SPI1motif matches IRF_disc5 and reciprocally SPI1_disc2matches the IRF motif consistent with the importanceof SPI1 in hematopoietic development (41)Beyond the discovered motif for IRF several other dis-

covered motifs (AP1_disc2 CEBP_disc2 E2F_disc4PBX3_disc1 RFX5_disc2 and SP1_disc1-2) match the

known NFY specificity (CCAAT) These discoveredmotifs are consistent with several known interactions ofNFY RFX5 promotes the cooperative binding betweenRFX and NFY (42) CEBPB and NFY interact in at leastone promoter (43) and SP1 and NFY are known tointeract (44) E2F_disc4 has particularly high enrichmentin E2F4 data sets consistent with the cooperative roleE2F4 and NFY play in cell cycle regulation (45)

STAT factors are involved in regulating number ofgrowth-related functions We analyze STAT1 STAT2and STAT3 here in the context of GM12878 HeLa-S3MCF10A-Er-Src and K562 cells We find relatively con-sistent enrichment of the STAT full site (TTCCNGGAA)which STAT_disc1 matches while finding weak enrich-ment for just the half-site (TTCC) We also find motifsinvolved in other proliferative functions includingSTAT_disc2 which is particularly enriched in STAT3data sets and matches the TRE consistent with STAT3being one of the many interaction partners for AP1 (46)STAT_disc3 matches the IRF consensus and has enrich-ment that is particularly high in STAT1 and STAT2 datasets stimulated by IFNa highlighting the cooperativity ofSTAT factors and IRF in immune functions STAT_disc4is a match to the CEBPB motif and is found enriched inSTAT3 data sets consistent with the known cooperativerole for these two factors (47)

TFs with ETS domains are highly conserved andinvolved in several cellular processes [reviewed in (48)]A number of TFs have discovered motifs that match theETS consensus including EGR1_disc2 GATA_disc3MEF2_disc2 NRF1_disc2 NR2C2_disc1 andPAX5_disc4 These discovered motifs are supported byknown interactions between GATA and ETS in seasquirts (49) MEF2 and the ETS factor PEA3 (50) andNR2C2 with the ETS factor ELK4 (51) MoreoverPAX5 and ETS factors have shared roles in the develop-ment of B-cells (5253) Looking at the discovered ETSmotifs we find that ETS_disc8 matches the known motiffor MYB and the two have been known to cooperate arelationship that is important in the context of certaincancers (54)

THAP1 has two discovered motifs both of whichmatch the known YY1 motif (the first with additionalspecificity added by an apparent HNF4 motif) To ourknowledge the relationship between THAP1 and YY1has not been directly observed however THAP1 hasbeen known to associate with the coactivator HCF-1(55) and YY1 and HCF-1 are known to interact (56)Our result suggests that THAP1 and YY1 possibly withthe addition of HNF4 may interact at least in the K562cell line for which we have THAP1 binding dataRAD21_disc3 also matches YY1 suggesting an additionalinteraction

NANOG an important pluripotency TF has a knownmotif that is only weakly enriched (13-fold) in the boundregions and not discovered by our pipeline We see muchstronger enrichment for the known POU5F1 andPOU2F2 motifs for which we also find similar motifs(NANOG_disc2 and NANOG_disc4 respectively) con-sistent with their shared roles in pluripotency (5758)The interaction of these factors is further supported by

6 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 2: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

This work is accompanied by a web interface forbrowsing the discovered and literature motifs along withtheir enrichments (Figure 1 httpcompbiomiteduencode-motifs) In addition to the browsing interface weprovide several data files including all motif matrices andtheir matches to the genome as well as software tocompute enrichments and perform unified motif discoverywith the five tools we use Together these permit bothanalyses of individual factors (eg to identify cooperatingTFs) in addition to systematic analysis (eg to examinedifferences between TFs) Moreover the breadth of datasets available enables systematic comparisons andanalyses that are not possible when only one or a fewfactors are studied in isolationLater in the text we describe the details of how the

resource was generated and conduct an initial analysis toprovide examples of its usage and to highlight potentiallyinteresting results

MATERIALS AND METHODS

Our goals were to produce a resource that (i) contains acomprehensive collection of relevant motifs for eachfactor (ii) avoids repetitive weakly enriched motifs thatdo not contribute to the in vivo specificity of the factor orits partners and (iii) excludes variants of the same motifparticularly among the discovered motifs With this inmind we conducted motif discovery separately on eachdata set using five motif discovery tools and manuallyplaced all its data sets into lsquofactor groupsrsquo on the basisof known motifs and homology (Figure 2) Knownmotifs from the literature and the top 10 most enricheddiscovered motifs (excluding duplicates) were collected foreach factor group (see Supplementary Methods) andnamed as TF_known for known motifs and TF_discfor discovered motifs where TF denotes the factorgroup (eg FOXA CTCF etc) Known motifs wereordered arbitrarily whereas the discovered motifs wereordered in descending order of the enrichment value thatwas used for their selectionThe 427 ENCODE experiments analyzed correspond to

123 TFs which we place into 84 factor groups (Figure 3a)We failed to discover an enriched motif for only 12 of the84 factor groups of which 9 lack DNA binding domains(BRF CTBP2 HDAC8 KAT2A NELFE SUPT20HSUZ12 WRNIP1 and XRCC4) as identified by UniProt(27) and 6 have all their data sets flagged as unreliablebased on various quality metrics [BRF KAT2A NELFENR4A SUPT20H and ZZZ3 see (A Kundaje LY JungPV Kharchenko B Wold A Sidow S Batzoglou andPJ Park in preparation)] Of these factor groups onlyNR4A has a previously identified known motifWe exclude from the discussion below motifs that we

consider unlikely to be relevant to our analysis whilemaintaining them as part of the overall resource wherethey may be useful These include 46 discovered motifsthat are either low-complexity (eg dinucleotide repeats)or consistently have weak enrichment (lt2) and do notmatch known motifs (Supplementary Table S1) Theseare likely a consequence of slight biases in the discovery

pipeline or are due to real but relatively weak specificityfor the factor We also exclude an additional 36 motifsthat have a weak similarity to the known motif for thefactor but for which a better matching and enriched motifis also found (Supplementary Table S2) These are mostfrequently seen for longer motifs that can be broken upinto recognizable but globally dissimilar patterns that arenot captured by our automatic exclusion criteria (seeSupplementary Methods) Together these represent 28of the 293 discovered motifs

RESULTS

Using motif similarity metrics we are able to link the dis-covered motifs directly to the TFs that recognize themthrough their known motifs Here we use these inferredrelationships between TFs to make specific biologicalinsights illustrating the types of analyses that ourresource enables In the interest of clarity most descrip-tions of TFs will be omitted but may be found along withfurther references at RefSeq (28) and Entrez (29)

Recovery of known specificity for TFs

Most of the known literature motifs we collect are derivedfrom biochemical in vitro assays Thus they provide alargely independent although somewhat imperfect wayto evaluate the performance of our discovered motifsRecovery of known motifs varies significantly bymethod but taking the most enriched motif (ourpipeline) is competitive with the best single method(Figure 3b) Overall our pipeline found a motifmatching a previously characterized literature motif for41 of the 56 factor groups with a known motif

One of the most striking observations of this analysis ishow frequently other distinct motifs were also found For29 of these 41 factor groups other motifs are found evenafter manually excluding redundant or repetitive motifsand for 9 factor groups one or more of these discoveredmotifs is ranked higher than the motif matching a knownmotif (see Supplementary Table S3) In the next sectionwe will analyze the additional motifs we found for thesefactors which in many cases identify factors known tointeract either cooperatively or competitively

For the remaining 15 of 56 factor groups with a knownmotif (eg HSF NANOG PBX3 SREBP and TAL1) theknown motif is not found at all including NR4A whereno enriched motif is discovered Frequently this is becausethe known motif itself is not enriched and may not accur-ately capture the specificity of the factor in vivo Forexample the lsquoknownrsquo EP300 motif from Transfac waslikely built on a specific bound region of EP300 andwould not accurately capture its binding in all cell typeswhere it interacts with a variety of factors and has noDNA binding domain of its own (we avoided removingsuch motifs to prevent bias in the database) Likewise wedo not discover a motif that matches the known ZBTB33specificity and moreover the known motif itself is notenriched at all in the bound regions

Although some known motifs were of apparently lowquality we largely found our database of known motifs to

2 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

56

All Dist

alN

ear

TSS

FOXA

_disc

3FO

XA_k

now

n6FO

XA_d

isc1

FOXA

_kno

wn5

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_kno

wn3

FOXA

_kno

wn2

FOXA

_kno

wn1

FOXA

_disc

2FO

XA_d

isc5

FOXA

_disc

4

FOXA_disc3FOXA_known6FOXA_disc1FOXA_known5FOXA_known7FOXA_known4FOXA_known3FOXA_known2FOXA_known1FOXA_disc2FOXA_disc5FOXA_disc4

10

05

04

04

04

04

03

03

03

03

03

04

05

10

10

10

09

09

08

08

08

07

04

03

04

10

10

10

09

09

08

09

08

07

04

03

04

10

10

10

09

09

09

09

08

06

04

03

04

09

09

09

10

08

08

08

08

07

04

03

04

09

09

09

08

10

09

09

09

06

04

03

03

08

08

09

08

09

10

10

09

05

04

03

03

08

09

09

08

09

10

10

09

06

04

04

03

08

08

08

08

09

09

09

10

07

04

03

03

07

07

06

07

06

05

06

07

10

04

04

03

04

04

04

04

04

04

04

04

04

10

05

04

03

03

03

03

03

03

04

03

04

05

10

HNF4_disc4FOXJ1_1FOXI1_2FOXI1_1FOXJ2_1FOX_1FOXD3_1FOXD3_2FOXC2_2FOXC1_6FOXL1_5FOXC1_3FOXL1_1FOXF1_1FOXQ1_1FOXQ1_2SRY_3FOXJ2_5FOXK1_1FOXL1_3FOXJ3_1FOXF2_2FOXF2_1FOXD1_1FOXC2_3FOXJ3_2FOXO3_2FOXJ1_2FOXO3_1FOXO4_2FOXO1_2FOXO3_3FOXG1_5FOXO1_3FOXO4_4FOXO6_2FOXI1_3FOXD1_2FOXO3_5FOXB1_4FOXD3_4TCF12_disc2FOXD2_2FOXC1_7FOXL1_4FOXK1_4FOXP3_2FOXJ2_4FOXJ3_7HDAC2_disc2EP300_disc3FOXO1_1FOXO4_1FOXC1_1FOXC1_5FOXB1_3FOXB1_2FOXD3_3FOXD2_1FOXC2_1FOXC1_4ARID5A_1STAT_known8FEV_1SPI1_known2STAT_disc5TCF7L1_2HNF4_known1HNF4_known9HNF4_known6HNF4_known13HNF4_known2 RXRA_disc1NR4A_known4RARA_1NR2F2_2NR4A_known1HNF4_known26NR3C1_disc5TCF12_known1

04

03

03

03

03

03

03

03

04

04

03

03

04

04

04

04

03

04

04

04

04

05

05

04

04

04

04

04

04

04

04

05

05

05

06

06

05

05

05

05

05

04

05

05

05

05

05

05

05

04

04

05

05

03

04

04

03

03

02

02

02

02

03

05

04

04

03

02

03

03

03

03

04

04

04

04

04

05

10

08

07

06

07

08

08

08

07

07

07

07

07

07

07

08

08

07

07

07

09

08

09

09

09

09

09

08

08

09

09

08

08

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

08

08

08

09

09

07

07

06

06

06

05

05

04

04

03

04

03

04

04

04

04

04

04

04

04

04

05

04

02

08

06

07

08

08

08

07

07

07

07

07

07

07

08

07

07

07

07

09

09

09

09

09

09

09

08

08

08

09

08

08

09

09

09

09

09

09

09

09

09

09

09

09

08

09

09

09

09

08

10

10

08

08

08

09

09

07

06

06

06

06

04

05

05

04

03

04

03

04

03

03

03

04

04

04

04

04

04

03

02

08

07

08

08

08

08

07

07

07

07

07

07

07

08

08

07

07

07

08

08

08

08

09

09

09

08

08

08

08

08

08

09

09

09

09

09

09

09

09

09

09

09

09

08

09

09

09

09

09

10

10

08

08

07

09

08

06

06

06

06

06

04

05

04

04

03

04

04

04

03

03

03

03

03

03

04

04

04

03

03

06

08

07

08

08

08

07

07

08

08

08

08

08

08

08

08

08

08

10

10

10

09

09

09

10

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

08

09

08

09

09

09

09

09

09

09

08

08

09

09

09

07

06

06

06

06

04

05

04

04

03

03

03

03

03

03

03

04

04

04

04

04

04

03

01

08

08

08

08

09

09

08

08

08

08

08

08

07

08

08

07

07

07

08

08

08

08

08

08

09

08

08

08

08

07

07

07

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

09

09

08

08

08

09

08

07

07

07

07

07

06

05

04

04

03

05

04

03

03

03

03

03

03

03

03

03

03

03

02

08

09

09

09

09

09

09

09

09

09

09

08

08

09

08

08

06

06

07

08

08

08

08

08

09

08

08

08

07

07

07

07

08

08

08

08

08

07

07

07

08

08

08

08

08

08

08

08

08

09

09

08

08

07

08

07

06

06

06

06

06

05

05

04

04

04

05

04

04

03

03

03

03

03

03

02

03

02

02

02

08

08

09

09

09

09

09

08

09

08

08

08

08

08

08

07

06

06

07

08

08

07

07

08

09

08

07

08

07

07

07

07

08

08

08

07

07

07

07

07

08

08

08

08

08

08

08

08

08

09

09

08

08

07

08

08

06

06

06

06

06

05

05

04

04

04

05

04

04

04

03

03

03

03

03

03

03

02

02

02

07

08

08

09

09

09

09

09

08

08

08

08

08

08

08

08

06

06

07

07

07

07

07

08

09

07

07

07

07

06

07

07

08

07

07

07

07

07

07

07

08

08

08

08

08

08

08

08

08

09

08

07

07

07

09

08

07

07

07

07

07

06

05

04

04

03

04

03

03

03

03

03

03

03

03

03

03

02

02

01

05

03

05

05

06

05

06

06

05

05

05

05

05

05

05

05

04

06

06

06

06

06

06

06

07

06

05

05

05

05

05

06

07

06

07

06

07

07

06

07

07

07

07

06

07

07

06

06

06

07

06

05

05

06

08

08

08

09

09

09

09

08

04

04

04

04

03

04

03

04

04

04

04

04

04

04

04

04

02

02

03

03

04

04

04

04

04

04

04

04

03

04

04

03

04

04

04

05

05

04

04

05

05

04

05

04

04

04

04

04

04

04

04

04

04

04

04

04

04

03

04

03

04

04

04

04

04

04

04

04

04

04

03

03

05

04

04

04

04

03

03

03

04

03

04

05

08

08

08

08

08

08

08

08

08

08

08

08

03

02

04

03

04

04

03

03

04

04

03

03

03

03

04

03

03

04

04

03

03

02

03

03

02

02

03

04

04

02

03

03

03

03

03

03

03

03

03

02

02

03

04

02

04

03

03

03

03

03

03

03

03

03

03

03

03

03

03

04

04

04

04

05

08

08

08

10

04

05

05

05

06

06

06

05

05

05

04

04

04

02

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-C

-20

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-10

1058

FOXA

2_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-65

54FO

XA1_

T-47

D_en

code

-Mye

rs_s

eq_h

sa_D

MSO

-00

2pct

-v04

1610

2-C

-20

FOXA

1_EC

C-1_

enco

de-M

yers

_seq

_hsa

_DM

SO-0

02p

ct-v

0416

102

-C-2

0

10

1412

90

64

48

43

46

40

57

16

11

11

1412

96

66

49

45

47

42

56

17

11

11

1613

117

25

65

15

64

76

61

71

1

10

1412

107

05

24

34

44

05

50

91

4

18

2020

139

36

64

65

14

62

51

01

5

24

33

22

39

32

37

18

20

37

36

34

29

17

29

28

19

11

18

55

48

49

60

70

106

58

72

34

74

43

63

58

87

44

77

17

511

78

84

68

37

23

28

81

74

85

59

44

33

55

52

15

13

11

26

46

57

36

44

44

36

19

27

28

22

29

10

13

25

33

22

41

32

39

18

21

38

37

36

31

17

29

29

19

12

18

58

50

52

60

73

106

88

72

44

84

63

73

79

47

74

67

47

911

79

88

70

37

23

29

83

76

88

61

45

33

57

52

15

13

11

29

49

57

37

46

45

37

19

26

28

25

28

10

12

29

36

23

45

35

42

19

21

41

39

37

33

16

28

30

20

12

19

63

53

55

66

76

117

59

32

55

34

83

83

69

77

85

07

58

413

84

91

79

39

23

30

90

83

98

71

54

40

69

62

17

11

11

29

48

58

38

47

47

38

19

29

30

24

30

10

13

21

34

22

46

34

35

19

22

36

39

37

32

18

32

27

18

11

20

60

51

53

60

72

95

67

93

27

51

52

40

42

108

94

58

68

211

96

106

93

82

53

18

37

28

75

84

23

35

35

01

51

21

51

01

01

01

01

01

01

01

01

01

01

01

01

01

2

19

28

21

48

30

45

14

17

39

39

35

31

13

26

28

15

12

16

76

60

70

70

96

158

412

29

60

68

51

43

1113

33

139

115

1614

82

40

28

36

129

19

93

11

81

32

12

01

01

11

91

01

01

01

01

01

01

01

01

01

01

01

01

51

4

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-C

-20

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-10

1058

FOXA

2_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-65

54FO

XA1_

T-47

D_en

code

-Mye

rs_s

eq_h

sa_D

MSO

-00

2pct

-v04

1610

2-C

-20

FOXA

1_EC

C-1_

enco

de-M

yers

_seq

_hsa

_DM

SO-0

02p

ct-v

0416

102

-C-2

0

10

1412

90

64

48

43

46

40

57

16

11

11

1412

96

66

49

45

47

42

56

17

11

11

1613

117

25

65

15

64

76

61

71

1

10

1412

107

05

24

34

44

05

50

91

4

18

2020

139

36

64

65

14

62

51

01

5

24

33

22

39

32

25

33

22

41

32

29

36

23

45

35

21

34

22

46

34

19

28

21

48

30

44

33

55

52

15

13

11

26

46

45

33

57

52

15

13

11

29

49

54

40

69

62

17

11

11

29

48

42

33

53

50

15

12

15

10

10

18

13

21

20

10

11

19

10

10

27

28

22

29

10

13

26

28

25

28

10

12

29

30

24

30

10

13

10

10

10

10

10

12

10

10

10

10

15

14

FOXA

_kno

wn2

FOXA

_kno

wn3

FOXA

_kno

wn5

FOXA

_kno

wn6

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_disc

2

FOXA

_disc

3

FOXA

_disc

4

FOXA

_disc

5

FOXA

_disc

1

FOXA

_kno

wn1

HNF4_disc4FOXJ1_1FOXI1_2FOXI1_1FOXJ2_1

04

03

03

03

03

07

06

07

08

08

08

06

07

08

08

08

07

08

08

08

06

08

07

08

08

08

08

08

08

09

08

09

09

09

09

08

08

09

09

09

07

08

08

09

09

05

03

05

05

06

03

03

04

04

04

04

03

04

04

03

FOXD3_3FOXD2_1FOXC2_1FOXC1_4ARID5A_1STAT_known8FEV_1SPI1_known2STAT_disc5TCF7L1_2HNF4_known1

03

02

02

02

02

03

05

04

04

03

02

07

06

06

06

05

05

04

04

03

04

03

06

06

06

06

04

05

05

04

03

04

03

06

06

06

06

04

05

04

04

03

04

04

06

06

06

06

04

05

04

04

03

03

03

07

07

07

07

06

05

04

04

03

05

04

06

06

06

06

05

05

04

04

04

05

04

06

06

06

06

05

05

04

04

04

05

04

07

07

07

07

06

05

04

04

03

04

03

09

09

09

09

08

04

04

04

04

03

04

04

04

03

03

03

04

03

04

05

08

08

04

04

04

04

05

08

08

08

10

04

05

RARA_1NR2F2_2NR4A_known1HNF4_known26NR3C1_disc5TCF12_known1

04

04

04

05

10

08

04

04

04

05

04

02

04

04

04

04

03

02

03

04

04

04

03

03

04

04

04

04

03

01

03

03

03

03

03

02

03

02

03

02

02

02

03

03

03

02

02

02

03

03

03

02

02

01

04

04

04

04

02

02

08

08

08

08

03

02

05

05

04

04

04

02

FOXA

_disc

3FO

XA_k

now

n6FO

XA_d

isc1

FOXA

_kno

wn5

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_kno

wn3

FOXA

_kno

wn2

FOXA

_kno

wn1

FOXA

_disc

2FO

XA_d

isc5

FOXA

_disc

4

FOXA_disc3FOXA_known6FOXA_disc1FOXA_known5FOXA_known7FOXA_known4FOXA_known3FOXA_known2FOXA_known1FOXA_disc2FOXA_disc5FOXA_disc4

10

05

04

04

04

04

03

03

03

03

03

04

05

10

10

10

09

09

08

08

08

07

04

03

04

10

10

10

09

09

08

09

08

07

04

03

04

10

10

10

09

09

09

09

08

06

04

03

04

09

09

09

10

08

08

08

08

07

04

03

04

09

09

09

08

10

09

09

09

06

04

03

03

08

08

09

08

09

10

10

09

05

04

03

03

08

09

09

08

09

10

10

09

06

04

04

03

08

08

08

08

09

09

09

10

07

04

03

03

07

07

06

07

06

05

06

07

10

04

04

03

04

04

04

04

04

04

04

04

04

10

05

04

03

03

03

03

03

03

04

03

04

05

10

(a)

(d)

(e)

(b)

(c)

Figure

1PipelineoutputforFOXA

factorgroupanexample

thathighlights

differentaspects

oftheresource(intheinterest

ofspaceonly

selected

columnsare

shownin

theenlargem

ents)(a)

Theknownanddiscovered

motifs

forthefactorgroupdrawnwithWebLogo(25)Because

theoriginalorientationis

arbitrarymotifs

are

flipped

asnecessary

toincrease

thesimilarity

ofthe

displayed

orientations

(bandc)

Sim

ilarity

betweenthemotifs

forthis

factorgroupandthemotifs(b)forthis

factorgroup(c)forallother

motifs

thatmatchatleast

onemotifforthis

factor

groupwithsimilarity

075Motifsimilarities

are

shownin

blackw

hitescale

withgraystartingat065(d)Enrichments

ofmotifs

inthedata

sets

forthis

factorgroupData

sets

are

named

indicatingthefactor

celltypestage

labcode

followed

byvalues

todifferentiate

data

sets

andmakethem

unique

Red

isusedforenrichmentblueis

usedfordepletion(w

hichis

rare

inthis

study)Enrichments

are

notshownformotifs

when

controlmotifs

could

notbegeneratedortherewasnotsufficientinform

ationcontentto

matchata48P-value

(e)Magnified

enrichment

heatm

apvalue

Thethreetrianglesrepresentthebackgroundregionsusedforenrichmentthetoptriangle

isallregions

theleft

triangle

isthose

within

2kbofanannotatedTSSandtheright

triangle

isthose

outsidethe2kbwindow

(theregionsusedin

theleftandrighttrianglespartitionthatofthetoptriangle)Thenumber

shownistheenrichmentin

theinclusivebackgroundHere

weseeanapparentcontradictionahigher

enrichmentfortheunionthanthepartsThisoccurs

because

thehigher

counts

permitasm

aller

confidence

intervalaroundtheratiosusedto

compute

theenrichmentHeatm

apsare

ordered

usinghierarchicalclusteringfollowed

byoptimalleafordering(26)TheSupplementary

Resultsincludeananalysisoftherobustnessofthemotifsimilarity

metricandtheenrichmentvalues

Nucleic Acids Research 2013 3

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

be relatively comprehensive and had difficulty findingmatches to novel motifs outside it An exception isZNF263_disc1 which does not match a motif in ourdatabase but does roughly match the specificity forZNF263 indicated in (30) despite only having weak en-richment (18-fold)Although the motifs that match each other (either

known or discovered) generally have similar enrichmentsin some cases we find substantially higher enrichment forsome motif variants over others (Figure 4 andSupplementary Table S3) For example NFE2_disc1matches the known NFE2 motif but has a 76-foldmaximal enrichment across NFE2 data sets comparedwith 56-fold enrichment for the most enriched knownNFE2 motif Different known motifs for the same factoroften show a broad range in enrichment MEF2 has sixmotifs described in Transfac with an enrichment differen-tial of as much as 4-fold consistently across data sets Thisenrichment analysis provides a systematic way to chooseamong variants of a motifWe also saw varying enrichment of the known motif

depending on the specific data set for a factor group Forexample CTCF_known2 is enriched in CTCF data sets ina range from 30- to 78-fold on identically processed dataThis may be a result of varying quality of the samplesacross data sets or may be a consequence of true biologicaldifferencesIdentifying the sequence specificity for factors that were

previously uncharacterized is of particular interest In all17 factor groups had no known motif but now have dis-covered enriched motifs (BCL BDP1 CCNT2 CHD2CTCFL HDAC2 HMGN3 RAD21 SETDB1 SIRT6SMARC SMC3 SP2 SIN3A THAP1 TRIM28 andZNF263) These discovered motifs may represent thedirect or indirect (eg through cofactors) DNA bindingspecificity

Shared motifs suggest interacting relationships

We find that most factors have motifs for other factorsenriched in their binding sites (summarized inSupplementary Table S4) This may occur due to (i) co-operative binding of the two factors to the same locations(ii) interfering binding between factors where one bindsnear the other to prevent binding (iii) some similarity inmotif specificity (iv) the two factors functioning on asimilar set of genes (eg ones specific to one tissue)without directly interacting or (v) the factors binding tosimilar genomic regions (eg near genes) Our analysisdoes not directly rule out any of these possibilitieshowever (iii) is generally verifiable using our motif simi-larity metrics and (v) can be examined by inspecting onlythe TSS-proximal enrichment

The motif most enriched in multiple data sets was theTPA DNA response element (TRE TGA[CG]TCA)which is recognized by the AP1 TF when it is formed byFOSJUN dimers (31) and other factors including MAFand NFE2 The enrichment of the TRE in a data set isoften stronger than that of even the known in vitrosequence specificity and may arise from a number of phe-nomena including (i) a cooperatively interaction withAP1 (ii) competition with AP1 for the same bindingsites leading to a potentially repressive role for the TFor (iii) reuse of binding sites due to for example accessi-bility of chromatin We find a motif matching the TREmotif for 20 factor groups (AP1_disc3 AP2_disc1BATF_disc1 BCL_disc2 CTCF_disc8-9 EP300_disc1GATA_disc2 HMGN3_disc1 IRF_disc2 MAF_disc1MEF2_disc3 MYC_disc3 NFE2_disc1 NR3C1_disc2PRDM1_disc2 RXRA_disc3 SMARC_disc1 STAT_disc2 TCF7L2_disc1 and TRIM28_disc1)

We found that the enrichment of the TRE to be par-ticularly notable for a few factors GATA and AP1 have

Figure 2 Outline of motif discovery pipeline Input regions for each data set are randomly partitioned into two groups The top 250 regions of oneof the partitions are scanned for motifs using five de novo motif discovery tools These motifs are evaluated using the peaks from the otherpartitioned and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group

4 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

known cooperative binding (32) TFs in the SMARCfactor group are members of the SWISNF chromatinremodeling complex (33) which is necessary for properregulation by FOSJUN dimers (34) and TCF7L2_

disc1 which matches the TRE is more enriched thanthe known TCF7L2 motif (TCF7L2_disc2) in only theTCF7L2 colorectal cancer cell line HCF-116 data set con-sistent with the known interaction of JUN and TCF7L2during intestinal cancer development (35)AP1 also binds to the cAMP response element (CRE T

GACGTCA) when the dimer is formed by ATF3JUN(31) and this is the motif we find as AP1_disc1However AP1_disc3 (which matches the TRE) is themost enriched motif in FOS data sets InterestinglyATF3_disc1 is not the CRE but rather the E-box (seelater in text) We do however find a variant of theCRE (with additional specificity) as ATF3_disc2 Themost enriched discovered motif for E2F E2F_disc1 alsomatches the CRE and is highly enriched in all data setsMYC is a critical regulator which recognizes the E-box

sequence To aid in comparisons we include MAX whichforms complexes with MYC and USF12 which also rec-ognizes the E-box sequence in the MYC factor group

(a)

(b)

Figure 3 (a) Summary of input data used The outside ring indicatesthe experimental data sets (one tick for each of 427) which areseparated into 123 transcription factors (second ring) The TFs arefurther grouped into 84 factor groups (third ring) We are able tofind a matching discovered motif for 41 of the 56 factor groups witha known motif 29 of these 41 factor groups have additional discoveredmotifs that may be associated with cofactors For all but 1 of the 15factor groups where the known motif is not recovered we still findenriched discovered motifs We also discovered enriched motifs for 17of the 28 factor groups without a known motif (b) Recovery of knownmotifs by each of the discovery tools Performance of discovery interms of number of factor groups for which the known motif was re-covered A motif is considered a match if it matches any of the knownmotifs for a factor group (see Supplementary Methods for details onhow matches are computed) The number of additional factors thathave a match is shown with each additional motif (only three motifsare taken from each individual method whereas we have up to 10 forthe pipeline) The number of factor groups with no motif match isshown in parenthesis When multiple data sets exist for a factorgroup the fraction that matches is used in computing its contributionfor computing the performance of the individual tools

Figure 4 Comparison of known versus discovered motifs (selectedwhere discovered better enriched than known all factor groups witha discovered motif matching a known motif in Supplementary TableS3) Displayed is the known and discovered motif with the maximumenrichment across all data sets for a factor group Only the discoveredmotifs that match a known motif for a factor group are consideredThe maximum enrichment is indicated for each factor and in paren-thesis the lsquorawrsquo enrichment for the same data set without the use of theshuffle motifs for correction

Nucleic Acids Research 2013 5

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

We find multiple motifs enriched in MYC binding siteshighlighting the multifunctional role MYC and the otherE-box recognizing proteins play We found a version ofthe E-box with additional specificity (MYC_disc1) thatwas highly enriched in USF12 bound regions (max 98-fold for USF2 versus lt9-fold enrichment for MYCMAX) This motif was more enriched than the knownE-box motifs including known USF motifs in manyUSF data sets We find a second less specific E-boxmotif (MYC_disc2) which shows more even enrichmentacross factors We also find discovered motifs of otherfactors matching the E-box including SIN3A_disc2 (dis-cussed later in text) NFE2_disc2-3 and SIRT6_disc1 It isnotable that although SIRT6 is a chromatin-associatedprotein without a known DNA binding domain (36) theonly discovered motif matches the E-box (with 16-foldenrichment in SIRT6 bound regions) suggesting thatMYC or another E-box recognizing factor may play animportant but indirect chromatin-related roleMotif enrichment is able to identify both positive and

negative interactions for the same factor For exampleSIN3A a corepressor known to interact with a numberof proteins has discovered motifs matching REST(SIN3A_disc1 and more weakly disc3ndash4) and MYC(SIN3A_disc2) These are consistent with SIN3Arsquosknown involvement in repression by REST (37) andSIN3A being a known antagonist for MYC (38)Morever MYC_disc4 matches RFX5 and is enriched

particularly for MAX-bound regions in H1-hESC andGM12878 and MYC_disc5 matches the CEBPB knownmotif and is enriched in MYC regions bound in unstimu-lated K562 cells MXI1 which was not included in theMYC factor group although it does interact with MAXto bind to MYC-MAX sites (39) has MXI1_disc1 thatmatches RFX5 in both the K562 and HeLa-S3 cell linesWe analyzed six IRF family data sets IRF1 binding in

K562 cells stimulated by IFNa (viral innate response) orIFNg (viral bacterial and tumor control) IRF3 inHepG2 GM12878 and HeLa-S3 and IRF4 inGM12878 The most strongly enriched motif(IRF_disc1 matching NFY) is highly enriched (gt20-fold) for all three IRF3 data sets and IRF1 in K562under IFNg stimulation This suggests that binding ofIRF to NFY sites occurs only under specific conditionsand by only some IRF members and potentially expandson the previously documented interaction of NFY andIRF2 at a single promoter (40) IRF_disc4 whichmatches SP1 is enriched in the same cell types albeit atmuch lower levels IRF_disc3 which matches the knownIRF consensus shows weak-to-no enrichment in thesedata sets but shows an enrichment of 88-fold for IRF1bound regions in K562 cells under IFNa stimulation and31-fold enrichment for IRF4 bound regions in GM12878IRF_disc2 which matches the TRE is enriched primarilyin GM12878 regions bound by IRF4 The known SPI1motif matches IRF_disc5 and reciprocally SPI1_disc2matches the IRF motif consistent with the importanceof SPI1 in hematopoietic development (41)Beyond the discovered motif for IRF several other dis-

covered motifs (AP1_disc2 CEBP_disc2 E2F_disc4PBX3_disc1 RFX5_disc2 and SP1_disc1-2) match the

known NFY specificity (CCAAT) These discoveredmotifs are consistent with several known interactions ofNFY RFX5 promotes the cooperative binding betweenRFX and NFY (42) CEBPB and NFY interact in at leastone promoter (43) and SP1 and NFY are known tointeract (44) E2F_disc4 has particularly high enrichmentin E2F4 data sets consistent with the cooperative roleE2F4 and NFY play in cell cycle regulation (45)

STAT factors are involved in regulating number ofgrowth-related functions We analyze STAT1 STAT2and STAT3 here in the context of GM12878 HeLa-S3MCF10A-Er-Src and K562 cells We find relatively con-sistent enrichment of the STAT full site (TTCCNGGAA)which STAT_disc1 matches while finding weak enrich-ment for just the half-site (TTCC) We also find motifsinvolved in other proliferative functions includingSTAT_disc2 which is particularly enriched in STAT3data sets and matches the TRE consistent with STAT3being one of the many interaction partners for AP1 (46)STAT_disc3 matches the IRF consensus and has enrich-ment that is particularly high in STAT1 and STAT2 datasets stimulated by IFNa highlighting the cooperativity ofSTAT factors and IRF in immune functions STAT_disc4is a match to the CEBPB motif and is found enriched inSTAT3 data sets consistent with the known cooperativerole for these two factors (47)

TFs with ETS domains are highly conserved andinvolved in several cellular processes [reviewed in (48)]A number of TFs have discovered motifs that match theETS consensus including EGR1_disc2 GATA_disc3MEF2_disc2 NRF1_disc2 NR2C2_disc1 andPAX5_disc4 These discovered motifs are supported byknown interactions between GATA and ETS in seasquirts (49) MEF2 and the ETS factor PEA3 (50) andNR2C2 with the ETS factor ELK4 (51) MoreoverPAX5 and ETS factors have shared roles in the develop-ment of B-cells (5253) Looking at the discovered ETSmotifs we find that ETS_disc8 matches the known motiffor MYB and the two have been known to cooperate arelationship that is important in the context of certaincancers (54)

THAP1 has two discovered motifs both of whichmatch the known YY1 motif (the first with additionalspecificity added by an apparent HNF4 motif) To ourknowledge the relationship between THAP1 and YY1has not been directly observed however THAP1 hasbeen known to associate with the coactivator HCF-1(55) and YY1 and HCF-1 are known to interact (56)Our result suggests that THAP1 and YY1 possibly withthe addition of HNF4 may interact at least in the K562cell line for which we have THAP1 binding dataRAD21_disc3 also matches YY1 suggesting an additionalinteraction

NANOG an important pluripotency TF has a knownmotif that is only weakly enriched (13-fold) in the boundregions and not discovered by our pipeline We see muchstronger enrichment for the known POU5F1 andPOU2F2 motifs for which we also find similar motifs(NANOG_disc2 and NANOG_disc4 respectively) con-sistent with their shared roles in pluripotency (5758)The interaction of these factors is further supported by

6 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 3: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

56

All Dist

alN

ear

TSS

FOXA

_disc

3FO

XA_k

now

n6FO

XA_d

isc1

FOXA

_kno

wn5

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_kno

wn3

FOXA

_kno

wn2

FOXA

_kno

wn1

FOXA

_disc

2FO

XA_d

isc5

FOXA

_disc

4

FOXA_disc3FOXA_known6FOXA_disc1FOXA_known5FOXA_known7FOXA_known4FOXA_known3FOXA_known2FOXA_known1FOXA_disc2FOXA_disc5FOXA_disc4

10

05

04

04

04

04

03

03

03

03

03

04

05

10

10

10

09

09

08

08

08

07

04

03

04

10

10

10

09

09

08

09

08

07

04

03

04

10

10

10

09

09

09

09

08

06

04

03

04

09

09

09

10

08

08

08

08

07

04

03

04

09

09

09

08

10

09

09

09

06

04

03

03

08

08

09

08

09

10

10

09

05

04

03

03

08

09

09

08

09

10

10

09

06

04

04

03

08

08

08

08

09

09

09

10

07

04

03

03

07

07

06

07

06

05

06

07

10

04

04

03

04

04

04

04

04

04

04

04

04

10

05

04

03

03

03

03

03

03

04

03

04

05

10

HNF4_disc4FOXJ1_1FOXI1_2FOXI1_1FOXJ2_1FOX_1FOXD3_1FOXD3_2FOXC2_2FOXC1_6FOXL1_5FOXC1_3FOXL1_1FOXF1_1FOXQ1_1FOXQ1_2SRY_3FOXJ2_5FOXK1_1FOXL1_3FOXJ3_1FOXF2_2FOXF2_1FOXD1_1FOXC2_3FOXJ3_2FOXO3_2FOXJ1_2FOXO3_1FOXO4_2FOXO1_2FOXO3_3FOXG1_5FOXO1_3FOXO4_4FOXO6_2FOXI1_3FOXD1_2FOXO3_5FOXB1_4FOXD3_4TCF12_disc2FOXD2_2FOXC1_7FOXL1_4FOXK1_4FOXP3_2FOXJ2_4FOXJ3_7HDAC2_disc2EP300_disc3FOXO1_1FOXO4_1FOXC1_1FOXC1_5FOXB1_3FOXB1_2FOXD3_3FOXD2_1FOXC2_1FOXC1_4ARID5A_1STAT_known8FEV_1SPI1_known2STAT_disc5TCF7L1_2HNF4_known1HNF4_known9HNF4_known6HNF4_known13HNF4_known2 RXRA_disc1NR4A_known4RARA_1NR2F2_2NR4A_known1HNF4_known26NR3C1_disc5TCF12_known1

04

03

03

03

03

03

03

03

04

04

03

03

04

04

04

04

03

04

04

04

04

05

05

04

04

04

04

04

04

04

04

05

05

05

06

06

05

05

05

05

05

04

05

05

05

05

05

05

05

04

04

05

05

03

04

04

03

03

02

02

02

02

03

05

04

04

03

02

03

03

03

03

04

04

04

04

04

05

10

08

07

06

07

08

08

08

07

07

07

07

07

07

07

08

08

07

07

07

09

08

09

09

09

09

09

08

08

09

09

08

08

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

08

08

08

09

09

07

07

06

06

06

05

05

04

04

03

04

03

04

04

04

04

04

04

04

04

04

05

04

02

08

06

07

08

08

08

07

07

07

07

07

07

07

08

07

07

07

07

09

09

09

09

09

09

09

08

08

08

09

08

08

09

09

09

09

09

09

09

09

09

09

09

09

08

09

09

09

09

08

10

10

08

08

08

09

09

07

06

06

06

06

04

05

05

04

03

04

03

04

03

03

03

04

04

04

04

04

04

03

02

08

07

08

08

08

08

07

07

07

07

07

07

07

08

08

07

07

07

08

08

08

08

09

09

09

08

08

08

08

08

08

09

09

09

09

09

09

09

09

09

09

09

09

08

09

09

09

09

09

10

10

08

08

07

09

08

06

06

06

06

06

04

05

04

04

03

04

04

04

03

03

03

03

03

03

04

04

04

03

03

06

08

07

08

08

08

07

07

08

08

08

08

08

08

08

08

08

08

10

10

10

09

09

09

10

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

09

08

09

08

09

09

09

09

09

09

09

08

08

09

09

09

07

06

06

06

06

04

05

04

04

03

03

03

03

03

03

03

04

04

04

04

04

04

03

01

08

08

08

08

09

09

08

08

08

08

08

08

07

08

08

07

07

07

08

08

08

08

08

08

09

08

08

08

08

07

07

07

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

08

09

09

08

08

08

09

08

07

07

07

07

07

06

05

04

04

03

05

04

03

03

03

03

03

03

03

03

03

03

03

02

08

09

09

09

09

09

09

09

09

09

09

08

08

09

08

08

06

06

07

08

08

08

08

08

09

08

08

08

07

07

07

07

08

08

08

08

08

07

07

07

08

08

08

08

08

08

08

08

08

09

09

08

08

07

08

07

06

06

06

06

06

05

05

04

04

04

05

04

04

03

03

03

03

03

03

02

03

02

02

02

08

08

09

09

09

09

09

08

09

08

08

08

08

08

08

07

06

06

07

08

08

07

07

08

09

08

07

08

07

07

07

07

08

08

08

07

07

07

07

07

08

08

08

08

08

08

08

08

08

09

09

08

08

07

08

08

06

06

06

06

06

05

05

04

04

04

05

04

04

04

03

03

03

03

03

03

03

02

02

02

07

08

08

09

09

09

09

09

08

08

08

08

08

08

08

08

06

06

07

07

07

07

07

08

09

07

07

07

07

06

07

07

08

07

07

07

07

07

07

07

08

08

08

08

08

08

08

08

08

09

08

07

07

07

09

08

07

07

07

07

07

06

05

04

04

03

04

03

03

03

03

03

03

03

03

03

03

02

02

01

05

03

05

05

06

05

06

06

05

05

05

05

05

05

05

05

04

06

06

06

06

06

06

06

07

06

05

05

05

05

05

06

07

06

07

06

07

07

06

07

07

07

07

06

07

07

06

06

06

07

06

05

05

06

08

08

08

09

09

09

09

08

04

04

04

04

03

04

03

04

04

04

04

04

04

04

04

04

02

02

03

03

04

04

04

04

04

04

04

04

03

04

04

03

04

04

04

05

05

04

04

05

05

04

05

04

04

04

04

04

04

04

04

04

04

04

04

04

04

03

04

03

04

04

04

04

04

04

04

04

04

04

03

03

05

04

04

04

04

03

03

03

04

03

04

05

08

08

08

08

08

08

08

08

08

08

08

08

03

02

04

03

04

04

03

03

04

04

03

03

03

03

04

03

03

04

04

03

03

02

03

03

02

02

03

04

04

02

03

03

03

03

03

03

03

03

03

02

02

03

04

02

04

03

03

03

03

03

03

03

03

03

03

03

03

03

03

04

04

04

04

05

08

08

08

10

04

05

05

05

06

06

06

05

05

05

04

04

04

02

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-C

-20

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-10

1058

FOXA

2_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-65

54FO

XA1_

T-47

D_en

code

-Mye

rs_s

eq_h

sa_D

MSO

-00

2pct

-v04

1610

2-C

-20

FOXA

1_EC

C-1_

enco

de-M

yers

_seq

_hsa

_DM

SO-0

02p

ct-v

0416

102

-C-2

0

10

1412

90

64

48

43

46

40

57

16

11

11

1412

96

66

49

45

47

42

56

17

11

11

1613

117

25

65

15

64

76

61

71

1

10

1412

107

05

24

34

44

05

50

91

4

18

2020

139

36

64

65

14

62

51

01

5

24

33

22

39

32

37

18

20

37

36

34

29

17

29

28

19

11

18

55

48

49

60

70

106

58

72

34

74

43

63

58

87

44

77

17

511

78

84

68

37

23

28

81

74

85

59

44

33

55

52

15

13

11

26

46

57

36

44

44

36

19

27

28

22

29

10

13

25

33

22

41

32

39

18

21

38

37

36

31

17

29

29

19

12

18

58

50

52

60

73

106

88

72

44

84

63

73

79

47

74

67

47

911

79

88

70

37

23

29

83

76

88

61

45

33

57

52

15

13

11

29

49

57

37

46

45

37

19

26

28

25

28

10

12

29

36

23

45

35

42

19

21

41

39

37

33

16

28

30

20

12

19

63

53

55

66

76

117

59

32

55

34

83

83

69

77

85

07

58

413

84

91

79

39

23

30

90

83

98

71

54

40

69

62

17

11

11

29

48

58

38

47

47

38

19

29

30

24

30

10

13

21

34

22

46

34

35

19

22

36

39

37

32

18

32

27

18

11

20

60

51

53

60

72

95

67

93

27

51

52

40

42

108

94

58

68

211

96

106

93

82

53

18

37

28

75

84

23

35

35

01

51

21

51

01

01

01

01

01

01

01

01

01

01

01

01

01

2

19

28

21

48

30

45

14

17

39

39

35

31

13

26

28

15

12

16

76

60

70

70

96

158

412

29

60

68

51

43

1113

33

139

115

1614

82

40

28

36

129

19

93

11

81

32

12

01

01

11

91

01

01

01

01

01

01

01

01

01

01

01

01

51

4

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-C

-20

FOXA

1_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-10

1058

FOXA

2_He

pG2_

enco

de-M

yers

_seq

_hsa

_v04

1610

1-S

C-65

54FO

XA1_

T-47

D_en

code

-Mye

rs_s

eq_h

sa_D

MSO

-00

2pct

-v04

1610

2-C

-20

FOXA

1_EC

C-1_

enco

de-M

yers

_seq

_hsa

_DM

SO-0

02p

ct-v

0416

102

-C-2

0

10

1412

90

64

48

43

46

40

57

16

11

11

1412

96

66

49

45

47

42

56

17

11

11

1613

117

25

65

15

64

76

61

71

1

10

1412

107

05

24

34

44

05

50

91

4

18

2020

139

36

64

65

14

62

51

01

5

24

33

22

39

32

25

33

22

41

32

29

36

23

45

35

21

34

22

46

34

19

28

21

48

30

44

33

55

52

15

13

11

26

46

45

33

57

52

15

13

11

29

49

54

40

69

62

17

11

11

29

48

42

33

53

50

15

12

15

10

10

18

13

21

20

10

11

19

10

10

27

28

22

29

10

13

26

28

25

28

10

12

29

30

24

30

10

13

10

10

10

10

10

12

10

10

10

10

15

14

FOXA

_kno

wn2

FOXA

_kno

wn3

FOXA

_kno

wn5

FOXA

_kno

wn6

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_disc

2

FOXA

_disc

3

FOXA

_disc

4

FOXA

_disc

5

FOXA

_disc

1

FOXA

_kno

wn1

HNF4_disc4FOXJ1_1FOXI1_2FOXI1_1FOXJ2_1

04

03

03

03

03

07

06

07

08

08

08

06

07

08

08

08

07

08

08

08

06

08

07

08

08

08

08

08

08

09

08

09

09

09

09

08

08

09

09

09

07

08

08

09

09

05

03

05

05

06

03

03

04

04

04

04

03

04

04

03

FOXD3_3FOXD2_1FOXC2_1FOXC1_4ARID5A_1STAT_known8FEV_1SPI1_known2STAT_disc5TCF7L1_2HNF4_known1

03

02

02

02

02

03

05

04

04

03

02

07

06

06

06

05

05

04

04

03

04

03

06

06

06

06

04

05

05

04

03

04

03

06

06

06

06

04

05

04

04

03

04

04

06

06

06

06

04

05

04

04

03

03

03

07

07

07

07

06

05

04

04

03

05

04

06

06

06

06

05

05

04

04

04

05

04

06

06

06

06

05

05

04

04

04

05

04

07

07

07

07

06

05

04

04

03

04

03

09

09

09

09

08

04

04

04

04

03

04

04

04

03

03

03

04

03

04

05

08

08

04

04

04

04

05

08

08

08

10

04

05

RARA_1NR2F2_2NR4A_known1HNF4_known26NR3C1_disc5TCF12_known1

04

04

04

05

10

08

04

04

04

05

04

02

04

04

04

04

03

02

03

04

04

04

03

03

04

04

04

04

03

01

03

03

03

03

03

02

03

02

03

02

02

02

03

03

03

02

02

02

03

03

03

02

02

01

04

04

04

04

02

02

08

08

08

08

03

02

05

05

04

04

04

02

FOXA

_disc

3FO

XA_k

now

n6FO

XA_d

isc1

FOXA

_kno

wn5

FOXA

_kno

wn7

FOXA

_kno

wn4

FOXA

_kno

wn3

FOXA

_kno

wn2

FOXA

_kno

wn1

FOXA

_disc

2FO

XA_d

isc5

FOXA

_disc

4

FOXA_disc3FOXA_known6FOXA_disc1FOXA_known5FOXA_known7FOXA_known4FOXA_known3FOXA_known2FOXA_known1FOXA_disc2FOXA_disc5FOXA_disc4

10

05

04

04

04

04

03

03

03

03

03

04

05

10

10

10

09

09

08

08

08

07

04

03

04

10

10

10

09

09

08

09

08

07

04

03

04

10

10

10

09

09

09

09

08

06

04

03

04

09

09

09

10

08

08

08

08

07

04

03

04

09

09

09

08

10

09

09

09

06

04

03

03

08

08

09

08

09

10

10

09

05

04

03

03

08

09

09

08

09

10

10

09

06

04

04

03

08

08

08

08

09

09

09

10

07

04

03

03

07

07

06

07

06

05

06

07

10

04

04

03

04

04

04

04

04

04

04

04

04

10

05

04

03

03

03

03

03

03

04

03

04

05

10

(a)

(d)

(e)

(b)

(c)

Figure

1PipelineoutputforFOXA

factorgroupanexample

thathighlights

differentaspects

oftheresource(intheinterest

ofspaceonly

selected

columnsare

shownin

theenlargem

ents)(a)

Theknownanddiscovered

motifs

forthefactorgroupdrawnwithWebLogo(25)Because

theoriginalorientationis

arbitrarymotifs

are

flipped

asnecessary

toincrease

thesimilarity

ofthe

displayed

orientations

(bandc)

Sim

ilarity

betweenthemotifs

forthis

factorgroupandthemotifs(b)forthis

factorgroup(c)forallother

motifs

thatmatchatleast

onemotifforthis

factor

groupwithsimilarity

075Motifsimilarities

are

shownin

blackw

hitescale

withgraystartingat065(d)Enrichments

ofmotifs

inthedata

sets

forthis

factorgroupData

sets

are

named

indicatingthefactor

celltypestage

labcode

followed

byvalues

todifferentiate

data

sets

andmakethem

unique

Red

isusedforenrichmentblueis

usedfordepletion(w

hichis

rare

inthis

study)Enrichments

are

notshownformotifs

when

controlmotifs

could

notbegeneratedortherewasnotsufficientinform

ationcontentto

matchata48P-value

(e)Magnified

enrichment

heatm

apvalue

Thethreetrianglesrepresentthebackgroundregionsusedforenrichmentthetoptriangle

isallregions

theleft

triangle

isthose

within

2kbofanannotatedTSSandtheright

triangle

isthose

outsidethe2kbwindow

(theregionsusedin

theleftandrighttrianglespartitionthatofthetoptriangle)Thenumber

shownistheenrichmentin

theinclusivebackgroundHere

weseeanapparentcontradictionahigher

enrichmentfortheunionthanthepartsThisoccurs

because

thehigher

counts

permitasm

aller

confidence

intervalaroundtheratiosusedto

compute

theenrichmentHeatm

apsare

ordered

usinghierarchicalclusteringfollowed

byoptimalleafordering(26)TheSupplementary

Resultsincludeananalysisoftherobustnessofthemotifsimilarity

metricandtheenrichmentvalues

Nucleic Acids Research 2013 3

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

be relatively comprehensive and had difficulty findingmatches to novel motifs outside it An exception isZNF263_disc1 which does not match a motif in ourdatabase but does roughly match the specificity forZNF263 indicated in (30) despite only having weak en-richment (18-fold)Although the motifs that match each other (either

known or discovered) generally have similar enrichmentsin some cases we find substantially higher enrichment forsome motif variants over others (Figure 4 andSupplementary Table S3) For example NFE2_disc1matches the known NFE2 motif but has a 76-foldmaximal enrichment across NFE2 data sets comparedwith 56-fold enrichment for the most enriched knownNFE2 motif Different known motifs for the same factoroften show a broad range in enrichment MEF2 has sixmotifs described in Transfac with an enrichment differen-tial of as much as 4-fold consistently across data sets Thisenrichment analysis provides a systematic way to chooseamong variants of a motifWe also saw varying enrichment of the known motif

depending on the specific data set for a factor group Forexample CTCF_known2 is enriched in CTCF data sets ina range from 30- to 78-fold on identically processed dataThis may be a result of varying quality of the samplesacross data sets or may be a consequence of true biologicaldifferencesIdentifying the sequence specificity for factors that were

previously uncharacterized is of particular interest In all17 factor groups had no known motif but now have dis-covered enriched motifs (BCL BDP1 CCNT2 CHD2CTCFL HDAC2 HMGN3 RAD21 SETDB1 SIRT6SMARC SMC3 SP2 SIN3A THAP1 TRIM28 andZNF263) These discovered motifs may represent thedirect or indirect (eg through cofactors) DNA bindingspecificity

Shared motifs suggest interacting relationships

We find that most factors have motifs for other factorsenriched in their binding sites (summarized inSupplementary Table S4) This may occur due to (i) co-operative binding of the two factors to the same locations(ii) interfering binding between factors where one bindsnear the other to prevent binding (iii) some similarity inmotif specificity (iv) the two factors functioning on asimilar set of genes (eg ones specific to one tissue)without directly interacting or (v) the factors binding tosimilar genomic regions (eg near genes) Our analysisdoes not directly rule out any of these possibilitieshowever (iii) is generally verifiable using our motif simi-larity metrics and (v) can be examined by inspecting onlythe TSS-proximal enrichment

The motif most enriched in multiple data sets was theTPA DNA response element (TRE TGA[CG]TCA)which is recognized by the AP1 TF when it is formed byFOSJUN dimers (31) and other factors including MAFand NFE2 The enrichment of the TRE in a data set isoften stronger than that of even the known in vitrosequence specificity and may arise from a number of phe-nomena including (i) a cooperatively interaction withAP1 (ii) competition with AP1 for the same bindingsites leading to a potentially repressive role for the TFor (iii) reuse of binding sites due to for example accessi-bility of chromatin We find a motif matching the TREmotif for 20 factor groups (AP1_disc3 AP2_disc1BATF_disc1 BCL_disc2 CTCF_disc8-9 EP300_disc1GATA_disc2 HMGN3_disc1 IRF_disc2 MAF_disc1MEF2_disc3 MYC_disc3 NFE2_disc1 NR3C1_disc2PRDM1_disc2 RXRA_disc3 SMARC_disc1 STAT_disc2 TCF7L2_disc1 and TRIM28_disc1)

We found that the enrichment of the TRE to be par-ticularly notable for a few factors GATA and AP1 have

Figure 2 Outline of motif discovery pipeline Input regions for each data set are randomly partitioned into two groups The top 250 regions of oneof the partitions are scanned for motifs using five de novo motif discovery tools These motifs are evaluated using the peaks from the otherpartitioned and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group

4 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

known cooperative binding (32) TFs in the SMARCfactor group are members of the SWISNF chromatinremodeling complex (33) which is necessary for properregulation by FOSJUN dimers (34) and TCF7L2_

disc1 which matches the TRE is more enriched thanthe known TCF7L2 motif (TCF7L2_disc2) in only theTCF7L2 colorectal cancer cell line HCF-116 data set con-sistent with the known interaction of JUN and TCF7L2during intestinal cancer development (35)AP1 also binds to the cAMP response element (CRE T

GACGTCA) when the dimer is formed by ATF3JUN(31) and this is the motif we find as AP1_disc1However AP1_disc3 (which matches the TRE) is themost enriched motif in FOS data sets InterestinglyATF3_disc1 is not the CRE but rather the E-box (seelater in text) We do however find a variant of theCRE (with additional specificity) as ATF3_disc2 Themost enriched discovered motif for E2F E2F_disc1 alsomatches the CRE and is highly enriched in all data setsMYC is a critical regulator which recognizes the E-box

sequence To aid in comparisons we include MAX whichforms complexes with MYC and USF12 which also rec-ognizes the E-box sequence in the MYC factor group

(a)

(b)

Figure 3 (a) Summary of input data used The outside ring indicatesthe experimental data sets (one tick for each of 427) which areseparated into 123 transcription factors (second ring) The TFs arefurther grouped into 84 factor groups (third ring) We are able tofind a matching discovered motif for 41 of the 56 factor groups witha known motif 29 of these 41 factor groups have additional discoveredmotifs that may be associated with cofactors For all but 1 of the 15factor groups where the known motif is not recovered we still findenriched discovered motifs We also discovered enriched motifs for 17of the 28 factor groups without a known motif (b) Recovery of knownmotifs by each of the discovery tools Performance of discovery interms of number of factor groups for which the known motif was re-covered A motif is considered a match if it matches any of the knownmotifs for a factor group (see Supplementary Methods for details onhow matches are computed) The number of additional factors thathave a match is shown with each additional motif (only three motifsare taken from each individual method whereas we have up to 10 forthe pipeline) The number of factor groups with no motif match isshown in parenthesis When multiple data sets exist for a factorgroup the fraction that matches is used in computing its contributionfor computing the performance of the individual tools

Figure 4 Comparison of known versus discovered motifs (selectedwhere discovered better enriched than known all factor groups witha discovered motif matching a known motif in Supplementary TableS3) Displayed is the known and discovered motif with the maximumenrichment across all data sets for a factor group Only the discoveredmotifs that match a known motif for a factor group are consideredThe maximum enrichment is indicated for each factor and in paren-thesis the lsquorawrsquo enrichment for the same data set without the use of theshuffle motifs for correction

Nucleic Acids Research 2013 5

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

We find multiple motifs enriched in MYC binding siteshighlighting the multifunctional role MYC and the otherE-box recognizing proteins play We found a version ofthe E-box with additional specificity (MYC_disc1) thatwas highly enriched in USF12 bound regions (max 98-fold for USF2 versus lt9-fold enrichment for MYCMAX) This motif was more enriched than the knownE-box motifs including known USF motifs in manyUSF data sets We find a second less specific E-boxmotif (MYC_disc2) which shows more even enrichmentacross factors We also find discovered motifs of otherfactors matching the E-box including SIN3A_disc2 (dis-cussed later in text) NFE2_disc2-3 and SIRT6_disc1 It isnotable that although SIRT6 is a chromatin-associatedprotein without a known DNA binding domain (36) theonly discovered motif matches the E-box (with 16-foldenrichment in SIRT6 bound regions) suggesting thatMYC or another E-box recognizing factor may play animportant but indirect chromatin-related roleMotif enrichment is able to identify both positive and

negative interactions for the same factor For exampleSIN3A a corepressor known to interact with a numberof proteins has discovered motifs matching REST(SIN3A_disc1 and more weakly disc3ndash4) and MYC(SIN3A_disc2) These are consistent with SIN3Arsquosknown involvement in repression by REST (37) andSIN3A being a known antagonist for MYC (38)Morever MYC_disc4 matches RFX5 and is enriched

particularly for MAX-bound regions in H1-hESC andGM12878 and MYC_disc5 matches the CEBPB knownmotif and is enriched in MYC regions bound in unstimu-lated K562 cells MXI1 which was not included in theMYC factor group although it does interact with MAXto bind to MYC-MAX sites (39) has MXI1_disc1 thatmatches RFX5 in both the K562 and HeLa-S3 cell linesWe analyzed six IRF family data sets IRF1 binding in

K562 cells stimulated by IFNa (viral innate response) orIFNg (viral bacterial and tumor control) IRF3 inHepG2 GM12878 and HeLa-S3 and IRF4 inGM12878 The most strongly enriched motif(IRF_disc1 matching NFY) is highly enriched (gt20-fold) for all three IRF3 data sets and IRF1 in K562under IFNg stimulation This suggests that binding ofIRF to NFY sites occurs only under specific conditionsand by only some IRF members and potentially expandson the previously documented interaction of NFY andIRF2 at a single promoter (40) IRF_disc4 whichmatches SP1 is enriched in the same cell types albeit atmuch lower levels IRF_disc3 which matches the knownIRF consensus shows weak-to-no enrichment in thesedata sets but shows an enrichment of 88-fold for IRF1bound regions in K562 cells under IFNa stimulation and31-fold enrichment for IRF4 bound regions in GM12878IRF_disc2 which matches the TRE is enriched primarilyin GM12878 regions bound by IRF4 The known SPI1motif matches IRF_disc5 and reciprocally SPI1_disc2matches the IRF motif consistent with the importanceof SPI1 in hematopoietic development (41)Beyond the discovered motif for IRF several other dis-

covered motifs (AP1_disc2 CEBP_disc2 E2F_disc4PBX3_disc1 RFX5_disc2 and SP1_disc1-2) match the

known NFY specificity (CCAAT) These discoveredmotifs are consistent with several known interactions ofNFY RFX5 promotes the cooperative binding betweenRFX and NFY (42) CEBPB and NFY interact in at leastone promoter (43) and SP1 and NFY are known tointeract (44) E2F_disc4 has particularly high enrichmentin E2F4 data sets consistent with the cooperative roleE2F4 and NFY play in cell cycle regulation (45)

STAT factors are involved in regulating number ofgrowth-related functions We analyze STAT1 STAT2and STAT3 here in the context of GM12878 HeLa-S3MCF10A-Er-Src and K562 cells We find relatively con-sistent enrichment of the STAT full site (TTCCNGGAA)which STAT_disc1 matches while finding weak enrich-ment for just the half-site (TTCC) We also find motifsinvolved in other proliferative functions includingSTAT_disc2 which is particularly enriched in STAT3data sets and matches the TRE consistent with STAT3being one of the many interaction partners for AP1 (46)STAT_disc3 matches the IRF consensus and has enrich-ment that is particularly high in STAT1 and STAT2 datasets stimulated by IFNa highlighting the cooperativity ofSTAT factors and IRF in immune functions STAT_disc4is a match to the CEBPB motif and is found enriched inSTAT3 data sets consistent with the known cooperativerole for these two factors (47)

TFs with ETS domains are highly conserved andinvolved in several cellular processes [reviewed in (48)]A number of TFs have discovered motifs that match theETS consensus including EGR1_disc2 GATA_disc3MEF2_disc2 NRF1_disc2 NR2C2_disc1 andPAX5_disc4 These discovered motifs are supported byknown interactions between GATA and ETS in seasquirts (49) MEF2 and the ETS factor PEA3 (50) andNR2C2 with the ETS factor ELK4 (51) MoreoverPAX5 and ETS factors have shared roles in the develop-ment of B-cells (5253) Looking at the discovered ETSmotifs we find that ETS_disc8 matches the known motiffor MYB and the two have been known to cooperate arelationship that is important in the context of certaincancers (54)

THAP1 has two discovered motifs both of whichmatch the known YY1 motif (the first with additionalspecificity added by an apparent HNF4 motif) To ourknowledge the relationship between THAP1 and YY1has not been directly observed however THAP1 hasbeen known to associate with the coactivator HCF-1(55) and YY1 and HCF-1 are known to interact (56)Our result suggests that THAP1 and YY1 possibly withthe addition of HNF4 may interact at least in the K562cell line for which we have THAP1 binding dataRAD21_disc3 also matches YY1 suggesting an additionalinteraction

NANOG an important pluripotency TF has a knownmotif that is only weakly enriched (13-fold) in the boundregions and not discovered by our pipeline We see muchstronger enrichment for the known POU5F1 andPOU2F2 motifs for which we also find similar motifs(NANOG_disc2 and NANOG_disc4 respectively) con-sistent with their shared roles in pluripotency (5758)The interaction of these factors is further supported by

6 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 4: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

be relatively comprehensive and had difficulty findingmatches to novel motifs outside it An exception isZNF263_disc1 which does not match a motif in ourdatabase but does roughly match the specificity forZNF263 indicated in (30) despite only having weak en-richment (18-fold)Although the motifs that match each other (either

known or discovered) generally have similar enrichmentsin some cases we find substantially higher enrichment forsome motif variants over others (Figure 4 andSupplementary Table S3) For example NFE2_disc1matches the known NFE2 motif but has a 76-foldmaximal enrichment across NFE2 data sets comparedwith 56-fold enrichment for the most enriched knownNFE2 motif Different known motifs for the same factoroften show a broad range in enrichment MEF2 has sixmotifs described in Transfac with an enrichment differen-tial of as much as 4-fold consistently across data sets Thisenrichment analysis provides a systematic way to chooseamong variants of a motifWe also saw varying enrichment of the known motif

depending on the specific data set for a factor group Forexample CTCF_known2 is enriched in CTCF data sets ina range from 30- to 78-fold on identically processed dataThis may be a result of varying quality of the samplesacross data sets or may be a consequence of true biologicaldifferencesIdentifying the sequence specificity for factors that were

previously uncharacterized is of particular interest In all17 factor groups had no known motif but now have dis-covered enriched motifs (BCL BDP1 CCNT2 CHD2CTCFL HDAC2 HMGN3 RAD21 SETDB1 SIRT6SMARC SMC3 SP2 SIN3A THAP1 TRIM28 andZNF263) These discovered motifs may represent thedirect or indirect (eg through cofactors) DNA bindingspecificity

Shared motifs suggest interacting relationships

We find that most factors have motifs for other factorsenriched in their binding sites (summarized inSupplementary Table S4) This may occur due to (i) co-operative binding of the two factors to the same locations(ii) interfering binding between factors where one bindsnear the other to prevent binding (iii) some similarity inmotif specificity (iv) the two factors functioning on asimilar set of genes (eg ones specific to one tissue)without directly interacting or (v) the factors binding tosimilar genomic regions (eg near genes) Our analysisdoes not directly rule out any of these possibilitieshowever (iii) is generally verifiable using our motif simi-larity metrics and (v) can be examined by inspecting onlythe TSS-proximal enrichment

The motif most enriched in multiple data sets was theTPA DNA response element (TRE TGA[CG]TCA)which is recognized by the AP1 TF when it is formed byFOSJUN dimers (31) and other factors including MAFand NFE2 The enrichment of the TRE in a data set isoften stronger than that of even the known in vitrosequence specificity and may arise from a number of phe-nomena including (i) a cooperatively interaction withAP1 (ii) competition with AP1 for the same bindingsites leading to a potentially repressive role for the TFor (iii) reuse of binding sites due to for example accessi-bility of chromatin We find a motif matching the TREmotif for 20 factor groups (AP1_disc3 AP2_disc1BATF_disc1 BCL_disc2 CTCF_disc8-9 EP300_disc1GATA_disc2 HMGN3_disc1 IRF_disc2 MAF_disc1MEF2_disc3 MYC_disc3 NFE2_disc1 NR3C1_disc2PRDM1_disc2 RXRA_disc3 SMARC_disc1 STAT_disc2 TCF7L2_disc1 and TRIM28_disc1)

We found that the enrichment of the TRE to be par-ticularly notable for a few factors GATA and AP1 have

Figure 2 Outline of motif discovery pipeline Input regions for each data set are randomly partitioned into two groups The top 250 regions of oneof the partitions are scanned for motifs using five de novo motif discovery tools These motifs are evaluated using the peaks from the otherpartitioned and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group

4 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

known cooperative binding (32) TFs in the SMARCfactor group are members of the SWISNF chromatinremodeling complex (33) which is necessary for properregulation by FOSJUN dimers (34) and TCF7L2_

disc1 which matches the TRE is more enriched thanthe known TCF7L2 motif (TCF7L2_disc2) in only theTCF7L2 colorectal cancer cell line HCF-116 data set con-sistent with the known interaction of JUN and TCF7L2during intestinal cancer development (35)AP1 also binds to the cAMP response element (CRE T

GACGTCA) when the dimer is formed by ATF3JUN(31) and this is the motif we find as AP1_disc1However AP1_disc3 (which matches the TRE) is themost enriched motif in FOS data sets InterestinglyATF3_disc1 is not the CRE but rather the E-box (seelater in text) We do however find a variant of theCRE (with additional specificity) as ATF3_disc2 Themost enriched discovered motif for E2F E2F_disc1 alsomatches the CRE and is highly enriched in all data setsMYC is a critical regulator which recognizes the E-box

sequence To aid in comparisons we include MAX whichforms complexes with MYC and USF12 which also rec-ognizes the E-box sequence in the MYC factor group

(a)

(b)

Figure 3 (a) Summary of input data used The outside ring indicatesthe experimental data sets (one tick for each of 427) which areseparated into 123 transcription factors (second ring) The TFs arefurther grouped into 84 factor groups (third ring) We are able tofind a matching discovered motif for 41 of the 56 factor groups witha known motif 29 of these 41 factor groups have additional discoveredmotifs that may be associated with cofactors For all but 1 of the 15factor groups where the known motif is not recovered we still findenriched discovered motifs We also discovered enriched motifs for 17of the 28 factor groups without a known motif (b) Recovery of knownmotifs by each of the discovery tools Performance of discovery interms of number of factor groups for which the known motif was re-covered A motif is considered a match if it matches any of the knownmotifs for a factor group (see Supplementary Methods for details onhow matches are computed) The number of additional factors thathave a match is shown with each additional motif (only three motifsare taken from each individual method whereas we have up to 10 forthe pipeline) The number of factor groups with no motif match isshown in parenthesis When multiple data sets exist for a factorgroup the fraction that matches is used in computing its contributionfor computing the performance of the individual tools

Figure 4 Comparison of known versus discovered motifs (selectedwhere discovered better enriched than known all factor groups witha discovered motif matching a known motif in Supplementary TableS3) Displayed is the known and discovered motif with the maximumenrichment across all data sets for a factor group Only the discoveredmotifs that match a known motif for a factor group are consideredThe maximum enrichment is indicated for each factor and in paren-thesis the lsquorawrsquo enrichment for the same data set without the use of theshuffle motifs for correction

Nucleic Acids Research 2013 5

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

We find multiple motifs enriched in MYC binding siteshighlighting the multifunctional role MYC and the otherE-box recognizing proteins play We found a version ofthe E-box with additional specificity (MYC_disc1) thatwas highly enriched in USF12 bound regions (max 98-fold for USF2 versus lt9-fold enrichment for MYCMAX) This motif was more enriched than the knownE-box motifs including known USF motifs in manyUSF data sets We find a second less specific E-boxmotif (MYC_disc2) which shows more even enrichmentacross factors We also find discovered motifs of otherfactors matching the E-box including SIN3A_disc2 (dis-cussed later in text) NFE2_disc2-3 and SIRT6_disc1 It isnotable that although SIRT6 is a chromatin-associatedprotein without a known DNA binding domain (36) theonly discovered motif matches the E-box (with 16-foldenrichment in SIRT6 bound regions) suggesting thatMYC or another E-box recognizing factor may play animportant but indirect chromatin-related roleMotif enrichment is able to identify both positive and

negative interactions for the same factor For exampleSIN3A a corepressor known to interact with a numberof proteins has discovered motifs matching REST(SIN3A_disc1 and more weakly disc3ndash4) and MYC(SIN3A_disc2) These are consistent with SIN3Arsquosknown involvement in repression by REST (37) andSIN3A being a known antagonist for MYC (38)Morever MYC_disc4 matches RFX5 and is enriched

particularly for MAX-bound regions in H1-hESC andGM12878 and MYC_disc5 matches the CEBPB knownmotif and is enriched in MYC regions bound in unstimu-lated K562 cells MXI1 which was not included in theMYC factor group although it does interact with MAXto bind to MYC-MAX sites (39) has MXI1_disc1 thatmatches RFX5 in both the K562 and HeLa-S3 cell linesWe analyzed six IRF family data sets IRF1 binding in

K562 cells stimulated by IFNa (viral innate response) orIFNg (viral bacterial and tumor control) IRF3 inHepG2 GM12878 and HeLa-S3 and IRF4 inGM12878 The most strongly enriched motif(IRF_disc1 matching NFY) is highly enriched (gt20-fold) for all three IRF3 data sets and IRF1 in K562under IFNg stimulation This suggests that binding ofIRF to NFY sites occurs only under specific conditionsand by only some IRF members and potentially expandson the previously documented interaction of NFY andIRF2 at a single promoter (40) IRF_disc4 whichmatches SP1 is enriched in the same cell types albeit atmuch lower levels IRF_disc3 which matches the knownIRF consensus shows weak-to-no enrichment in thesedata sets but shows an enrichment of 88-fold for IRF1bound regions in K562 cells under IFNa stimulation and31-fold enrichment for IRF4 bound regions in GM12878IRF_disc2 which matches the TRE is enriched primarilyin GM12878 regions bound by IRF4 The known SPI1motif matches IRF_disc5 and reciprocally SPI1_disc2matches the IRF motif consistent with the importanceof SPI1 in hematopoietic development (41)Beyond the discovered motif for IRF several other dis-

covered motifs (AP1_disc2 CEBP_disc2 E2F_disc4PBX3_disc1 RFX5_disc2 and SP1_disc1-2) match the

known NFY specificity (CCAAT) These discoveredmotifs are consistent with several known interactions ofNFY RFX5 promotes the cooperative binding betweenRFX and NFY (42) CEBPB and NFY interact in at leastone promoter (43) and SP1 and NFY are known tointeract (44) E2F_disc4 has particularly high enrichmentin E2F4 data sets consistent with the cooperative roleE2F4 and NFY play in cell cycle regulation (45)

STAT factors are involved in regulating number ofgrowth-related functions We analyze STAT1 STAT2and STAT3 here in the context of GM12878 HeLa-S3MCF10A-Er-Src and K562 cells We find relatively con-sistent enrichment of the STAT full site (TTCCNGGAA)which STAT_disc1 matches while finding weak enrich-ment for just the half-site (TTCC) We also find motifsinvolved in other proliferative functions includingSTAT_disc2 which is particularly enriched in STAT3data sets and matches the TRE consistent with STAT3being one of the many interaction partners for AP1 (46)STAT_disc3 matches the IRF consensus and has enrich-ment that is particularly high in STAT1 and STAT2 datasets stimulated by IFNa highlighting the cooperativity ofSTAT factors and IRF in immune functions STAT_disc4is a match to the CEBPB motif and is found enriched inSTAT3 data sets consistent with the known cooperativerole for these two factors (47)

TFs with ETS domains are highly conserved andinvolved in several cellular processes [reviewed in (48)]A number of TFs have discovered motifs that match theETS consensus including EGR1_disc2 GATA_disc3MEF2_disc2 NRF1_disc2 NR2C2_disc1 andPAX5_disc4 These discovered motifs are supported byknown interactions between GATA and ETS in seasquirts (49) MEF2 and the ETS factor PEA3 (50) andNR2C2 with the ETS factor ELK4 (51) MoreoverPAX5 and ETS factors have shared roles in the develop-ment of B-cells (5253) Looking at the discovered ETSmotifs we find that ETS_disc8 matches the known motiffor MYB and the two have been known to cooperate arelationship that is important in the context of certaincancers (54)

THAP1 has two discovered motifs both of whichmatch the known YY1 motif (the first with additionalspecificity added by an apparent HNF4 motif) To ourknowledge the relationship between THAP1 and YY1has not been directly observed however THAP1 hasbeen known to associate with the coactivator HCF-1(55) and YY1 and HCF-1 are known to interact (56)Our result suggests that THAP1 and YY1 possibly withthe addition of HNF4 may interact at least in the K562cell line for which we have THAP1 binding dataRAD21_disc3 also matches YY1 suggesting an additionalinteraction

NANOG an important pluripotency TF has a knownmotif that is only weakly enriched (13-fold) in the boundregions and not discovered by our pipeline We see muchstronger enrichment for the known POU5F1 andPOU2F2 motifs for which we also find similar motifs(NANOG_disc2 and NANOG_disc4 respectively) con-sistent with their shared roles in pluripotency (5758)The interaction of these factors is further supported by

6 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 5: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

known cooperative binding (32) TFs in the SMARCfactor group are members of the SWISNF chromatinremodeling complex (33) which is necessary for properregulation by FOSJUN dimers (34) and TCF7L2_

disc1 which matches the TRE is more enriched thanthe known TCF7L2 motif (TCF7L2_disc2) in only theTCF7L2 colorectal cancer cell line HCF-116 data set con-sistent with the known interaction of JUN and TCF7L2during intestinal cancer development (35)AP1 also binds to the cAMP response element (CRE T

GACGTCA) when the dimer is formed by ATF3JUN(31) and this is the motif we find as AP1_disc1However AP1_disc3 (which matches the TRE) is themost enriched motif in FOS data sets InterestinglyATF3_disc1 is not the CRE but rather the E-box (seelater in text) We do however find a variant of theCRE (with additional specificity) as ATF3_disc2 Themost enriched discovered motif for E2F E2F_disc1 alsomatches the CRE and is highly enriched in all data setsMYC is a critical regulator which recognizes the E-box

sequence To aid in comparisons we include MAX whichforms complexes with MYC and USF12 which also rec-ognizes the E-box sequence in the MYC factor group

(a)

(b)

Figure 3 (a) Summary of input data used The outside ring indicatesthe experimental data sets (one tick for each of 427) which areseparated into 123 transcription factors (second ring) The TFs arefurther grouped into 84 factor groups (third ring) We are able tofind a matching discovered motif for 41 of the 56 factor groups witha known motif 29 of these 41 factor groups have additional discoveredmotifs that may be associated with cofactors For all but 1 of the 15factor groups where the known motif is not recovered we still findenriched discovered motifs We also discovered enriched motifs for 17of the 28 factor groups without a known motif (b) Recovery of knownmotifs by each of the discovery tools Performance of discovery interms of number of factor groups for which the known motif was re-covered A motif is considered a match if it matches any of the knownmotifs for a factor group (see Supplementary Methods for details onhow matches are computed) The number of additional factors thathave a match is shown with each additional motif (only three motifsare taken from each individual method whereas we have up to 10 forthe pipeline) The number of factor groups with no motif match isshown in parenthesis When multiple data sets exist for a factorgroup the fraction that matches is used in computing its contributionfor computing the performance of the individual tools

Figure 4 Comparison of known versus discovered motifs (selectedwhere discovered better enriched than known all factor groups witha discovered motif matching a known motif in Supplementary TableS3) Displayed is the known and discovered motif with the maximumenrichment across all data sets for a factor group Only the discoveredmotifs that match a known motif for a factor group are consideredThe maximum enrichment is indicated for each factor and in paren-thesis the lsquorawrsquo enrichment for the same data set without the use of theshuffle motifs for correction

Nucleic Acids Research 2013 5

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

We find multiple motifs enriched in MYC binding siteshighlighting the multifunctional role MYC and the otherE-box recognizing proteins play We found a version ofthe E-box with additional specificity (MYC_disc1) thatwas highly enriched in USF12 bound regions (max 98-fold for USF2 versus lt9-fold enrichment for MYCMAX) This motif was more enriched than the knownE-box motifs including known USF motifs in manyUSF data sets We find a second less specific E-boxmotif (MYC_disc2) which shows more even enrichmentacross factors We also find discovered motifs of otherfactors matching the E-box including SIN3A_disc2 (dis-cussed later in text) NFE2_disc2-3 and SIRT6_disc1 It isnotable that although SIRT6 is a chromatin-associatedprotein without a known DNA binding domain (36) theonly discovered motif matches the E-box (with 16-foldenrichment in SIRT6 bound regions) suggesting thatMYC or another E-box recognizing factor may play animportant but indirect chromatin-related roleMotif enrichment is able to identify both positive and

negative interactions for the same factor For exampleSIN3A a corepressor known to interact with a numberof proteins has discovered motifs matching REST(SIN3A_disc1 and more weakly disc3ndash4) and MYC(SIN3A_disc2) These are consistent with SIN3Arsquosknown involvement in repression by REST (37) andSIN3A being a known antagonist for MYC (38)Morever MYC_disc4 matches RFX5 and is enriched

particularly for MAX-bound regions in H1-hESC andGM12878 and MYC_disc5 matches the CEBPB knownmotif and is enriched in MYC regions bound in unstimu-lated K562 cells MXI1 which was not included in theMYC factor group although it does interact with MAXto bind to MYC-MAX sites (39) has MXI1_disc1 thatmatches RFX5 in both the K562 and HeLa-S3 cell linesWe analyzed six IRF family data sets IRF1 binding in

K562 cells stimulated by IFNa (viral innate response) orIFNg (viral bacterial and tumor control) IRF3 inHepG2 GM12878 and HeLa-S3 and IRF4 inGM12878 The most strongly enriched motif(IRF_disc1 matching NFY) is highly enriched (gt20-fold) for all three IRF3 data sets and IRF1 in K562under IFNg stimulation This suggests that binding ofIRF to NFY sites occurs only under specific conditionsand by only some IRF members and potentially expandson the previously documented interaction of NFY andIRF2 at a single promoter (40) IRF_disc4 whichmatches SP1 is enriched in the same cell types albeit atmuch lower levels IRF_disc3 which matches the knownIRF consensus shows weak-to-no enrichment in thesedata sets but shows an enrichment of 88-fold for IRF1bound regions in K562 cells under IFNa stimulation and31-fold enrichment for IRF4 bound regions in GM12878IRF_disc2 which matches the TRE is enriched primarilyin GM12878 regions bound by IRF4 The known SPI1motif matches IRF_disc5 and reciprocally SPI1_disc2matches the IRF motif consistent with the importanceof SPI1 in hematopoietic development (41)Beyond the discovered motif for IRF several other dis-

covered motifs (AP1_disc2 CEBP_disc2 E2F_disc4PBX3_disc1 RFX5_disc2 and SP1_disc1-2) match the

known NFY specificity (CCAAT) These discoveredmotifs are consistent with several known interactions ofNFY RFX5 promotes the cooperative binding betweenRFX and NFY (42) CEBPB and NFY interact in at leastone promoter (43) and SP1 and NFY are known tointeract (44) E2F_disc4 has particularly high enrichmentin E2F4 data sets consistent with the cooperative roleE2F4 and NFY play in cell cycle regulation (45)

STAT factors are involved in regulating number ofgrowth-related functions We analyze STAT1 STAT2and STAT3 here in the context of GM12878 HeLa-S3MCF10A-Er-Src and K562 cells We find relatively con-sistent enrichment of the STAT full site (TTCCNGGAA)which STAT_disc1 matches while finding weak enrich-ment for just the half-site (TTCC) We also find motifsinvolved in other proliferative functions includingSTAT_disc2 which is particularly enriched in STAT3data sets and matches the TRE consistent with STAT3being one of the many interaction partners for AP1 (46)STAT_disc3 matches the IRF consensus and has enrich-ment that is particularly high in STAT1 and STAT2 datasets stimulated by IFNa highlighting the cooperativity ofSTAT factors and IRF in immune functions STAT_disc4is a match to the CEBPB motif and is found enriched inSTAT3 data sets consistent with the known cooperativerole for these two factors (47)

TFs with ETS domains are highly conserved andinvolved in several cellular processes [reviewed in (48)]A number of TFs have discovered motifs that match theETS consensus including EGR1_disc2 GATA_disc3MEF2_disc2 NRF1_disc2 NR2C2_disc1 andPAX5_disc4 These discovered motifs are supported byknown interactions between GATA and ETS in seasquirts (49) MEF2 and the ETS factor PEA3 (50) andNR2C2 with the ETS factor ELK4 (51) MoreoverPAX5 and ETS factors have shared roles in the develop-ment of B-cells (5253) Looking at the discovered ETSmotifs we find that ETS_disc8 matches the known motiffor MYB and the two have been known to cooperate arelationship that is important in the context of certaincancers (54)

THAP1 has two discovered motifs both of whichmatch the known YY1 motif (the first with additionalspecificity added by an apparent HNF4 motif) To ourknowledge the relationship between THAP1 and YY1has not been directly observed however THAP1 hasbeen known to associate with the coactivator HCF-1(55) and YY1 and HCF-1 are known to interact (56)Our result suggests that THAP1 and YY1 possibly withthe addition of HNF4 may interact at least in the K562cell line for which we have THAP1 binding dataRAD21_disc3 also matches YY1 suggesting an additionalinteraction

NANOG an important pluripotency TF has a knownmotif that is only weakly enriched (13-fold) in the boundregions and not discovered by our pipeline We see muchstronger enrichment for the known POU5F1 andPOU2F2 motifs for which we also find similar motifs(NANOG_disc2 and NANOG_disc4 respectively) con-sistent with their shared roles in pluripotency (5758)The interaction of these factors is further supported by

6 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 6: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

We find multiple motifs enriched in MYC binding siteshighlighting the multifunctional role MYC and the otherE-box recognizing proteins play We found a version ofthe E-box with additional specificity (MYC_disc1) thatwas highly enriched in USF12 bound regions (max 98-fold for USF2 versus lt9-fold enrichment for MYCMAX) This motif was more enriched than the knownE-box motifs including known USF motifs in manyUSF data sets We find a second less specific E-boxmotif (MYC_disc2) which shows more even enrichmentacross factors We also find discovered motifs of otherfactors matching the E-box including SIN3A_disc2 (dis-cussed later in text) NFE2_disc2-3 and SIRT6_disc1 It isnotable that although SIRT6 is a chromatin-associatedprotein without a known DNA binding domain (36) theonly discovered motif matches the E-box (with 16-foldenrichment in SIRT6 bound regions) suggesting thatMYC or another E-box recognizing factor may play animportant but indirect chromatin-related roleMotif enrichment is able to identify both positive and

negative interactions for the same factor For exampleSIN3A a corepressor known to interact with a numberof proteins has discovered motifs matching REST(SIN3A_disc1 and more weakly disc3ndash4) and MYC(SIN3A_disc2) These are consistent with SIN3Arsquosknown involvement in repression by REST (37) andSIN3A being a known antagonist for MYC (38)Morever MYC_disc4 matches RFX5 and is enriched

particularly for MAX-bound regions in H1-hESC andGM12878 and MYC_disc5 matches the CEBPB knownmotif and is enriched in MYC regions bound in unstimu-lated K562 cells MXI1 which was not included in theMYC factor group although it does interact with MAXto bind to MYC-MAX sites (39) has MXI1_disc1 thatmatches RFX5 in both the K562 and HeLa-S3 cell linesWe analyzed six IRF family data sets IRF1 binding in

K562 cells stimulated by IFNa (viral innate response) orIFNg (viral bacterial and tumor control) IRF3 inHepG2 GM12878 and HeLa-S3 and IRF4 inGM12878 The most strongly enriched motif(IRF_disc1 matching NFY) is highly enriched (gt20-fold) for all three IRF3 data sets and IRF1 in K562under IFNg stimulation This suggests that binding ofIRF to NFY sites occurs only under specific conditionsand by only some IRF members and potentially expandson the previously documented interaction of NFY andIRF2 at a single promoter (40) IRF_disc4 whichmatches SP1 is enriched in the same cell types albeit atmuch lower levels IRF_disc3 which matches the knownIRF consensus shows weak-to-no enrichment in thesedata sets but shows an enrichment of 88-fold for IRF1bound regions in K562 cells under IFNa stimulation and31-fold enrichment for IRF4 bound regions in GM12878IRF_disc2 which matches the TRE is enriched primarilyin GM12878 regions bound by IRF4 The known SPI1motif matches IRF_disc5 and reciprocally SPI1_disc2matches the IRF motif consistent with the importanceof SPI1 in hematopoietic development (41)Beyond the discovered motif for IRF several other dis-

covered motifs (AP1_disc2 CEBP_disc2 E2F_disc4PBX3_disc1 RFX5_disc2 and SP1_disc1-2) match the

known NFY specificity (CCAAT) These discoveredmotifs are consistent with several known interactions ofNFY RFX5 promotes the cooperative binding betweenRFX and NFY (42) CEBPB and NFY interact in at leastone promoter (43) and SP1 and NFY are known tointeract (44) E2F_disc4 has particularly high enrichmentin E2F4 data sets consistent with the cooperative roleE2F4 and NFY play in cell cycle regulation (45)

STAT factors are involved in regulating number ofgrowth-related functions We analyze STAT1 STAT2and STAT3 here in the context of GM12878 HeLa-S3MCF10A-Er-Src and K562 cells We find relatively con-sistent enrichment of the STAT full site (TTCCNGGAA)which STAT_disc1 matches while finding weak enrich-ment for just the half-site (TTCC) We also find motifsinvolved in other proliferative functions includingSTAT_disc2 which is particularly enriched in STAT3data sets and matches the TRE consistent with STAT3being one of the many interaction partners for AP1 (46)STAT_disc3 matches the IRF consensus and has enrich-ment that is particularly high in STAT1 and STAT2 datasets stimulated by IFNa highlighting the cooperativity ofSTAT factors and IRF in immune functions STAT_disc4is a match to the CEBPB motif and is found enriched inSTAT3 data sets consistent with the known cooperativerole for these two factors (47)

TFs with ETS domains are highly conserved andinvolved in several cellular processes [reviewed in (48)]A number of TFs have discovered motifs that match theETS consensus including EGR1_disc2 GATA_disc3MEF2_disc2 NRF1_disc2 NR2C2_disc1 andPAX5_disc4 These discovered motifs are supported byknown interactions between GATA and ETS in seasquirts (49) MEF2 and the ETS factor PEA3 (50) andNR2C2 with the ETS factor ELK4 (51) MoreoverPAX5 and ETS factors have shared roles in the develop-ment of B-cells (5253) Looking at the discovered ETSmotifs we find that ETS_disc8 matches the known motiffor MYB and the two have been known to cooperate arelationship that is important in the context of certaincancers (54)

THAP1 has two discovered motifs both of whichmatch the known YY1 motif (the first with additionalspecificity added by an apparent HNF4 motif) To ourknowledge the relationship between THAP1 and YY1has not been directly observed however THAP1 hasbeen known to associate with the coactivator HCF-1(55) and YY1 and HCF-1 are known to interact (56)Our result suggests that THAP1 and YY1 possibly withthe addition of HNF4 may interact at least in the K562cell line for which we have THAP1 binding dataRAD21_disc3 also matches YY1 suggesting an additionalinteraction

NANOG an important pluripotency TF has a knownmotif that is only weakly enriched (13-fold) in the boundregions and not discovered by our pipeline We see muchstronger enrichment for the known POU5F1 andPOU2F2 motifs for which we also find similar motifs(NANOG_disc2 and NANOG_disc4 respectively) con-sistent with their shared roles in pluripotency (5758)The interaction of these factors is further supported by

6 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 7: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

POU5F1_disc2 matching the known POU2F2 motifAdditionally NANOG_disc2 and disc3 match theknown motifs for TCF7L2 and TCF12 respectivelyagain consistent with the important role TCF proteinsplay in stem cells (59)

CTCF plays a variety of vital roles in the organizationof chromatin architecture (60) and the motifs we discovermatching the known CTCF specificity (RAD21_disc1SMC3_disc12-4 CTCFL_disc110 ZBTB7A_disc12SP2_disc3 and RXRA_disc25 some weakly) are largelycompatible with this role RAD21 is a highly conservedprotein involved in DNA double-strand repair (61) knownto co-localize with CTCF (62) Cohesin of which SMC3 isa subunit is brought to the chromatin by CTCF (63)Further although the function of the CTCF paralogCTCFL is not completely known it does appear to beinvolved in imprinting through interaction with ahistone methyltransferase (64)

Combinations of motifs

A few of the discovered motifs contain additional specifi-city or have distinct segments matching multiple motifsFor example EGR1_disc4 appears to be a combination ofmultiple motifs (EGR1 IKZF1 and a homeobox motif)and SETDB1_disc1 contains the ZNF143 core sequencewith significant additional specificity The appearance ofthese motifs suggests highly specific lsquogrammarsrsquo for thesemotifs that may require specific spacing and orientation ofbinding sites for functionality

We find several additional enrichments of potentialinterest PBX3_disc2 matches the known MEIS1 motifconsistent with the known cooperative binding ofMEIS1 and PBX (65) TAL1_disc1 matches GATAwith the potential connection that GATA and TAL1 areknown to be important in hematopoesis and vascular de-velopment (6667) HSF_disc1 matches the known CEBPmotif and has much higher enrichment in HSF data sets(31-fold) compared with the known motifs for HSF (lt9-fold) Additionally EGR1_disc5 HNF4_disc5NRF1_disc3 PAX5_disc2 RXRA_disc4PAX5_disc3and SREBP_disc1 match the known motifs for ZICSOX SP1 PAX2PAX3 IRF and RFX5 respectivelysuggesting additional previously uncharacterized inter-actions Lastly we find some motifs that show more am-biguous matches SMARC_disc2 shows weak similarity tohomeobox TGTAGT motif NR2C2_disc2ndash3 weaklymatches the known HNF4 motif and EGR1_disc3SETDB1_disc2 matches the repetitive NRF1 motif

General factors enriched in cell line-specific key regulators

Factors directly responsible for the establishment of en-hancers chromatin restructuring or polymerase recruit-ment frequently exhibit binding that is highly cell typespecific Because most of these factors do not have theirown sequence specificity their binding is often correlatedwith that of regulators important for the specific cell lineWe analyze several such factors (BCL BDP1 CCNT2EP300 FOXA HDAC2 HMGN3 TATA TCF12 andTRIM28) and find that key cell line regulators can be

identified by examining enrichments in cell lines-specificdata setsAs a transcriptional coactivator EP300 interacts with

numerous TFs [reviewed in (68)] and has been shown tohave binding that can identify tissue-specific enhancers(69) Conversely FOXA has a DNA binding domainand plays an important role in liver development andfunction (70) and is a pioneer factor responsible forpriming chromatin for the binding of other factors(reviewed in (71)] Other proteins involved in chromatinrestructuring include HDAC2 which transcriptionallyrepresses through histone deacetylation (72) andHMGN3 (73) Further two factor groups are directlyinvolved in transcription including three RNA Pol3subunits (BDP1 RPC155 and TFIIIC-110) and CCNT2which is involved in the elongation of Pol2 (74)Eight of these ten factor groups have at least one data

set in K562 (erythroleukemia cells) and for four of thesewe discover motifs that match the GATA consensuswhich is then enriched specifically in the K562 data sets(BCL_disc5 CCNT2_disc1 HDAC2_disc1 andHMGN3_disc2) GATA has a known important role inK562 (75) and we also have previously found an associ-ation with GATA motifs and chromatin state-derived en-hancers for K562 cells (76) We also find three additionalmotifs that have enrichment specific to the factor grouprsquosK562 data set BDP1_disc1 a 23-nt motif that containsthe STAT consensus HMGN3_disc1 which matches theTRE and TRIM28_disc2 which matches no known motifand may be associated with an uncharacterized regulatoractive in this cell lineLikewise for GM12878 an EBV-mediated

lymphoblastoid cell line we find three discovered motifs(BCL_disc4 EP300_disc5 and TCF12_disc4) that matchthe known IRF consensus IRF4 has been shown to beimportant in the establishment of these cell lines (77) andthe family is an important player in immune cells (78)This enrichment is also consistent with our previousstudy using epigenetic marks (76) where we found IRFto be the strongest enriched motif in GM12878-specificenhancers We also find GM12878-specific enrichmentfor motifs matching NFKB (BCL_disc6) and POU2F2(TATA_disc9) consistent with the known biology ofthese factors (7980)The motifs we find specifically enriched in HepG2 (liver

carcinoma) data sets match the known motifs for FOXA(EP300_disc3 HDAC2_disc2 and TCF12_disc2) HNF4(FOXA_disc5 and HDAC2_disc5) and CEBP(EP300_disc26) three key liver regulators (7081) Wefind motifs with enrichments specific to H1-hESC whichinclude matches to the pluripotency factor POU2F2(TATA_disc9) the near universally expressed repressorREST (BCL_disc3 and HDAC2_disc4) and key metabolicregulator NRF1 (HDAC2_disc4) We find additional cellline-specific enrichments for FOXA_disc3 (TCF12) inECC-1 FOXA_disc4 (STAT) in both T-47D and ECC-1and EP300_disc26 (CEBP) and EP300_disc4 (ETS) withenrichment in the HeLa-S3 data setEven for these factors we find motifs that are consist-

ently enriched across assayed cell lines for a given factorFOXA_disc1 for example matches the known FOXA

Nucleic Acids Research 2013 7

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 8: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

motif indicating that FOXArsquos own motif also plays animportant role in its specificity Most of the motifs weidentify for RNA Pol2 machinery (TAF1 GTF2BGTF2F1 and TBP) are enriched in all cell lines includingthe known TATAAA motif (TATA_known2) AlsoTATA_disc1 disc6 and disc8 have consistent enrichmentand match the known motifs for YY1 (which is known tobe important in establishing transcription) (82) NFY andETS The top discovered motif BCL_disc1 matches theknown ETS motif and is also enriched across data setsInterestingly we find that the TRE motif is found and

enriched in a cell line-specific manner for several factorsbut for different cell lines For example HMGN3_disc1 isenriched in K562 BCL_disc2 has the highest enrichmentin GM12878 TRIM28_disc1 is only enriched in theHEK2932 and U2OS cell lines and EP300_disc7 has en-richment in the neuroblastoma cell line SK-N-SH-RA andHeLa-S3 This suggests that perhaps AP1 or other factorsrecognizing TRE are selectively interacting with theseproteins depending on the cell line

Novel motifs raise possibility of unknown regulators

Although we are able to putatively explain the majorityof the motifs we discover as either matches to previouslyknown motifs or low complexity sequences we do identify30 putative novel motifs (Figure 5) We placedthese into eight groups on the basis of their similarityNovel1 (BRCA1_disc1 CHD2_disc1 ETS_disc36NR3C1_disc3 and ZBTB33_disc1-4) Novel2 (EGR1_disc4 ETS_disc157 SETDB1_disc1 SIX5_disc1-3SMARC_disc2 and ZNF143_disc1-3) Novel3(SP2_disc3 TCF12_disc3 and ZBTB7A_disc2) Novel4(RFX5_disc3) Novel5 (BDP1_disc2) Novel6

(TATA_disc57) Novel7 (TRIM28_disc2) and Novel8(E2F_disc6)

Novel1 (using ZBTB33_disc1) is highly enriched in atleast one data set for each of the factor groups for whichit is found (BRCA1 CHD2 ETS NR3C1 and ZBTB33)All five factor groups except CHD2 have at least oneknown motif and for each of these data sets Novel1 ismore enriched in at least one data set than any knownmotif [the result for NR3C1 is questionable because onlyone data set has enrichment and that data set has beenindependently flagged as problematic see httpwwwencodeprojectorgencodequalityMetricshtml] Theshared role of BRCA1 and CHD2 in DNA damagerepair (8384) suggests that Novel1 may be involved inthis or other shared roles for these factors and highlightsthe utility in sharedmotif enrichment even outside ofmotifsdirectly tied to a factor

Similarly for SIX5 we see only weak enrichment of theknown SIX5 motif and fail to discover a motif similar toit However Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562 GM12878and H1-hESC) Novel2 also shows high enrichment indata sets for which it was not rediscovered includingATF3 (all data sets have gt20-fold enrichment withGM12878 having 106-fold) and NRF1 (all data setshave gt30-fold enrichment) Moreover the knownZNF143 motif which is 4-fold enriched in the oneZNF143 data set is also not recovered but Novel2 is24-fold enriched The breath of data sets sharing thismotif suggests it may be recognized by an important yetunknown or under-characterized regulator

Like the known ZBTB7A motif Novel3 (usingSP2_disc3) is largely poly-G which causes us to underesti-mate its enrichment due to our shuffling process Despite

Figure 5 Putative novel motifs We find eight motifs that are not represented in the literature motifs we collected three of which are found for atleast two factor groups These patterns may represent the binding specificity of the factors for which they are discovered or for other factors thatcooperate with them

8 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 9: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

this however it does show enrichment in several data setsincluding for the factor groups for which it was identifiedThis motif shows similarity to other poly-G motifs suchas known SP1 motifs but appears to be distinct due to itsother bases

Novel4 (RFX5_disc3) shows moderate but consistent(2- to 6-fold) enrichment across the RFX5 data sets Theconsensus is composed of two of the same components asthe known motifs (AAC and TGA) but ordered differ-ently Consequently it may represent the binding specifi-city of for example an alternative isoform of RFX5 Theremaining motifs (Novel5-8) were found for factors thatshow cell line-specific enrichments Consequently thesemay represent specificities for regulators that are previ-ously unidentified

Experimental and evolutionary validation of novel motifs

Following the motif discovery and selection of theseputative novel motifs a study released hundreds of newmotifs generated using high-throughput SELEX (16) Twoof the putative novel motifs described in this section matchmotifs generated by (16) Novel1 matches the motiffor ETV6 and Novel6 matches ZBED1 Although wehave incorporated these SELEX motifs into ourresource we continue to include Novel1 and Novel6 asputative novel motifs because they were identifiedwithout knowledge of these new specificities and therebystrengthen the evidence for the remaining novel motifs

Four of these putatively novel motif groups (Novel1ndash36) match motifs that were previously identified usingconservation signals across four mammals (85)(Supplementary Table S5) Therefore this studyprovides additional support for these conservation-basedmotifs and conversely the motifs identified here gaincomparative evidence The relatively few distinct novelpatterns that are found in this study and the comparativesupport for many of the few that are found suggests thatthere may be a limited number of human TF motifs withmany instances and which interact with one of the assayedfactors that remain unknown

DISCUSSION

In this article we provide a systematic and comprehensivecollection of motifs for hundreds of human TF bindingdata sets TF binding can be complex with a factorrecognizing several or motifs or binding in the apparentabsence of any motif [reviewed in (86)] We also show thatit is possible to identify cofactors that may be partiallyresponsible for binding or function

This motif resource has already been used in severalarticles while this article was in preparation demon-strating its value for high-throughput analyses Ourmotifs are being matched at low stringency to identifypeaks that are void of any motif to understand the mech-anism through which motif-less peaks are generated (8)The collection of known motifs and enrichment tech-niques we present here was also used as a secondaryvalidation of peaks (87) Because having the motifsallows for more precisely determining the bases

responsible for binding these motifs enable analyzesinvolving population data (88) and for interpretingGWAS data (89) Two other ENCODE articles alsoperform motif discovery (90) produce a non-redundantlist of discovered motifs but do not perform an extensiveanalysis of the relationships between factors and (91) useDNaseI footprinting data to identify relevant motifsHaving a motif catalog is also the first step in identify-

ing high-quality computational targets of factors whichmay allow the identification of binding sites that were forexample not found in the conditions assayed Twopopular strategies are used for this purpose One is usingclustering of motif instances for factors known to cooper-ate to form cis-regulatory modules (9293) This resourceis well suited for this purpose because it naturally providessets of motifs that are likely to cooperateA second approach is the use of conservation on many

closely related species (8594ndash97) This can be performedreadily on these motif instances because a dense tree ofmammalian species has been sequenced readily permittingtheir alignment and measuring selection of a near-nucleo-tide level Because changes in the underlying motifmatches are largely responsible for changes in bindingacross species (98) evolutionary-based approaches onthe motif instances may be a means to deal with thehigh rate of non-functional binding (99ndash101)

AVAILABILITY

A web interface along with data files and accompanyingsoftware is available at httpcompbiomiteduencode-motifs

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [102ndash110]

ACKNOWLEDGEMENTS

The authors thank Ewan Birney Christopher BristowLuke Ward Jason Ernst Anshul Kundaje Gerald Quonand other members of the Kellis Laboratory for helpfuldiscussions

FUNDING

National Institutes of Health (NIH) [HG004037HG007000 and HG006991] Funding for open accesscharge NIH [HG004037 HG007000 and HG006991]

Conflict of interest statement None declared

REFERENCES

1 SolomonMJ LarsenPL and VarshavskyA (1988) MappingproteinDNA interactions in vivo with formaldehyde evidence thathistone H4 is retained on a highly transcribed gene Cell 53937ndash947

2 RenB RobertF WyrickJJ AparicioO JenningsEGSimonI ZeitlingerJ SchreiberJ HannettN KaninE et al

Nucleic Acids Research 2013 9

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 10: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

(2000) Genome-wide location and function of DNA bindingproteins Science 290 2306ndash2309

3 IyerVR HorakCE ScafeCS BotsteinD SnyderM andBrownPO (2001) Genomic binding sites of the yeast cell-cycletranscription factors SBF and MBF Nature 409 533ndash538

4 RobertsonG HirstM BainbridgeM BilenkyM ZhaoYZengT EuskirchenG BernierB VarholR DelaneyA et al(2007) Genome-wide profiles of STAT1 DNA association usingchromatin immunoprecipitation and massively parallel sequencingNat Methods 4 651ndash657

5 QiY RolfeA MacIsaacKD GerberGK PokholokDZeitlingerJ DanfordT DowellRD FraenkelE JaakkolaTSet al (2006) High-resolution computational models of genomebinding events Nat Biotechnol 24 963ndash970

6 GuoY PapachristoudisG AltshulerRC GerberGKJaakkolaTS GiffordDK and MahonyS (2010) Discoveringhomotypic binding events at high spatial resolutionBioinformatics 26 3028ndash3034

7 LiXY MacArthurS BourgonR NixD PollardDAIyerVN HechmerA SimirenkoL StapletonM HendriksCLet al (2008) Transcription factors bind thousands of active andinactive regions in the Drosophila blastoderm PLoS Biol 6 e27

8 The ENCODE Project Consortium (2012) An integratedencyclopedia of DNA elements in the human genome Nature489 57ndash74

9 MoormanC SunLV WangJ deWitE TalhoutWWardLD GreilF LuX WhiteKP BussemakerHJ et al(2006) Hotspots of transcription factor colocalization in thegenome of Drosophila melanogaster Proc Natl Acad Sci USA103 12027ndash12032

10 GersteinMB KundajeA HariharanM LandtSG YanK-K ChengC MuXJ KhuranaE RozowskyJ AlexanderRet al (2012) Architecture of the human regulatory networkderived from ENCODE data Nature 489 91ndash100

11 MatysV FrickeE GeffersR GosslingE HaubrockMHehlR HornischerK KarasD KelAE Kel-MargoulisOVet al (2003) TRANSFAC(R) transcriptional regulation frompatterns to profiles Nucleic Acids Res 31 374ndash378

12 SandelinA AlkemaW EngstrmP WassermanWW andLenhardB (2004) JASPAR an open-access database foreukaryotic transcription factor binding profiles Nucleic AcidsRes 32 D91ndashD94

13 BergerMF PhilippakisAA QureshiAM HeFS EstepPWand BulykML (2006) Compact universal DNA microarrays tocomprehensively determine transcription-factor binding sitespecificities Nat Biotechnol 24 1429ndash1435

14 BadisG BergerMF PhilippakisAA TalukderSGehrkeAR JaegerSA ChanET MetzlerG VedenkoAChenX et al (2009) Diversity and complexity in DNArecognition by transcription factors Science 324 1720ndash1723

15 BergerMF BadisG GehrkeAR TalukderSPhilippakisAA Pea-CastilloL AlleyneTM MnaimnehSBotvinnikOB ChanET et al (2008) Variation inhomeodomain DNA binding revealed by high-resolution analysisof sequence preferences Cell 133 1266ndash1276

16 JolmaA YanJ WhitingtonT ToivonenJ NittaKRRastasP MorgunovaE EngeM TaipaleM WeiG et al(2013) DNA-binding specificities of human transcription factorsCell 152 327ndash339

17 HughesJD EstepPW TavazoieS and ChurchGM (2000)Computational identification of Cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

18 LiuXS BrutlagDL and LiuJS (2002) An algorithm forfinding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments Nat Biotechnol 20835ndash839

19 BaileyTL and ElkanC (1994) Fitting a mixture model byexpectation maximization to discover motifs in biopolymersProc Int Conf Int Syst Mol Biol 2 28ndash36

20 PavesiG MauriG and PesoleG (2001) An algorithm forfinding signals of unknown length in DNA sequencesBioinformatics 17 S207ndashS214

21 EttwillerL PatenB RamialisonM BirneyE and WittbrodtJ(2007) Trawler de novo regulatory motif discovery pipeline forchromatin immunoprecipitation Nat Methods 4 563ndash565

22 CheD JensenS CaiL and LiuJS (2005) BEST binding-siteestimation suite of tools Bioinformatics 21 2909ndash2911

23 RomerKA KayombyaG and FraenkelE (2007) WebMOTIFSautomated discovery filtering and scoring of DNA sequencemotifs using multiple programs and Bayesian approaches NucleicAcids Res 35 W217ndashW220

24 SunH YuanY WuY LiuH LiuJS and XieH (2010)Tmod toolbox of motif discovery Bioinformatics 26 405ndash407

25 CrooksGE HonG ChandoniaJ and BrennerSE (2004)WebLogo a sequence logo generator Genome Res 141188ndash1190

26 Bar-JosephZ GiffordDK and JaakkolaTS (2001) Fastoptimal leaf ordering for hierarchical clustering Bioinformatics17 S22ndashS29

27 BairochA (2004) The Universal Protein Resource (UniProt)Nucleic Acids Res 33 D154ndashD159

28 PruittKD and MaglottDR (2001) RefSeq and LocusLinkNCBI gene-centered resources Nucleic Acids Res 29 137ndash140

29 MaglottD OstellJ PruittKD and TatusovaT (2007) Entrezgene gene-centered information at NCBI Nucleic Acids Res 35D26ndashD31

30 FrietzeS LanX JinVX and FarnhamPJ (2010) Genomictargets of the KRAB and SCAN domain-containing zinc fingerprotein 263 J Biol Chem 285 1393ndash1403

31 KarinM LiuZg and ZandiE (1997) AP-1 function andregulation Curr Opin Cell Biol 9 240ndash246

32 KawanaM LeeME QuertermousEE and QuertermousT(1995) Cooperative interaction of GATA-2 and AP1 regulatestranscription of the endothelin-1 gene Mol Cell Biol 154225ndash4231

33 WangW XueY ZhouS KuoA CairnsBR andCrabtreeGR (1996) Diversity and specialization of mammalianSWISNF complexes Genes Dev 10 2117ndash2130

34 ItoT YamauchiM NishinaM YamamichiN MizutaniTUiM MurakamiM and IbaH (2001) Identification ofSWISNF complex subunit BAF60a as a determinant of thetransactivation potential of FosJun dimers J Biol Chem 2762852ndash2857

35 NateriAS Spencer-DeneB and BehrensA (2005) Interaction ofphosphorylated c-Jun with TCF4 regulates intestinal cancerdevelopment Nature 437 281ndash285

36 MostoslavskyR ChuaKF LombardDB PangWWFischerMR GellonL LiuP MostoslavskyG FrancoSMurphyMM et al (2006) Genomic instability and aging-likephenotype in the absence of mammalian SIRT6 Cell 124315ndash329

37 HuangY MyersSJ and DingledineR (1999) Transcriptionalrepression by REST recruitment of Sin3A and histonedeacetylase to neuronal genes Nat Neurosci 2 867ndash872

38 NascimentoEM CoxCL MacarthurS HussainSTrotterM BlancoS SurajM NicholsJ KblerB BenitahSAet al (2011) The opposing transcriptional functions of Sin3a andc-Myc are required to maintain tissue homeostasis Nat CellBiol 13 1395ndash1405

39 ZervosAS GyurisJ and BrentR (1993) Mxi1 a protein thatspecifically interacts with Max to bind Myc-Max recognition sitesCell 72 223ndash232

40 Li-WeberM DavydovI KrafftH and KrammerP (1994) Therole of NF-Y and IRF-2 in the regulation of human IL-4 geneexpression J Immunol 153 4122ndash4133

41 ScottE SimonM AnastasiJ and SinghH (1994) Requirementof transcription factor PU1 in the development of multiplehematopoietic lineages Science 265 1573ndash1577

42 VillardJ PerettiM MasternakK BarrasE CarettiGMantovaniR and ReithW (2000) A functionally essentialdomain of RFX5 mediates activation of major histocompatibilitycomplex class II promoters by promoting cooperative bindingbetween RFX and NF-Y Mol Cell Biol 20 3364ndash3376

43 YuL WuQ YangCP and HorwitzSB (1995) Coordinationof transcription factors NF-Y and CEBP beta in the regulationof the mdr1b promoter Cell Growth Differ 6 1505ndash1512

10 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 11: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

44 RoderK WolfS LarkinK and SchweizerM (1999) Interactionbetween the two ubiquitously expressed transcription factorsNF-Y and Sp1 Gene 234 61ndash69

45 CarettiG SalsiV VecchiC ImbrianoC and MantovaniR(2003) Dynamic recruitment of NF-Y and histoneacetyltransferases on cell-cycle promoters J Biol Chem 27830435ndash30440

46 IvanovVN BhoumikA KrasilnikovM RazROwen-SchaubLB LevyD HorvathCM and RonaiZ (2001)Cooperation between STAT3 and c-jun suppresses fastranscription Mol Cell 7 517ndash528

47 ChoiS ChoY KimH and ParkJ (2007) ROS mediate thehypoxic repression of the hepcidin gene by inhibiting CEBPalphaand STAT-3 Biochem Biophys Res Commun 356 312ndash317

48 SementchenkoVI and WatsonDK (2000) Ets target genespast present and future Oncogene 19 6533ndash6548

49 RothbcherU BertrandV LamyC and LemaireP (2007) Acombinatorial code of maternal GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidianectoderm Development 134 4023ndash4032

50 TaylorJM Dupont-VersteegdenEE DaviesJD HassellJAHoulJD GurleyCM and PetersonCA (1997) A role for theETS domain transcription factor PEA3 in myogenicdifferentiation Mol Cell Biol 17 5550ndash5558

51 OrsquoGeenH LinY XuX EchipareL KomashkoVM HeDFrietzeS TanabeO ShiL SartorMA et al (2010) Genome-wide binding of the orphan nuclear receptor TR4 suggests itsgeneral role in fundamental biological processes BMC Genomics11 689

52 AdamsB DrflerP AguzziA KozmikZ UrbnekPMaurer-FogyI and BusslingerM (1992) Pax-5 encodes thetranscription factor BSAP and is expressed in B lymphocytes thedeveloping CNS and adult testis Genes Dev 6 1589ndash1607

53 FitzsimmonsD HodsdonW WheatW MairaSM WasylykBand HagmanJ (1996) Pax-5 (BSAP) recruits Ets proto-oncogenefamily proteins to form functional ternary complexes on a B-cell-specific promoter Genes Dev 10 2198ndash2211

54 DudekH TantravahiRV RaoVN ReddyES andReddyEP (1992) Myb and Ets proteins cooperate intranscriptional activation of the mim-1 promoter Proc NatlAcad Sci USA 89 1291ndash1295

55 MazarsR Gonzalez-de-PeredoA CayrolC LavigneAVogelJL OrtegaN LacroixC GautierV HuetG RayAet al (2010) The THAP-zinc finger protein THAP1 associateswith coactivator HCF-1 and O-GlcNAc transferase a linkbetween DYT6 and DYT3 dystonias J Biol Chem 28513364ndash13371

56 YuH MashtalirN DaouS Hammond-MartelI RossJSuiG HartGW RauscherFJR DrobetskyE MilotE et al(2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternarycomplex with YY1 and HCF-1 and is a critical regulator of geneexpression Mol Cell Biol 30 5071ndash5085

57 LooijengaLH StoopH deLeeuwHP deGouveia BrazaoCAGillisAJ vanRoozendaalKE vanZoelenEJ WeberRFWolffenbuttelKP vanDekkenH et al (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human germ celltumors Cancer Res 63 2244ndash2250

58 LohY WuQ ChewJ VegaVB ZhangW ChenXBourqueG GeorgeJ LeongB LiuJ et al (2006) The Oct4and Nanog transcription network regulates pluripotency in mouseembryonic stem cells Nat Genet 38 431ndash440

59 YiF and MerrillBJ (2007) Stem cells and TCF proteins a rolefor beta-catenin-independent functions Stem Cell Rev 3 39ndash48

60 PhillipsJE and CorcesVG (2009) CTCF master weaver of thegenome Cell 137 1194ndash1211

61 McKayMJ TroelstraC vanderP KanaarR SmitBHagemeijerA BootsmaD and HoeijmakersJH (1996)Sequence conservation of therad21 SchizosaccharomycespombeDNA double-strand break repair gene in human andmouse Genomics 36 305ndash315

62 WendtKS YoshidaK ItohT BandoM KochBSchirghuberE TsutsumiS NagaeG IshiharaK MishiroTet al (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor Nature 451 796ndash801

63 RubioED ReissDJ WelcshPL DistecheCMFilippovaGN BaligaNS AebersoldR RanishJA andKrummA (2008) CTCF physically links cohesin to chromatinProc Natl Acad Sci USA 105 8309ndash8314

64 JelinicP StehleJ and ShawP (2006) The testis-specificfactor CTCFL cooperates with the protein methyltransferasePRMT7 in H19 imprinting control region methylationPLoS Biol 4 e355

65 BischofLJ KagawaN MoskowJJ TakahashiYIwamatsuA BuchbergAM and WatermanMR (1998)Members of the Meis1 and Pbx homeodomain protein familiescooperatively bind a cAMP-responsive sequence (CRS1) fromBovineCYP17 J Biol Chem 273 7941ndash7948

66 KappelA SchlaegerTM FlammeI OrkinSH RisauW andBreierG (2000) Role of SCLTal-1 GATA and ets transcriptionfactor binding sites for the regulation of flk-1 expression duringmurine vascular development Blood 96 3078ndash3085

67 MouthonMA BernardO MitjavilaMT RomeoPHVainchenkerW and Mathieu-MahulD (1993) Expression of tal-1and GATA-binding proteins during human hematopoiesis Blood81 647ndash655

68 ChanHM and La ThangueNB (2001) p300CBP proteinsHATs for transcriptional bridges and scaffolds J Cell Sci 1142363ndash2373

69 ViselA BlowMJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2009)ChIP-seq accurately predicts tissue-specific activity of enhancersNature 457 854ndash858

70 CostaRH KalinichenkoVV HoltermanAL and WangX(2003) Transcription factors in liver development differentiationand regeneration Hepatology 38 1331ndash1347

71 ZaretKS and CarrollJS (2011) Pioneer transcription factorsestablishing competence for gene expression Genes Dev 252227ndash2241

72 JohnsonCA and TurnerBM (1999) Histone deacetylasescomplex transducers of nuclear signals Semin Cell Dev Biol 10179ndash188

73 FurusawaT and CherukuriS (2010) Developmental function ofHMGN proteins Biochim Biophys Acta 1799 69ndash73

74 PengJ ZhuY MiltonJT and PriceDH (1998) Identificationof multiple cyclin subunits of human P-TEFb Genes Dev 12755ndash762

75 PartingtonGA and PatientRK (1999) Phosphorylation ofGATA-1 increases its DNA-binding affinity and is correlated withinduction of human K562 erythroleukaemia cells Nucleic AcidsRes 27 1168ndash1175

76 ErnstJ KheradpourP MikkelsenTS ShoreshN WardLDEpsteinCB ZhangX WangL IssnerR CoyneM et al(2011) Mapping and analysis of chromatin state dynamics in ninehuman cell types Nature 473 43ndash49

77 XuD ZhaoL Del ValleL MiklossyJ and ZhangL (2008)Interferon regulatory factor 4 is involved in Epstein-Barr virus-mediated transformation of human B lymphocytes J Virol 826251ndash6258

78 PaunA and PithaPM (2007) The IRF family revisitedBiochimie 89 744ndash753

79 CorcoranLM KarvelasM NossalGJ YeZS JacksT andBaltimoreD (1993) Oct-2 although not required for early B-celldevelopment is critical for later B-cell maturation and forpostnatal survival Genes Dev 7 570ndash582

80 BaeuerlePA and HenkelT (1994) Function and activation ofNF-kappa B in the immune system Annu Rev Immunol 12141ndash179

81 LeeCS FriedmanJR FulmerJT and KaestnerKH (2005)The initiation of liver development is dependent on Foxatranscription factors Nature 435 944ndash947

82 SetoE ShiY and ShenkT (1991) YY1 is an initiator sequence-binding protein that directs and activates transcription in vitroNature 354 241ndash245

83 NagarajanP OnamiTM RajagopalanS KaniaS DonnellRand VenkatachalamS (2009) Role of chromodomain helicaseDNA-binding protein 2 in DNA damage response signaling andtumorigenesis Oncogene 28 1053ndash1062

Nucleic Acids Research 2013 11

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 12: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

84 DengC (2003) Roles of BRCA1 in DNA damage repair a linkbetween development and cancer Hum Mol Genet 12113Rndash123R

85 XieX LuJ KulbokasEJ GolubTR MoothaV Lindblad-TohK LanderES and KellisM (2005) Systematic discovery ofregulatory motifs in human promoters and 3[prime] UTRs bycomparison of several mammals Nature 434 338ndash345

86 FarnhamPJ (2009) Insights from genomic profiling oftranscription factors Nat Rev Genet 10 605ndash616

87 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of the ENCODEand modENCODE consortia Genome Res 22 1813ndash1831

88 SpivakovM AkhtarJ KheradpourP BealK GirardotCKoscielnyG HerreroJ KellisM FurlongEE and BirneyE(2012) Analysis of variation at transcription factor binding sitesin Drosophila and humans Genome Biol 13 R49

89 WardLD and KellisM (2011) HaploReg a resource forexploring chromatin states conservation and regulatory motifalterations within sets of genetically linked variants Nucleic AcidsRes 40 D930ndashD934

90 WangJ ZhuangJ IyerS LinX WhitfieldTW GrevenMCPierceBG DongX KundajeA ChengY et al (2012)Sequence features and chromatin structure around the genomicregions bound by 119 human transcription factors Genome Res22 1798ndash1812

91 NephS VierstraJ StergachisAB ReynoldsAP HaugenEVernotB ThurmanRE JohnS SandstromR JohnsonAKet al (2012) An expansive human regulatory lexicon encoded intranscription factor footprints Nature 489 83ndash90

92 BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploitingtranscription factor binding site clustering to identifycis-regulatory modules involved in pattern formation in theDrosophila genome Proc Natl Acad Sci USA 99 757ndash762

93 SchroederMD PearceM FakJ FanH UnnerstallUEmberlyE RajewskyN SiggiaED and GaulU (2004)Transcriptional control in the segmentation gene network ofDrosophila PLoS Biol 2 e271

94 KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

95 MosesA ChiangD PollardD IyerV and EisenM (2004)MONKEY identifying conserved transcription-factor bindingsites in multiple alignments using a binding site-specificevolutionary model Genome Biol 5 R98

96 KheradpourP StarkA RoyS and KellisM (2007) Reliableprediction of regulator targets using 12 Drosophila genomesGenome Res 17 1919ndash1931

97 Lindblad-TohK GarberM ZukO LinMF ParkerBJWashietlS KheradpourP ErnstJ JordanG MauceliE et al(2011) A high-resolution map of human evolutionary constraintusing 29 mammals Nature 478 476ndash482

98 SchmidtD WilsonMD BallesterB SchwaliePCBrownGD MarshallA KutterC WattSMartinez-JimenezCP et al (2010) Five-vertebrate ChIP-seqReveals the evolutionary dynamics of transcription factorbinding Science 328 1036ndash1040

99 BoyerLA LeeTI ColeMF JohnstoneSE LevineSSZuckerJP GuentherMG KumarRM MurrayHLJennerRG et al (2005) Core transcriptional regulatory circuitryin human embryonic stem cells Cell 122 947ndash956

100 LeeTI JennerRG BoyerLA GuentherMG LevineSSKumarRM ChevalierB JohnstoneSE ColeMFIsonoKI et al (2006) Control of developmentalregulators by polycomb in human embryonic stem cells Cell125 301ndash313

101 MacArthurS LiX LiJ BrownJ ChuHC ZengLGrondonaB HechmerA SimirenkoL KeranenS et al(2009) Developmental roles of 21 Drosophila transcriptionfactors are determined by quantitative differences in binding toan overlapping set of thousands of genomic regions GenomeBiol 10 R80

102 PietrokovskiS (1996) Searching databases of conserved sequenceregions by aligning protein multiple-alignments Nucleic AcidsRes 24 3836ndash3845

103 GrayKA DaughertyLC GordonSM SealRLWrightMW and BrufordEA (2013) Genenamesorgthe HGNC resources in 2013 Nucleic Acids Res 41D545ndashD552

104 KharchenkoPV TolstorukovMY and ParkPJ (2008)Design and analysis of ChIP-seq experiments for DNA-bindingproteins Nat Biotech 26 1351ndash1359

105 KentWJ SugnetCW FureyTS RoskinKM PringleTHZahlerAM and HausslerD (2002) The human genomebrowser at UCSC Genome Res 12 996ndash1006

106 HarrowJ FrankishA GonzalezJM TapanariEDiekhansM KokocinskiF AkenBL BarrellD ZadissaASearleS et al (2012) GENCODE The reference human genomeannotation for The ENCODE Project Genome Res 221760ndash1774

107 TouzetH and VarreJ (2007) Efficient and accurate P-valuecomputation for position weight matrices Algorithms Mol Biol2 15

108 WilsonEB (1927) Probable Inference the Law of Successionand Statistical Inference J Am Stat Assoc 22 209ndash212

109 MahonyS AuronPE and BenosPV (2007) DNA familialbinding profiles made easy comparison of various motifalignment and clustering strategies PLoS Comput Biol 3 e61

110 SandelinA and WassermanWW (2004) Constrainedbinding site diversity within families of transcription factorsenhances pattern discovery bioinformatics J Mol Biol 338207ndash215

12 Nucleic Acids Research 2013

at MIT

Libraries on D

ecember 14 2013

httpnaroxfordjournalsorgD

ownloaded from

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 13: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearsonrsquos correlation of the PWM made into a 4xN vector

across all offsets and both orientations padding unmatched positions with Nrsquos (102) Motifs are consid-

ered a match if they have a similarity of at least 075 weak matches or variants are determined through

manual inspection

Collecting known motifs

Known motifs were collected from large scale data sets or databases We collected human mouse

and rat motifs from Transfac (version 113) (11) vertebrate motifs from Jaspar (version 2008) (12)

and large scale systematic motifs generated using Protein Binding Microarrays (13 14 15) and high-

throughput SELEX (16) HGNC gene names (103) were used to describe the names of motifs whenever

possible

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2Pol3) were taken from the January 2011 freeze of EN-

CODE (8) The processing for these data sets is described in (104 10) and the original data sets are

available on the website accompanying this paper To avoid potentially confounding issues we excluded

from our analysis (1) the Y and mitochondrial chromosomes (2) the hg19 rmsk and simple repeat

tracks from UCSC (created on April 27 2009) (105) and (3) protein coding regions exons for non-

coding genes and 3rsquo untranslated regions taken from GENCODE v4 (106) Like with the motif names

HGNC gene names (103) are used to describe the data sets

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en-

richment and limit over-fitting The top 250 peaks for the first partition were used in motif discovery

because high intensity peaks often have better enrichment for motifs (101) Five tools were run inde-

pendently run on each data set AlignACE (v40 with default parameters) (17) MDscan (v2004 with

default parameters) (18) MEME (v357 with a maximum of 10 iterations and -maxw 10 15 and 25

for the three motifs) (19) Weeder (v142 with option large) (20) and Trawler (v12 with 250 random

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 14: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

intergenic blocks for background) (21) Any motifs beyond the top three for any method on one data

set were discarded

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4minus8

as determined by TFM-Pvalue (107) with the intent to imitate the frequency a fully specified 8-mer

would match the genome

Enrichments are always computed by taking a foreground region (ie bound regions for a data set)

and comparing it to a background region (eg Intergenic non-repetitive regions) where regions in the

foreground but not in the background are discarded For a given motif the fraction of matches that fall

within the foreground is computed and divided by the corresponding fraction for shuffled control motifs

(control motif generation details in (97)) To prevent spuriously high enrichments due to small counts

we compute a binomial confidence interval (with z=15) (108) around both motif and control fractions

and take the extreme which leads to the enrichment closest to 1 (the software available on the website

implements this enrichment metric)

For each factor group we order the motifs from all discovery programs by the enrichment in the

second random partition for the discovery data set For this step only we restrict our analysis to 10

of the background regions to reduce the amount of computation We then select discovered motifs

for each factor by this rank order discarding any motif that matches a previous one with similarity

greater than 075 We supplement these discovered motifs with the known motifs from the literature

(described above) and rematch the motifs to the complete background regions and produce comparable

enrichments for all data sets

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance

(109 110) this happens infrequently for the 075 similarity threshold with our database of motifs

When two motifs in the database match scrambling one of the motifs 100 times leads to more than 5

matches only 64 of the time

Moreover a given motif in the database is 347 times more likely to match another motif in a

database than a scramble While this statistic is hard to interpret due to on one hand the significant

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 15: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

redundancy in the database and on the other the partial exclusion of similar discovered motifs it

demonstrates that spurious motif matches should be relatively uncommon

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101) we find that our results

are largely robust to restricting to only the 1000 strongest peaks per data set (89 of data sets have at

least 1000 peaks) The Pearson correlation of motif enrichment (for all motifs) for one data set when

using all peaks or only the top 1000 has a median of 089 (middle 50 085 to 095)

Moreover the enrichment of the first discovered motif for a factor group has strong correlation

between the complete and restrict data sets for the 55 factor groups with at least two data sets and a

discovered motif the median correlation of the first discovered motifrsquos enrichment is 097 (middle 50

081 to 100)

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 16: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

00

10

20

bits

1

CG

2

ACG

3

T

A

CG

4

T

C

A

G

5

CAG

6

G7

GC

8

T

A

C

G

9

CG

10

A

CG

AP1 disc10

00

10

20

bits

1

T

C

GA

2

T

G

CA

3

A

C4

A

T

GC

5

A

GC

6

A

T

CG

7

T

ACG

8

A

T

G

C

9

T

A

GC

ATF3 disc3

00

10

20

bits

1

A

CG

2

ACG

3

CTG

4

A

GC

5

CAG

6

GC

7

CTG

8

TCG

9

T

CG

10

ACG

ATF3 disc4

00

10

20

bits

1

CG

2 3 4

CA

G

5

TGA

6

AG

7

G8

GC

9

G

C

AT

10

CG

11

GAT

12

G13

AG

14

CG

BDP1 disc3

00

10

20

bits

1

GC

2 3

TCG

4

GC

5 6

CG

7

AGC

8 9

CG

10 11

GC

12 13

CG

14 15

A

T

CG

16 17

GC

BHLHE40 disc2

00

10

20

bits

1

A2

A3

A4

TCG

5

ACG

6

C7

G8

GC

CHD2 disc2

00

10

20

bits

1

ACG

2

A

TCG

3

A

T

CG

4

CG

5

AGC

6

CG

7

T

CGA

8

G9

ACG

10 11

CG

CHD2 disc3

00

10

20

bits

1

A

TGC

2

TA

GC

3

T

CG

4

TG

C5

A

CTG

6

T

G

C7

T

GA

C8

A

T

GC

9

A

T

GC

10

A

T

GC

E2F disc7

00

10

20

bits

1

A

T

CG

2

A

T

GC

3

CG

4

T

A

CG

5

T

A

GC

6

A

TCG

7

GC

8

T

G

C9

T

10

CG

11

A

GC

E2F disc8

00

10

20

bits

1

GC

2

GTC

3

G

A

T

C

4 5

ATG

6

G7

G8

ACG

9

CA

10

T

A

C

G

11

GC

12

C

G

A

T

13

CG

EBF1 disc2

00

10

20

bits

1

ACG

2

GC

3

AGC

4

CG

5

G6

GCA

7

CGA

8

CG

9

A

TG

C

10

T

A

CG

ELF1 disc2

00

10

20

bits

1

CG

2

A

GC

3 4

CG

5 6

T

A

C

G

7

CG

8

A

T

GC

9

GC

10

T

A

CG

11

CG

12 13

A

CG

ELF1 disc3

00

10

20

bits

1

GC

2

CA

3

C

TAG

4

G5

CG

6 7

GC

8

TA

9

ACG

10

A

11

CG

12

T

A

G

C

13

C

T

A

G

14

CG

ESRRA disc4

00

10

20

bits

1

T

GC

2

T

C3

TGC

4

A

T

GC

5

T

ACG

6

TA

GC

7

AGC

8

TAGC

9 10

T

GC

11

TA

GC

EGR1 disc6

00

10

20

bits

1

A

CG

2 3

T

ACG

4

T

A

CG

5

ACG

6

ACG

7

A

CG

8

AGC

9

ACG

10

C

G11 12

A

CG

ETS disc9

00

10

20

bits

1

A2

G3

TGC

4

C5

A6

A7

GA

8

C9

CGATA disc5

00

10

20

bits

1

GC

2

C

A

T

3

TGC

4 5

GC

6

G

C

AT

7

CG

8

GCT

9

GTC

10

C11

ATC

12

TGC

13 14

GC

NR3C1 disc6

00

10

20

bits

1

C

TGA

2

TGA

3

TAG

4

T

GA

5

GA

6

C

TA

7

C

GA

8

TAG

9

CGA

10

GCA

11

T

GA

12

CTGA

13

TGA

14

T

CA

15

G

TA

HDAC2 disc6

00

10

20

bits

1

T

CG

2

GC

3 4

GC

5

T

A

GC

6

T

CG

7

GC

8 9

CG

10 11 12

T

CG

13

GC

14 15

CG

16

A

T

GC

HEY1 disc2

00

10

20

bits

1

CG

2

CGA

3

GC

4

CG

5

G

C

A

T

6

C

G7

ACG

8

T

A

CG

9

T

GC

10

A

CG

MYC disc9

00

10

20

bits

1

CG

2

G

C

T

A

3

G4

CG

5

AGC

6

AC

G

7

T

ACG

8

A

T

CG

9

C

T

AG

10

CG

11

ACG

MYC disc10

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

T

ACG

2

A

GC

3

C

T

G

A

4

T

GC

5

A

T

GC

6 7 8

G9

A

CG

10

T

G

C

A

11

GC

12

T

CGA

13

C

G14

T

A

GC

REST disc4

00

10

20

bits

1

GC

2

T

G

CA

3

CG

4

GC

5

CG

T

A

6

GC

7

GT

C8

G

T

A

C

9

A

T

GC

10

T

A

CG

11

CG

REST disc5

00

10

20

bits

1

AC

G2

A

GC

3

T

A

CG

4

T

AC

G5

TA

CG

6

A

GC

7

A

TCG

8

T

GC

9

GC

10

ATC

G

EP300 disc8

00

10

20

bits

1

T

CG

2

GC

3 4

CG

5

A

GC

6 7 8

GC

9 10

T

CG

11 12

CG

13 14

T

A

CG

15

T

GC

16 17

CG

EP300 disc9

00

10

20

bits

1

T

G2

G

AC

3

A

TC

G4

AGC

5

ACG

6

T

A

GC

7

A

TCG

8

GT

C9

C

T

G10

T

CPAX5 disc5

00

10

20

bits

1

TGA

2

G3

CAG

4

T

C

G

A

5

CAG

6 7

TAG

8

G9

CG

10

GAC

11

C

G

T

A

12

CG

13

C

G

A

T

14

CG

SPI1 disc3

00

10

20

bits

1

CG

2

AGC

3

G

T

A

4

G5

CG

6 7

CG

8 9

ACG

10

T

C

A

G

11

CG

12

T

A

C

G

13

ACG

POU2F2 disc2

00

10

20

bits

1

TGC

2

TA

G

3

GC

4

GA

C

T

5

TCG

6

C7

C8

TAC

9

AC

T

10

GC

11

A

G

C

T

12

CA

T

G

13

GC

RAD21 disc5

00

10

20

bits

1

CG

2 3

GC

4

G

T

A

C

5 6

GC

7 8

TC

G

A

9

CG

10

TA

11

TG

12

CG

13

A

CG

14

CT

A

15

CAG

16

T

CG

17

A

T

G

C

RAD21 disc6

00

10

20

bits

1

T

A

C

G

2

A

T

G

C

3

AT

G

4

GC

5

GC

6

T

C

A

7

TCG

8

GC

9

CT

10 11

GC

12

C

G

AT

13

G14

CG

15

GC

RAD21 disc7

00

10

20

bits

1

T

A

CG

2

GC

3

G

C

A

T

4

G5

GC

6

TGC

7

G

T

AC

8 9

GC

10

CA

T

11

T

ACG

12

CG

13

G

C

A

T

14

CG

RAD21 disc8

00

10

20

bits

1

CG

2

GC

3

ATG

C4

ATCG

5

TCG

6

GC

7

GC

8

ATGC

9

GC

10 11

GC

SRF disc2

00

10

20

bits

1

AT

2

T3

C4

TC

5

T6

CG

7

GC

8

C9

TSTAT disc6

00

10

20

bits

1

GTC

2

G

A

C

T

3

C4

G

CAT

5

GTC

6

TC

7

C8

ACT

9

A

G

CT

10

T

11

GC

12

TC

13

CSTAT disc7

00

10

20

bits

1

GC

2

GC

3 4

G5

ACG

6 7

CG

8

GC

9

CGT

10

CG

SIN3A disc5

00

10

20

bits

1

CG

2

GC

3

AT

G

4

CG

5

GC

6

CG

A

T

7

CG

8

A

9

CG

10

AGC

11

A

T

C

G

12

CG

SIN3A disc6

00

10

20

bits

1

TACG

2

T

A

CG

3 4

AT

CG

5

G

C6

A

CGT

7

CG

8

AC

G

T

9

T

GC

10

T

AGC

11 12 13

ATCG

14

AT

CG

SIN3A disc7

00

10

20

bits

1

T

CG

2

T

GC

3

GC

4

T

A

CG

5

GC

6

C

7

CG

8

TAGC

9

A

T

CG

10

CG

TATA disc10

00

10

20

bits

1

TAG

2

G3

T

C

AG

4

ACG

5

GAC

6

C

T

G

A

7

CG

8

GC

9

GAT

10

G11

CG

TCF12 disc5

00

10

20

bits

1

CG

2

AC

3

C4

C

G

AT

5

G6

G7

CA

8

A9

TTCF12 disc6

00

10

20

bits

1

A

T

GC

2

A

T

CG

3

A

T

GC

4

T

A

GC

5

A

TCG

6

GC

7

T

AGC

8

A

TCG

9

G

C10

TGC

YY1 disc3

00

10

20

bits

1

CG

2

A

GC

3 4

T

A

CG

5 6 7 8

T

CG

9

T

GC

10

T

A

CG

11

GC

12 13

CG

14

GC

15

GC

YY1 disc4

00

10

20

bits

1

GC

2 3

TGC

4

A

TGC

5

T

GC

6

GC

7 8

T

CG

9

GC

10

A

T

GC

11

TCG

12

GC

YY1 disc5

00

10

20

bits

1

T

CG

2

GC

3

T

ACG

4

ACG

5

G6

AGC

7

CG

8

TGC

9 10

CG

11

T

A

CG

ZNF143 disc4

Table S1 Low complexity or weakly enriched motifs

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 17: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

00

10

20

bits

1

A2

C3

C4

A5

G6

AG

7

TG

8

G9

G10

C11

AG

CTCF disc2

00

10

20

bits

1

A2

C3

T4

A5

G6

A7

TG

8

TC

G9

T

CAG

CTCF disc3

00

10

20

bits

1

A

T

C2

T

G

CA

3

ATGC

4

G

A

C

T

5

CGTA

6

C

TAG

7

T

GA

8

GT

9

T

G10

T

AG

11

C12

CTA

13

T

AG

14

TG

C

CTCF disc4

00

10

20

bits

1

T

A

GC

2

GC

3

C

T

G

A

4

CG

5

T

A

GC

6

C

G

T

A

7

G8

CAG

9

C

A

T

G

10

CG

11

CG

12

TA

GC

13

C

T

A

G

14

CG

CTCF disc5

00

10

20

bits

1

C2

T

GCA

3

C4

GT

5

A6

G7

G8

TCTCF disc6

00

10

20

bits

1

TAG

C2

TGA

C3

T

C

GA

4

T

A

GC

5

GAT

C6

G

CTA

7

TCA

G8

TCA

G9

ACTG

10

TAC

G

CTCF disc7

00

10

20

bits

1

TGAC

2

G

C3

TA

4

G5

C6

A7

G8

G9

TCTCF disc10

00

10

20

bits

1

CG

2

C3

G4

GC

5

AGC

6

T7

T8

TE2F disc5

00

10

20

bits

1

A

TC

2

GA

3

C4

CG

5

T

C6

A7

AT

C8

CA

G9

A

TG

C10

T

C

GA

EGR1 disc7

00

10

20

bits

1

T

C

GA

2

CTA

G3

C

AT

G4

A

GCT

5

A

TGC

6

T

C

GA

7

T

A

GC

8

ATGC

9

G

CT

10

A

CGT

ESRRA disc3

00

10

20

bits

1

A2

A3

C4

C5

CG

6

T

CA

G7

T

CGA

8

TCA

9

T

CAG

10

A

G

T

C

ETS disc4

00

10

20

bits

1

A2

A3

T4

A5

T6

T7

TG

8

GA

9

C10

CTA

FOXA disc2

00

10

20

bits

1

A2

CGT

3

C4

T5

G6

A7

T8

CA

GATA disc4

00

10

20

bits

1

A2

A3

G4

CT

5

TC

6

C7

A8

G9

ACT

HNF4 disc2

00

10

20

bits

1

A2

G3

T4

GC

5

C6

A7

GA

8

GA

9

GHNF4 disc3

00

10

20

bits

1

T2

G3

GA

4

A5

A6

C7

T8

TIRF disc6

00

10

20

bits

1

CA

2

C3

G4

T5

A

G6

GA

7

TC

8

T9

TMYC disc6

00

10

20

bits

1

A2

A3

AC

4

CGA

5

C6

G7

AT

8

A

G

MYC disc7

00

10

20

bits

1

TGA

2

AC

3

A

TGC

4

T

G5

TAG

C6

T

CG

7

ACT

8

A

T

CG

MYC disc8

00

10

20

bits

1

ACG

2

A

T

GC

3

T

AG

4

T

A

CG

5

C

A

G6

T

AGC

7

T

CAG

8

T

A

CG

9

T

ACG

10

A

CG

NRF1 disc3

00

10

20

bits

1

C

ATG

2

G3

CGA

4

CG

5

CGT

6

GC

7

C

G

T

A

8

CG

9

GC

10

C

T

A

G

11

CG

NFE2 disc4

00

10

20

bits

1

A2

C3

C4

CTGA

5

A

CT

6

G7

G8

CA

9

T

GC

10

AREST disc2

00

10

20

bits

1

G2

A

G3

TCA

4

G

C5

A6

G7

GTA

C8

CATG

9

T

AGC

10

GATC

REST disc3

00

10

20

bits

1

A2

C3

A4

T

G5

GC

6

G7

T

A

G

C

8

TREST disc6

00

10

20

bits

1

A2

C3

A4

G5

C6

TG

7

AT

8

GC

REST disc7

00

10

20

bits

1

A

TGC

2

TCGA

3

C

G4

TC

5

GTA

6

AG

C7

TGC

8

T9

TREST disc10

00

10

20

bits

1

A

GCT

2

ATG

C3

G

ACT

4

C

TGA

5

T

AC

G6

CGAT

7

C

TA

G8

ATC

G

RAD21 disc2

00

10

20

bits

1

A2

C3

C4

G

CT

5

C

AG

6

TAC

G7

C

AGT

8

CA

G9

TG

10

G

TAC

RAD21 disc4

00

10

20

bits

1

T

AGC

2

T

GCA

3

G

A

CT

4

T

GAC

5

G

CT

6

AC

7

CAG

8

T9

T

GC

10

C

A

GT

11

G

TCA

12

TAC

G

13

C

GAT

RAD21 disc9

00

10

20

bits

1

TGC

2

GA

C

T

3

TCG

4

G

A

T

C

5

TGC

6

TAC

7

TC

8

C9

AT

10

G11

CG

RAD21 disc10

00

10

20

bits

1

A2

A3

GA

4

T5

T6

C7

G

AC

8

GT

9

CT

G

STAT disc5

00

10

20

bits

1

TC

2

T3

CG

4

T5

C6

C7

TA

8

T9

G10

G11

TSIN3A disc3

00

10

20

bits

1

CA

2

G3

CG

4

TA

5

G6

C7

T8

G9

TSIN3A disc4

00

10

20

bits

1

T

GC

2

C3

G4

C5

CA

6

TGC

7

C8

TTCF12 disc3

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2

Table S2 Motifs with weak similarity to some other discovered motif These occur frequently for factorswith long known motifs that are amenable to breaking apart We do not believe that these representdistinct motifs and are likely artifacts of the discovery process

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 18: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Known motif Discovered motif

00

10

20

bits

1

C2

A3

G4

C5

AT

6

G7

G8

TC

TCF12 known130 (95)

00

10

20

bits

1

CGA

2

A

C3

TA

4

TGC

5

A

C6

AT

7

C

G8

A

T

CG

TCF12 disc1lowast

56 (112)

00

10

20

bits

1

C

TAG

2

T

AG

C3

CT

AG

4

CA

5

C6

C7

CA

8

C9

C10

T

CAG

11

GCTA

12

GTCA

ZBTB7A known413 (100)

00

10

20

bits

1

T

GCA

2

T

G3

T

C4

G5

A

C6

T

C7

A

C8

GT

C9

C10

AT

ZBTB7A disc122 (435)

00

10

20

bits

1

A

GT

C

2

T

C3

G

TCA

4

A

TGC

5

TACG

6

G

T7

A

T

C

G8

G

C

T

A

9

A

GTC

10

A

G

T

C

MXI1 known132 (96)

00

10

20

bits

1

GAC

2

C3

T

A4

C5

A

G6

T7

G8

TCG

9

A

GCT

10

A

G

CT

MXI1 disc2lowast

44 (295)

00

10

20

bits

1

TAG

2

GT

3

AG

4

CA

5

GC

6

T7

C8

A9

TG

10

TC

11

TGA

NFE2 known1564 (788)

00

10

20

bits

1

C

GA

2

A

T3

G4

G

A5

A

T

GC

6

GCAT

7

A

C8

C

GA

9

T

G10

T

CNFE2 disc1lowast

759 (1257)

00

10

20

bits

1

GA

C

2

AG

C

T

3

A

GTC

4

T

CGA

5

C

T

AG

6

C7

G

A

C8

TA

9

A10

C

T11

T

AGC

12

T

C

GA

13

TACG

14

T

G

AC

15

C

TAG

16

NFY known6726 (1106)

00

10

20

bits

1

GA

2

A

G

TC

3

A

G

TC

4

T

C

GA

5

C

TAG

6

C7

C8

A9

G

A10

C

T11

AGC

12

C

GA

13

T

CAG

14

T

GCA

15

C

TAG

NFY disc1899 (847)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5395 (386)

00

10

20

bits

1

TCA

G2

A

TGC

3

T

GAC

4

C

TA

G5

C6

C7

A8

T9

A

TGC

10

C

AT

11

CT

12

C

ATG

13

C

T

A

G

14 15

A

G

C

T

16 17 18

A

T

C

G

19

ATC

G

20

A

T

CG

21

T

AG

C

22

T

C

GA

23

CG

A

YY1 disc1475 (943)

00

10

20

bits

1

CA

T

2

G

TCA

3

GTC

4

C

AGT

5

A

CT

6

A7

C8

C9

T10

G11

G

C

AT

12

CAG

13

A

G

CT

ZEB1 known149 (34)

00

10

20

bits

1

A

GTC

2

G

A

C

T

3

TC

4

C

A5

C6

C7

T8

T

GZEB1 disc1

58 (111)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3283 (227)

00

10

20

bits

1

G

ATC

2

G

TA

3

A

GT

4

AGT

5

T

ACG

6

G

C

AT

7

G

A

CT

8

TA

9

AT

10

TG

11

GTC

12

G

C

TA

13

C

TGA

14

CTGA

15

C

GAT

POU5F1 disc1lowast

300 (248)

Known motif Discovered motif

00

10

20

bits

1

C

G

T

A

2

T

C

AG

3

G

TCA

4

C

TGA

5

G

A6

A

T

C

G7

A

CGT

8

A

G9

G

A10

C

GA

11

G

C

A12

A

G13

GCT

14

C

ATG

15

PRDM1 known2184 (233)

00

10

20

bits

1

A2

G3

AGT

4

G5

A6

GA

7

A8

G9

T10

C

ATG

PRDM1 disc1lowast

192 (251)

00

10

20

bits

1

A

C

G

T

2

TGA

3

GTAC

4

A

C5

T

GA

6

TAGC

7

G

ACT

8

G

A9

G10

TAG

11

C

A

GT

12

G13

C

T

G14

T

A

C15

A

G16

G

ATC

17

G

CAT

CTCF known2780 (2196)

00

10

20

bits

1

C

TGA

2

GTAC

3

A

C4

C

GTA

5

T

A

GC

6

G

CAT

7

T

G

C

A8

A

G9

GA

10

A

GT

11

G12

C

ATG

13

G

ATC

14

T

C

GA

15

T

A

GC

16

G

ACT

17

T

CGA

18

A

C

GT

19

G

C

TA

20

C

T

A

G

21 22

GT

C

A

CTCF disc1lowast

773 (1185)

00

10

20

bits

1

AT

G

C

2

T

AG

3

G

C

AT

4

CAT

5

C

TAG

6

TC

7

TC

8

CTGA

9

G

CAT

10

C

AG

11

TC

AG

12

ATC

13

GTA

14

C

G

A15

T

C16

T

A

C

G

RFX5 known6286 (338)

00

10

20

bits

1

T

A

GC

2

C3

TAC

4

CT

5

GA

6

G7

TC

8

A9

A10

CRFX5 disc1lowast

283 (607)

00

10

20

bits

1

T

C

A

G

2

T

CAG

3

T4

G5

C

A6

ATG

C7

ACGT

8

G

TAC

9

T

GCA

10 11

GA

C

T

AP1 known3457 (525)

00

10

20

bits

1

C

A

T

G

2

AG

3

T4

G5

A6

T

GC

7

T8

C9

A10

CT

AP1 disc3lowast

451 (820)

00

10

20

bits

1

T2

CAG

3

T4

GT

5

GT

6

GA

7

T

C8

CAT

9

A

G

CT

10

G

TA

11

C

ATG

12

AT

C

G

FOXA known6199 (144)

00

10

20

bits

1

T2

AG

3

G

T4

GT

5

GT

6

GA

7

C8

CAT

9

CT

10

TA

FOXA disc1lowast

195 (112)

00

10

20

bits

1

T

C

GA

2

G

A

C

T

3

C4

A5

G

TC

6

AG

7

T8

G9

CTGA

10

A

GTC

MYC known221048 (3342)

00

10

20

bits

1

CAG

2

T3

C4

A5

T

C6

A

G7

T8

G9

A10

GTC

MYC disc1lowast

978 (1395)

00

10

20

bits

1

T

CGA

2

T3

GT

4

TAG

5

G

A

T

C6

T

C

AG

7

TA

C8

CA

9

A10

G

CT

CEBPB known8849 (1006)

00

10

20

bits

1

CGA

2

T3

T4

TAG

5

TC

6

AG

7

TC

8

A9

A10

GCT

CEBPB disc1lowast

790 (914)

00

10

20

bits

1

TCGA

2

C

T

GA

3

T

A4

TA

5

TA

6

A

C

G7

GAC

8

A

G9

G10

A11

A12

C

G13

C

T14

C

GTA

SPI1 known4265 (305)

00

10

20

bits

1

A2

TA

3

CG

4

C

GA

5

G6

T

G7

A8

A9

CG

10

C

TSPI1 disc1lowast

246 (370)

Table S3 Complete version of Figure 4 with all factors for which a discovered motif matching a knownmotif was found When the discovered motif is not disc1 a better ranking motif was found that didnot match a literature motif lowast indicates that other discovered motifs were found even after manualexclusion of weakly redundant and low complexity motifs

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 19: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Known motif Discovered motif

00

10

20

bits

1

C

AT

G2

T

GAC

3

G

C4

A

C5

C

TA

6

C

AT

7

A8

AT

9

TA

10

C

AT

11

A

G12

CA

G

SRF known71266 (889)

00

10

20

bits

1

C2

C3

TA

4

AT

5

A6

T7

TA

8

TA

9

G10

GSRF disc11121 (938)

00

10

20

bits

1

CTGA

2

C

A

G

T

3

A

CT

4

C5

T

C6

C7

G

CTA

8

C

GTA

9

C

A

G10

C

A

G11

T

G12

T

C

GA

13

G

T

CA

14

G

ACT

EBF1 known487 (280)

00

10

20

bits

1

CT

2

C3

TC

4

C5

ATC

6

TAG

7

G8

AG

9

G10

GA

EBF1 disc173 (223)

00

10

20

bits

1

G2

G3

A

G4

GA

5

G

TCA

6

C

GAT

7

A

GCT

8

A

CT

9

TA

C10

T

CNFKB known3

398 (785)

00

10

20

bits

1

TG

2

G3

G4

GA

5

CA

6

AT

7

T8

CT

9

C10

CNFKB disc1

303 (457)

00

10

20

bits

1

C

GTA

2

G

CTA

3

GTAC

4

A

G

C5

A

C6

G7

G8

A9

TA

10

AG

11

A

CT

12

T

C

AG

ELF1 known2229 (688)

00

10

20

bits

1

C

GTA

2

T

GAC

3

GC

4

A

C5

G6

G7

A8

A9

A

G10

CT

ELF1 disc1170 (1059)

00

10

20

bits

1

T

CAG

2

AG

3

G4

TG

5

G

T6

G

T

C7

G

A8

T

GA

9

GA

10

G11

T

G12

G

T13

G

T

C14

GA

RXRA known1096 (124)

00

10

20

bits

1

A

GTC

2

T

CA

3

T

CA

4

C

GA

5

A

G6

ATG

7

GACT

8

C9

CA

10

RXRA disc1lowast

70 (82)

00

10

20

bits

1

GC

A

T

2

G

T

CA

3

G

T

C4

TAG

5

T

G

C6

C7

G

T

C8

CA

9

C10

C

A

T

G11

G

A

T

C12

CTGA

13

GC

A

T

14

G

C

A

T

EGR1 known8119 (2996)

00

10

20

bits

1

TAC

2

C3

A

G4

C5

C6

C7

AC

8

C9

G10

T

C

EGR1 disc1lowast

86 (2381)

00

10

20

bits

1

T

CAG

2

TC

GA

3

T

G4

A

GT

5

G

TAC

6

T

C7

G

A8

A9

G

A10

G11

G

T12

G

A

T

C13

T

G

C14

T

GA

15

G

T

C

A

16

HNF4 known18280 (342)

00

10

20

bits

1

CTGA

2

TAG

3

C

A

TG

4 5

G

AT

C6

T

G

CA

7

C

GA

8

T

GA

9

T

C

G10

A

GT

11

A

GTC

12

T

C13

C

GTA

HNF4 disc1lowast

200 (302)

00

10

20

bits

1

T

CAG

2

A

3

CTG

4

GATC

5

T

C

GA

6 7

G

CT

8

A

GC

9

T

GA

10

C

G

AT

11

CG

12

G

T

C13

T

C

AG

14

A

CGT

15

CT

AG

16

C

GTA

17

G

T

AC

18

T

A

C

G

PAX5 known5205 (782)

00

10

20

bits

1

T

C

A

G

2

CT

A

G

3

T

C

AG

4

T

C

A

G

5

TCG

6

ATC

7

TCGA

8

T

ACG

9

GTC

10

T

A

GC

11

TC

GA

12

C

TGA

13

CG

14

GTC

15

T

C

AG

16

C

A

GT

17

T

AG

18

CTGA

19

T

GAC

PAX5 disc1lowast

143 (473)

Known motif Discovered motif

00

10

20

bits

1

C

A

GT

2

TA

C3

C

T4

C

A5

TA

6

TA

7

TA

8

TA

9

T10

A11

T

C

AG

12

G

T

CA

MEF2 known12120 (68)

00

10

20

bits

1

C

G

AT

2

C

ATG

3

G

TAC

4

A

CT

5

G

TCA

6

C

GTA

7

GA

8

TA

9

TA

10

C

T11

GA

12

T

CAG

13

G

T

AC

14

G

TCA

15

GT

C

A

MEF2 disc1lowast

79 (67)

00

10

20

bits

1

C

G

T

A

2

C

G

TA

3

G

C

TA

4

TA

5

G

ACT

6

T

A

C

G7

TAC

8

G

CAT

9

C

ATG

10

G

TCA

11

A

T

GC

12

C

GAT

13

G

TAC

14

C

GTA

15

C

ATG

16

A

T

G

C17

C

TGA

18

AT

19

C

G

AT

20

G

C

AT

21

G

C

A

T

MAF known7578 (682)

00

10

20

bits

1

CT

2

G3

T

AC

4

C

T5

G6

A7

T

A

CG

8

T9

C10

AMAF disc1

371 (501)

00

10

20

bits

1

CA

T

G

2

TAG

3

T

G4

G

CAT

5

G

TA

6

C7

T

GA

8

A

GTC

9

A

C

T

G

10

T

C

AG

11

A

CT

12

G13

AT

14

C

GTA

15

G

A

C16

G

A

TC

17

NR3C1 known17335 (494)

00

10

20

bits

1

T

GA

2

A

T

G3

C

T

GA

4

T

G

CA

5

T

C6

C

GTA

7

A

GCT

8

GCT

9

A

T

G

C

10

CAT

11

T

C

A

G12

A

G

CT

13

G

A

CT

14

ATC

15

ACT

NR3C1 disc1lowast

208 (250)

00

10

20

bits

1

T

G

CA

2

C

GA

3

T

G4

G5

T6

C7

GA

8

AGTC

9

A

T

GC

10

A

CTG

11

CT

12

G13

C

A14

C15

A

C16

G

CT

17

G

A

CT

ESRRA known6356 (604)

00

10

20

bits

1

T

CGA

2

ATG

3

T

A

G4

AGT

5

A

T

G

C6

TGA

7

T

A

G

C

8

T

GAC

9

C

10

G

ACT

11

C

TA

G12

G

TCA

13

A

GT

C14

G

AT

C15

G

A

CT

ESRRA disc1lowast

210 (390)

00

10

20

bits

1

CAT

2

GT

3

TC

4

A5

G6

TAC

7

TA

8

TC

9

C10

TAG

11

C12

G13

G14

TA

15

CG

16

CA

17

TG

18

ATC

19

AG

20

GC

21

CREST known2

669 (1889)

00

10

20

bits

1

A

C

GT

2

A

GCT

3

A

T

C4

A5

G6

T

A

GC

7

A8

C9

G

A

T

C10

C

TGA

11

A

G

CT

12

T

AC

G13

T

CA

G14

TGCA

15

A

T

GC

REST disc1394 (1156)

00

10

20

bits

1 2

CT

G

A

3

G

A

T

4

A

GTC

5

A

GTC

6

A

GT

C7

C

AT

8

A

T9

C

AT

10

A

T

C

G11

TA

12

AT

13

T

A

GC

14

G

CAT

15

CG

T

A

16

C

G

A

T

17

AC

TCF7L2 known5109 (78)

00

10

20

bits

1

A

G

T

C

2

C3

AT

4

T5

T6

CG

7

A8

AT

9

CG

10

CT

TCF7L2 disc2lowast

60 (90)

00

10

20

bits

1

C

G

A

T

2

AG

3

G

A

C4

A

G5

C6

T

C

GA

7

G

A

C

T8

G9

T

C10

T

G11

TC

12

TC

GA

NRF1 known2847 (11965)

00

10

20

bits

1

A

GCT

2

T

C

A

G3

G

A

C4

A

G5

C6

T

GCA

7

A

C

GT

8

C

G9

C10

T

C

G11

T

C12

C

T

GA

13

T

ACG

14

G

C

A

T

15

T

CAG

NRF1 disc1lowast

413 (9697)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14238 (313)

00

10

20

bits

1

T

C

A

G

2

T

AGC

3

CTA

4

G5

A6

T7

TA

8

CA

9

C

G10

ACG

GATA disc1lowast

69 (234)

Table S3 (continued)

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 20: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Known motif Discovered motif

00

10

20

bits

1

G

C

TA

2

C

GTA

3

GACT

4

G

A5

A

G

CT

6

A

TG

7

A

GTC

8

GTA

9

C

TGA

10

T

G

A11

C

GAT

12

C

A

G

T

POU2F2 known15211 (53)

00

10

20

bits

1

G

CTA

2

C

AGT

3

CAT

G4

GTA

C5

G

C

TA

6

C

GTA

7

TGCA

8

CGAT

POU2F2 disc161 (110)

00

10

20

bits

1

GACT

2

A

CGT

3

C

GT

4

A

CG

5

C

G6

C7

G8

GC

9

A

TCG

10

T

G

CA

11

C

GTA

12

T

C

GA

E2F known9121 (1275)

00

10

20

bits

1

T

C

2

A

G

CT

3

A

G

CT

4

CT

5

GT

C6

GC

7

A

T

C8

T

G9

G

C10

TGC

11

G

T

A

C

12

G

A

T

C

13

T

A

GC

E2F disc3lowast

31 (712)

00

10

20

bits

1

T

C

A

G

2

C

A

GT

3

A

C4

T

C

A5

C6

G7

G

C

A

T8

G9

T

CA

10

A

G

TC

BHLHE40 known4702 (1684)

00

10

20

bits

1

GT

2

C3

G

TCA

4

TG

C5

CA

G6

C

GAT

7

G8

CA

9

T

GC

10

AG

C

T

BHLHE40 disc1173 (1670)

00

10

20

bits

1

C

TAG

2

ATG

3

AG

4

G5

G6

ATC

7

T

G8

CTG

9

C

TAG

10

C

ATG

11

G

A

CT

SP1 known858 (377)

00

10

20

bits

1

TA

CG

2

A

TCG

3

AG

4

TGA

5

C

AG

6

T

AG

7

AG

8

A

G9

C

A

T

G10

TAC

11

TAG

12

A

G13

T

AG

14

AG

15

T

A

GC

SP1 disc3lowast

10 (88)

00

10

20

bits

1

T

CGA

2

T

AG

C3

A

C4

G5

T

C

G6

C

A7

G

TA

8

C

AG

9

A

G

CT

10

C

T

GA

ETS known18539 (3061)

00

10

20

bits

1

CGA

2

AGC

3

A

C4

G5

G6

A7

TA

8

A

G9

CT

10

T

ACG

ETS disc2lowast

80 (3306)

00

10

20

bits

1

G

A

TC

2

A

TC

3

G4

A5

T

A6

A7

A

G

C8

TC

9

G10

A11

T

A12

A13

GC

14

A

CT

IRF known20610 (741)

00

10

20

bits

1

C

TGA

2

TCGA

3

CG

A

4

C

A

GT

5

T

AG

6

G

A7

T

A8

A9

TACG

10

ACT

11

A

G12

GA

13

A14

G

A15

A

T

CG

IRF disc3lowast

88 (154)

00

10

20

bits

1

A

G

CT

2

T

CAG

3

G

CA

4

AC

5

A

C6

G

CT

7

GACT

8

CT

9

T

CAG

10

T

G

CA

11

T

G

AC

12

T

G

A

C13

G

CT

14

A

G

TC

NR2C2 known1534 (518)

00

10

20

bits

1

GCA

2

T

G

A

C3

T

C4

A

CT

5

G

A

CT

6

CT

7

AC

G8

T

G

CA

9

A

G

C10

A

T

CNR2C2 disc2lowast

40 (334)

00

10

20

bits

1

G

A

TC

2

ATGC

3

G

A

TC

4

GTC

5

TGCA

6

G

A

CT

7

T8

T9

A

T

C10

C11

GC

12

G13

C

AT

G14

A15

A16

C

G

T

A

17

AGCT

18

A

TGC

19

T

C

G

A

20

G

T

A

C

21

T

GAC

STAT known21163 (1310)

00

10

20

bits

1

G

A

CT

2

T3

T4

A

C5

T

C6

GC

A

T

7

G8

TG

9

A10

ASTAT disc1lowast

24 (561)

Table S3 (continued)

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 21: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Factor Matching discovered motifs

00

10

20

bits

1

T

GAC

2

T3

TG

4

A5

TGC

6

AT

7

AC

8

A9

A

G

CT

AP1 known5(TRE)

00

10

20

bits

1

T2

G3

A4

AT

G

C

5

T6

A

C7

A8

A

GTC

9

C

G

TA

10

T

A

GC

GATA disc2 (32)

00

10

20

bits

1

T

AGC

2

T3

C

T

G4

A5

A

CG

6

T7

A

C8

T

CA

9

G

TC

10

AGTC

SMARC disc1 (34)

00

10

20

bits

1

G

A

2

T

C

GA

3

T4

G5

A6

T

A

G

C

7

T8

C9

A10

A

G

CT

STAT disc2 (46)

00

10

20

bits

1

CGA

2

T3

G4

A5

T

A

CG

6

T7

C8

A9

GCT

10

AT

C

TCF7L2 disc1 (35)

00

10

20

bits

1

C

G

TA

2

CT

G

A

3

T

C

GA

4

G

ACT

5

C

A

GT

6

C

TAG

7

C

A

GT

8

T

G9

GATC

10

T

CA

11

G

A12

GC

AT

CEBPB known7

00

10

20

bits

1

T2

T3

TG

4

AT

5

G6

TC

7

A8

A9

TSTAT disc4lowast (47)

00

10

20

bits

1

G

A

CT

2

C

A

T

G

3

C

TAG

4

G

TA

C5

A

C6

C

GTA

7

A

GC

8

G

ATC

9

T

C

GA

10

A

G11

TAG

12

ATG

13

T

A

G14

C

AT

G15

GTA

C16

T

AG

17

T

A

GC

18

G

A

CT

19

T

CGA

CTCF known1

00

10

20

bits

1

TC

2

A3

G4

AG

5

T

G6

G7

G8

A

C9

AG

10

ATGC

CTCFL disc1lowast (64)

00

10

20bits

1

C

T

A

G

2

C

T

AG

3

G

TA

C4

C5

GTA

6

AT

GC

7

G

ATC

8

C

TGA

9

C

A

G10

GA

11

CA

GT

12

A

G13

C

AT

G14

G

TAC

15

C

T

GA

16

T

A

GC

17

G

A

CT

18

T

CGA

19

A

C

GT

20

G

C

TA

RAD21 disc1lowast (62)

00

10

20

bits

1

T

C

AG

2

TGAC

3

C4

C

TGA

5

T

GC

6

G

ATC

7

T

C

GA

8

T

C

G9

AG

10

A

TG

11

T

C

A

G12

C

AT

G13

G

TA

C14

C

T

GA

15

T

A

GC

SMC3 disc1 (63)

00

10

20

bits

1

TC

A

G

2

TC

GA

3

C

T

A

G

4

G

TAC

5

G

TA

C6

TG

7

A

G8

GA

9

TCA

10

T

AG

ETS known7

00

10

20

bits

1

A

GCT

2

CAG

3

T

CA

4

AC

5

C6

T

C7

G8

G9

A10

TA

NR2C2 disc1 (51)

00

10

20

bits

1

CGA

2

A

GC

3

A4

G5

G6

A7

TA

8

CAG

9

GCT

10

T

C

AG

11

G

C

A

12

A

CGT

13

C

AGT

14

T

A

C

G

15

T

AGC

ETS known4

00

10

20

bits

1

C

T

AG

2

AGC

3

TA

4

G5

C

G6

G

A7

T

GA

8

GA

9

C

AGT

10

C

TAG

11

T

C

GA

GATA disc3 (49)

00

10

20

bits

1

A2

C

G3

G4

A5

A6

CA

7

CT

8

CAG

9

AMEF2 disc2 (50)

00

10

20

bits

1

A

T

G

C

2

A

CGT

3

A

CTG

4

A

TC

G

5 6 7 8 9 10

C

T

G

A

11

T

A

G

C

12

GC

TA

13

G14

C

A15

G

A

T16

G

TA

17

C

TGA

18

T

CAG

GATA known14

00

10

20

bits

1

TAGC

2

C

GAT

3

A

T

G4

A

T

CG

5 6

TA

C

G

7

G

8

G

9

TA

C

G

10 11

T

A

GC

12

GC

TA

13

A

G14

A15

GCT

16

CTA

17

CTGA

18

T

CAG

19

T

C

AG

TAL1 disc1(66 67)

00

10

20

bits

1

G

A

T

C

2

G

CTA

3

T

C

A

G

4

GAT

5

G6

GTA

7

C8

A9

T

A

CG

10

T

C

AG

11

C

G

A

T

12

T

A

GC

MEIS1 1

00

10

20

bits

1

C

GAT

2

G3

A4

A

CTG

5

CGT

6

G7

GA

8

T

C9

GA

10

C

AT

G

PBX3 disc2 (65)

Table S4 Selected shared motifs with literature support Shown are the motifs that match a knownmotif for the indicated factor along with relevant citations Details in the text lowast indicates motif isreverse complemented for comparison purposes

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 22: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Factor Matching discovered motifs

00

10

20

bits

1

A

G

T

C

2

CA

3

A4

TG

C5

A

C

GT

6

T

CAG

7 8

T

GAC

9

MYB 4

00

10

20

bits

1

A2

A3

C4

G5

G6

CA

7

T

CA

8

TA

C

G

ETS disc8 (54)

00

10

20

bits

1

T

C

G

A

2

C

A

T

G

3

T

C

GA

4

G

A

CT

5

C6

C

G

A7

T

C8

A

G9

G

C

T10

G11

C

T

GA

12

G

A

CT

13

G

T

A

C

14

A

G

C

T

MYC known3

00

10

20

bits

1

C

TAG

2

T

ACG

3

A

TC

4

C5

A6

TG

C7

A

C

G8

T9

G10

GA

SIN3A disc2 (38)

00

10

20

bits

1 2

G

TAC

3

A

GCT

4

A

G

CT

5

T

C

GA

6

C

AG

7

C8

G

A

C9

A10

TCA

11

C

GT

12

TAGC

13

T

C

GA

NFY known5

00

10

20

bits

1

A2

G3

C4

C5

CA

6

A7

CT

8

GC

9

GA

CEBPB disc2 (43)

00

10

20

bits

1

TGC

2

T

C

GA

3

TAG

4

C5

C6

A7

A8

T9

AGC

10

C

GA

E2F disc4lowast (45)

00

10

20

bits

1

G

ATC

2

GA

3

TAG

4

C5

C6

A7

A8

T9

T

AGC

10

GA

IRF disc1 (40)

00

10

20

bits

1

T

C

GA

2

C

TAG

3

C4

A

C5

A6

G

C

A7

G

C

A

T8

T

A

GC

9

CGA

10

T

CAG

RFX5 disc2 (42)

00

10

20

bits

1

A

G

C

T

2

GA

T

C

3

T

C

GA

4

C

TAG

5

C6

A

C7

G

A8

A9

A

GT

10

T

AGC

11

C

GA

12

T

CAG

13

T

GCA

14

C

T

GA

15

T

A

C

G

16

T

A

C

G

17

T

A

CG

18

C

TAG

19

T

A

C

G

20

C

A

G

21

CATG

22

CG

T

SP1 disc1lowast (44)

00

10

20

bits

1

G

A

TC

2

G

C

AT

3

A

GCT

4

C

A

GT

5

T

A

CG

6

G

C

AT

7

G

A

CT

8

G

C

TA

9

G

C

A

T10

C

A

T

G11

A

GTC

12

CG

TA

13

C

T

GA

14

CTGA

15

C

AGT

POU5F1 known3

00

10

20

bits

1

A

CTG

2

GTC

3

A

TC

4

GC

AT

5

G

CT

6

T7

T

CG

8

G

CAT

9

A

GCT

10

G

ACT

11

G

A

CT

12

A

CTG

13

A

GTC

14

G

TCA

15

C

G

T

A

NANOG disc2lowast

(57 58)

00

10

20

bits

1

GCT

2

G

C3

TA

4

G5

AC

6

TA

7

GC

8

AC

9

GA

10

GTC

11

C

G12

A

G13

GA

14

GC

15

TA

16

C

G17

G

ATC

18

TAG

19

AGC

REST known3

00

10

20

bits

1

A

G

CT

2

T

A

C3

G

A4

A

G5

TAG

C6

A7

C8

G

AT

C9

C

T

GA

10

G

ATC

11

A

G12

A

G13

C

T

GA

14

A

T

G

C15

C

G

TA

SIN3A disc1lowast (37)

00

10

20

bits

1

CGA

2

T

A

G3

A

G4

A5

G

A6

T

CAG

7

GACT

SPI1 known2

00

10

20

bits

1

GA

2

G3

G4

GA

5

GA

6

CG

7

GAT

8

G9

GA

10

CGA

IRF disc5 (41)

00

10

20

bits

1 2

G

ATC

3

A

T

G

C

4

C

ATG

5

GAT

C6

C7

TA

8

A

GT

9

GA

C

T

10

G

CAT

11

CAGT

YY1 known5

00

10

20

bits

1

C2

GC

3

A

G4

A

C5

C6

A7

T8

G9

T10

TTHAP1 disc2

(55 56)

Table S4 (continued)

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 23: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Discovered motif Motif from (85)

00

10

20

bits

1

A

G

CT

2

T

GAC

3

C

GAT

4

C5

A

G6

C7

G8

T

CGA

9

T

C

G10

GA

BRCA1 disc1 (Novel1)

00

10

20

bits

1

TC

2 3

T4

C5

G6

C7

G8

A9

TG

10

AM8

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

TC

2

G3

G4

G5 6 7 8

TC

9

GA

10

T11

A12

G13

TM4

00

10

20

bits

1

A

CT

2

T

A

C

G3

T

C

A

G4

TA

G5

C

GA

6

C

TAG

7

CGT

8

T9

G10

T11

A12

G13

T14

TETS disc1 (Novel2)

00

10

20

bits

1

CA

2 3

T4

T5

G6

TA

7

TA

8

G9

T10

TM108FAC1

00

10

20

bits

1

C

T

A

G2

CAT

G3

TCGA

4

TCA

G5

A

GCT

6

AGCT

7

A

TC

G8

ACGT

ETS disc5 (Novel2)

00

10

20

bits

1

GA

2

GA

3

A4

G5

T6

T7

G8

TM129

00

10

20

bits

1

G2

G3

G4

TA

5

C

T

GA

6

C

G

AT

7

CT

8

G9

T10

ASIX5 disc2 (Novel2)

00

10

20

bits

1

TC

2

GA

3

G4

G5

A6 7

A8

T9

G10 11

AM75STAT1

00

10

20

bits

1

A2

A3

G4

G5

G6

G7

C8

G9

AT

G10

TC

G

SP2 disc3 (Novel3)

00

10

20

bits

1

G2

G3

G4

C5

G6

G7

GA

M6SP1

00

10

20

bits

1

C2

G3

GC

4

TAGC

5

T

AGC

6

G

C7

T8

TZBTB7A disc2 (Novel3)

00

10

20

bits

1

G2

C3

G4

GC

5

C6

CA

7 8

T9

T10

TM164

00

10

20

bits

1

A2

A3

G4

C5

G6

AG

7

A8

TC

GA

TATA disc5 (Novel6)

00

10

20

bits

1

G2

G3

A4

A5 6

C7

G8

G9

A10

A11 12

TC

M21

Table S5 Novel motifs matching those found using systematic conservation measures in four mammalsby Xie et al (85) One representative motif is shown for each putative novel motif when multiple matchthe same motif Each motif from (85) is named using the provided identifier and annotation

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 24: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

References

8 The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the

human genome Nature 489 57ndash74

10 Gerstein M B Kundaje A Hariharan M Landt S G Yan K-K Cheng C Mu X J

Khurana E Rozowsky J Alexander R Min R Alves P Abyzov A Addleman N Bhardwaj

N Boyle A P Cayting P Charos A Chen D Z Cheng Y Clarke D Eastman C

Euskirchen G Frietze S Fu Y Gertz J Grubert F Harmanci A Jain P Kasowski M

Lacroute P Leng J Lian J Monahan H OrsquoGeen H Ouyang Z Partridge E C Patacsil

D Pauli F Raha D Ramirez L Reddy T E Reed B Shi M Slifer T Wang J Wu L

Yang X Yip K Y Zilberman-Schapira G Batzoglou S Sidow A Farnham P J Myers

R M Weissman S M and Snyder M (2012) Architecture of the human regulatory network

derived from ENCODE data Nature 489 91ndash100

11 Matys V Fricke E Geffers R Gossling E Haubrock M Hehl R Hornischer K Karas

D Kel A E Kel-Margoulis O V Kloos D Land S Lewicki-Potapov B Michael H

Munch R Reuter I Rotert S Saxel H Scheer M Thiele S and Wingender E (2003)

TRANSFAC(R) transcriptional regulation from patterns to profiles Nucleic Acids Research 31

374ndash378

12 Sandelin A Alkema W Engstrm P Wasserman W W and Lenhard B (2004) JASPAR an

open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Research

32 D91ndash94

13 Berger M F Philippakis A A Qureshi A M He F S Estep P W and Bulyk M L

(2006) Compact universal DNA microarrays to comprehensively determine transcription-factor

binding site specificities Nature Biotechnology 24 1429ndash1435

14 Badis G Berger M F Philippakis A A Talukder S Gehrke A R Jaeger S A Chan

E T Metzler G Vedenko A Chen X Kuznetsov H Wang C Coburn D Newburger

D E Morris Q Hughes T R and Bulyk M L (2009) Diversity and Complexity in DNA

Recognition by Transcription Factors Science 324 1720ndash1723

15 Berger M F Badis G Gehrke A R Talukder S Philippakis A A Pea-Castillo L Alleyne

T M Mnaimneh S Botvinnik O B Chan E T Khalid F Zhang W Newburger D

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 25: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Jaeger S A Morris Q D Bulyk M L and Hughes T R (2008) Variation in homeodomain

DNA binding revealed by high-resolution analysis of sequence preferences Cell 133 1266ndash1276

16 Jolma A Yan J Whitington T Toivonen J Nitta K R Rastas P Morgunova E Enge

M Taipale M Wei G Palin K Vaquerizas J M Vincentelli R Luscombe N M Hughes

T R Lemaire P Ukkonen E Kivioja T and Taipale J (2013) DNA-Binding Specificities of

Human Transcription Factors Cell 152 327ndash339

17 Hughes J D Estep P W Tavazoie S and Church G M (2000) Computational identification

of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces

cerevisiae Journal of Molecular Biology 296 1205ndash1214

18 Liu X S Brutlag D L and Liu J S (2002) An algorithm for finding protein-DNA binding sites

with applications to chromatin-immunoprecipitation microarray experiments Nature Biotechnol-

ogy 20 835ndash9

19 Bailey T L and Elkan C (1994) Fitting a mixture model by expectation maximization to discover

motifs in biopolymers Proceedings of the International Conference on Intelligent Systems for

Molecular Biology 2 28ndash36

20 Pavesi G Mauri G and Pesole G (2001) An algorithm for finding signals of unknown length

in DNA sequences Bioinformatics 17 S207ndashS214

21 Ettwiller L Paten B Ramialison M Birney E and Wittbrodt J (2007) Trawler de novo

regulatory motif discovery pipeline for chromatin immunoprecipitation Nature Methods 4 563ndash

565

32 Kawana M Lee M E Quertermous E E and Quertermous T (1995) Cooperative interac-

tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene Molecular and Cellular

Biology 15 4225ndash4231

34 Ito T Yamauchi M Nishina M Yamamichi N Mizutani T Ui M Murakami M and Iba

H (2001) Identification of SWISNF complex subunit BAF60a as a determinant of the transacti-

vation potential of FosJun dimers The Journal of Biological Chemistry 276 2852ndash2857

35 Nateri A S Spencer-Dene B and Behrens A (2005) Interaction of phosphorylated c-Jun with

TCF4 regulates intestinal cancer development Nature 437 281ndash285

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 26: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

37 Huang Y Myers S J and Dingledine R (1999) Transcriptional repression by REST recruitment

of Sin3A and histone deacetylase to neuronal genes Nature Neuroscience 2 867ndash872

38 Nascimento E M Cox C L Macarthur S Hussain S Trotter M Blanco S Suraj M

Nichols J Kbler B Benitah S A Hendrich B Odom D T and Frye M (2011) The

opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis

Nature Cell Biology 13 1395ndash1405

40 Li-Weber M Davydov I Krafft H and Krammer P (1994) The role of NF-Y and IRF-2 in

the regulation of human IL-4 gene expression The Journal of Immunology 153 4122 ndash4133

41 Scott E Simon M Anastasi J and Singh H (1994) Requirement of transcription factor PU1

in the development of multiple hematopoietic lineages Science 265 1573 ndash1577

42 Villard J Peretti M Masternak K Barras E Caretti G Mantovani R and Reith W

(2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility

Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y Molecular

and Cellular Biology 20 3364 ndash3376

43 Yu L Wu Q Yang C P and Horwitz S B (1995) Coordination of transcription factors NF-Y

and CEBP beta in the regulation of the mdr1b promoter Cell Growth amp Differentiation The

Molecular Biology Journal of the American Association for Cancer Research 6 1505ndash1512

44 Roder K Wolf S Larkin K and Schweizer M (1999) Interaction between the two ubiquitously

expressed transcription factors NF-Y and Sp1 Gene 234 61ndash69

45 Caretti G Salsi V Vecchi C Imbriano C and Mantovani R (2003) Dynamic Recruitment

of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters Journal of Biological Chemistry

278 30435 ndash30440

46 Ivanov V N Bhoumik A Krasilnikov M Raz R Owen-Schaub L B Levy D Horvath

C M and Ronai Z (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription

Molecular Cell 7 517ndash528

47 Choi S Cho Y Kim H and Park J (2007) ROS mediate the hypoxic repression of the hepcidin

gene by inhibiting CEBPalpha and STAT-3 Biochemical and Biophysical Research Communica-

tions 356 312ndash317

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 27: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

49 Rothbcher U Bertrand V Lamy C and Lemaire P (2007) A combinatorial code of maternal

GATA Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian

ectoderm Development 134 4023ndash4032

50 Taylor J M Dupont-Versteegden E E Davies J D Hassell J A Houl J D Gurley C M

and Peterson C A (1997) A role for the ETS domain transcription factor PEA3 in myogenic

differentiation Molecular and Cellular Biology 17 5550ndash5558

51 OrsquoGeen H Lin Y Xu X Echipare L Komashko V M He D Frietze S Tanabe O Shi

L Sartor M A Engel J D and Farnham P J (2010) Genome-wide binding of the orphan

nuclear receptor TR4 suggests its general role in fundamental biological processes BMC Genomics

11 689

54 Dudek H Tantravahi R V Rao V N Reddy E S and Reddy E P (1992) Myb and Ets

proteins cooperate in transcriptional activation of the mim-1 promoter Proceedings of the National

Academy of Sciences 89 1291 ndash1295

55 Mazars R Gonzalez-de-Peredo A Cayrol C Lavigne A Vogel J L Ortega N Lacroix C

Gautier V Huet G Ray A Monsarrat B Kristie T M and Girard J (2010) The THAP-

zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase a link

between DYT6 and DYT3 dystonias The Journal of Biological Chemistry 285 13364ndash13371

56 Yu H Mashtalir N Daou S Hammond-Martel I Ross J Sui G Hart G W Rauscher

Frank J r Drobetsky E Milot E Shi Y and Affar E B (2010) The ubiquitin carboxyl

hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene

expression Molecular and Cellular Biology 30 5071ndash5085

57 Looijenga L H J Stoop H de Leeuw H P J C de Gouveia Brazao C A Gillis A J M

van Roozendaal K E P van Zoelen E J J Weber R F A Wolffenbuttel K P van Dekken

H Honecker F Bokemeyer C Perlman E J Schneider D T Kononen J Sauter G and

Oosterhuis J W (2003) POU5F1 (OCT34) identifies cells with pluripotent potential in human

germ cell tumors Cancer Research 63 2244ndash2250

58 Loh Y Wu Q Chew J Vega V B Zhang W Chen X Bourque G George J Leong B

Liu J Wong K Sung K W Lee C W H Zhao X Chiu K Lipovich L Kuznetsov V A

Robson P Stanton L W Wei C Ruan Y Lim B and Ng H (2006) The Oct4 and Nanog

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 28: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

transcription network regulates pluripotency in mouse embryonic stem cells Nature Genetics 38

431ndash440

62 Wendt K S Yoshida K Itoh T Bando M Koch B Schirghuber E Tsutsumi S Nagae

G Ishihara K Mishiro T Yahata K Imamoto F Aburatani H Nakao M Imamoto N

Maeshima K Shirahige K and Peters J (2008) Cohesin mediates transcriptional insulation by

CCCTC-binding factor Nature 451 796ndash801

63 Rubio E D Reiss D J Welcsh P L Disteche C M Filippova G N Baliga N S Aebersold

R Ranish J A and Krumm A (2008) CTCF physically links cohesin to chromatin Proceedings

of the National Academy of Sciences of the United States of America 105 8309ndash8314

64 Jelinic P Stehle J and Shaw P (2006) The Testis-Specific Factor CTCFL Cooperates with the

Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation PLoS Biol 4

e355

65 Bischof L J Kagawa N Moskow J J Takahashi Y Iwamatsu A Buchberg A M and

Waterman M R (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co-

operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17 Journal of Biological

Chemistry 273 7941 ndash7948

66 Kappel A Schlaeger T M Flamme I Orkin S H Risau W and Breier G (2000) Role of

SCLTal-1 GATA and ets transcription factor binding sites for the regulation of flk-1 expression

during murine vascular development Blood 96 3078ndash3085

67 Mouthon M A Bernard O Mitjavila M T Romeo P H Vainchenker W and Mathieu-

Mahul D (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis

Blood 81 647ndash655

85 Xie X Lu J Kulbokas E J Golub T R Mootha V Lindblad-Toh K Lander E S

and Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime]

UTRs by comparison of several mammals Nature 434 338ndash345

97 Lindblad-Toh K Garber M Zuk O Lin M F Parker B J Washietl S Kheradpour P

Ernst J Jordan G Mauceli E Ward L D Lowe C B Holloway A K Clamp M Gnerre

S Alfoldi J Beal K Chang J Clawson H Cuff J Di Palma F Fitzgerald S Flicek

P Guttman M Hubisz M J Jaffe D B Jungreis I Kent W J Kostka D Lara M

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 29: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

Martins A L Massingham T Moltke I Raney B J Rasmussen M D Robinson J Stark

A Vilella A J Wen J Xie X Zody M C Worley K C Kovar C L Muzny D M Gibbs

R A Warren W C Mardis E R Weinstock G M Wilson R K Birney E Margulies

E H Herrero J Green E D Haussler D Siepel A Goldman N Pollard K S Pedersen

J S Lander E S and Kellis M (2011) A high-resolution map of human evolutionary constraint

using 29 mammals Nature 478 476ndash482

101 MacArthur S Li X Li J Brown J Chu H C Zeng L Grondona B Hechmer A

Simirenko L Keranen S Knowles D Stapleton M Bickel P Biggin M and Eisen M

(2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative

differences in binding to an overlapping set of thousands of genomic regions Genome Biology 10

R80

102 Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein

multiple-alignments Nucleic Acids Research 24 3836ndash3845

103 Gray K A Daugherty L C Gordon S M Seal R L Wright M W and Bruford E A

(2013) Genenamesorg the HGNC resources in 2013 Nucleic acids research 41 D545ndash552

104 Kharchenko P V Tolstorukov M Y and Park P J (2008) Design and analysis of ChIP-seq

experiments for DNA-binding proteins Nat Biotech 26 1351ndash1359

105 Kent W J Sugnet C W Furey T S Roskin K M Pringle T H Zahler A M and

Haussler D (2002) The Human Genome Browser at UCSC Genome Research 12 996 ndash1006

106 Harrow J Frankish A Gonzalez J M Tapanari E Diekhans M Kokocinski F Aken

B L Barrell D Zadissa A Searle S Barnes I Bignell A Boychenko V Hunt T Kay

M Mukherjee G Rajan J Despacio-Reyes G Saunders G Steward C Harte R Lin M

Howald C Tanzer A Derrien T Chrast J Walters N Balasubramanian S Pei B Tress

M Rodriguez J M Ezkurdia I van Baren J Brent M Haussler D Kellis M Valencia

A Reymond A Gerstein M Guig R and Hubbard T J (2012) GENCODE The reference

human genome annotation for The ENCODE Project Genome research 22 1760ndash1774

107 Touzet H and Varre J (2007) Efficient and accurate P-value computation for Position Weight

Matrices Algorithms for Molecular Biology 2 15

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215

Page 30: Systematic discovery and characterization of regulatory ...compbio.mit.edu/publications/100_Kheradpour_NAR_13.pdfUnderstanding the motif content of these data sets is an important

108 Wilson E B (1927) Probable Inference the Law of Succession and Statistical Inference Journal

of the American Statistical Association 22 209ndash212

109 Mahony S Auron P E and Benos P V (2007) DNA Familial Binding Profiles Made Easy

Comparison of Various Motif Alignment and Clustering Strategies PLoS Comput Biol 3 e61

110 Sandelin A and Wasserman W W (2004) Constrained binding site diversity within families of

transcription factors enhances pattern discovery bioinformatics Journal of molecular biology 338

207ndash215