
Resource

Dependency of the Cancer-Specific Transcriptional Regulation Circuitry on the Promoter DNA Methylome
Graphical Abstract

[Graphical abstract. Inputs: DNA methylation, mRNA expression, and DNA copy number (21 cancers in TCGA). Output: MeTRN, DNA methylation-dependent transcription regulatory networks.]

Highlights

• An analysis pipeline based on information theory for TCGA cancer multi-omics data

• Genome-wide surveys of promoter CpG sites in modulating transcription in cancers

• Transcription factors and CpG sites coupled in determining gene expression dynamics

• Resource for dissecting the gene expression dysregulation machineries in cancers

Liu et al., 2019, Cell Reports 26, 3461–3474. March 19, 2019. © 2019 The Author(s). https://doi.org/10.1016/j.celrep.2019.02.084

Authors

Yu Liu, Yang Liu, Rongyao Huang, ..., Shengcheng Dong, Yang Yang, Xuerui Yang

[email protected]

In Brief

Using an analysis pipeline based on information theory and tailored for cancer multi-omics data in TCGA, Liu et al. conducted genome-wide surveys of the promoter DNA methylome in modulating transcriptional regulation circuits in cancers. The results serve as a resource for dissecting gene expression dysregulation in cancer.


Cell Reports

Resource

Dependency of the Cancer-Specific Transcriptional Regulation Circuitry on the Promoter DNA Methylome

Yu Liu,1,2,3,4,6 Yang Liu,1,3,4,5,6 Rongyao Huang,1,3,4,6 Wanlu Song,1,2,3,4 Jiawei Wang,1,3,4 Zhengtao Xiao,1,2,3,4 Shengcheng Dong,1,3,4 Yang Yang,1,3,4 and Xuerui Yang1,3,4,7,*

1 MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
2 Tsinghua-Peking Joint Center for Life Sciences, Beijing 100084, China
3 Center for Synthetic & Systems Biology, Tsinghua University, Beijing 100084, China
4 School of Life Sciences, Tsinghua University, Beijing 100084, China
5 Joint Graduate Program of Peking-Tsinghua-National Institute of Biological Science, Tsinghua University, Beijing 100084, China
6 These authors contributed equally
7 Lead Contact
* Correspondence: [email protected]


SUMMARY

Dynamic dysregulation of the promoter DNA methylome is a signature of cancer. However, a comprehensive understanding of how the DNA methylome is incorporated in the transcriptional regulation circuitry and involved in regulating the gene expression abnormality in cancers is still missing. We introduce an integrative analysis pipeline based on mutual information theory and tailored for the multi-omics profiling data in The Cancer Genome Atlas (TCGA) to systematically find dependencies of transcriptional regulation circuits on promoter CpG methylation profiles for each of 21 cancer types. By coupling transcription factors with CpG sites, this cancer type-specific transcriptional regulation circuitry recovers a significant layer of expression regulation for many cancer-related genes. The coupled CpG sites and transcription factors also serve as markers for classifications of cancer subtypes with different prognoses, suggesting physiological relevance of the regulation machinery recapitulated here. Our results therefore generate a resource for further studies of the epigenetic scheme in gene expression dysregulations in cancers.

INTRODUCTION

DNA methylation of CpG dinucleotides has been shown to play critical roles in pluripotency, development, and various diseases (Robertson, 2005; Smith and Meissner, 2013). Alteration of the methylation status at promoter CpG sites is associated with expression dysregulation of many cancer-related genes and is therefore recognized as one of the major driving factors of tumor initiation and development (Chatterjee and Vinson, 2012; Jones, 2012). Promoter DNA methylation is generally considered a potent epigenetic repressor of gene transcription that acts by blocking the recruitment of transcription factors (TFs) (Blattler and Farnham, 2013; Domcke et al., 2015; Jones, 2012), while recent studies have also uncovered many TF binding events that actually depend on methylated CpG (mCpG) (Hu et al., 2013; Liu et al., 2012; Zhu et al., 2016). These studies focused on detailed machineries of specific genes being regulated by CpG methylation-sensitive TFs in various contexts. However, genome-wide coupling of specific functional promoter CpG sites with the transcriptional regulation circuits from TFs to target genes has not yet been systematically revealed. Given the large number of promoter CpG sites, of which the functions are mostly unknown, systematic inference of the transcriptional regulatory networks that take into account the potential modulatory function of the promoter DNA methylome would generate a comprehensive view of the epigenetic scheme of gene transcription regulation and an insightful resource for further mechanistic studies.

Transcription regulatory networks (TRNs), which are composed of regulatory circuits from TFs to their target genes, provide detailed maps of gene expression regulations in specific cellular contexts (Vaquerizas et al., 2009). Various experimental (Hu et al., 2007; Johnson et al., 2007; Neph et al., 2012a; Sopko et al., 2006) and computational (Basso et al., 2005; Marbach et al., 2012; Margolin et al., 2006; Wang et al., 2009b) methodologies have been developed to assemble the transcription regulatory networks in different contexts. Recently, strategies of integrating multi-source information have also been used for systematic recovery of the context-specific gene regulatory repertoire (Budden et al., 2014; Jiang et al., 2015; Li et al., 2014; Wang et al., 2013). Although previous experimental and integrative strategies revealed highly valuable and biologically relevant insights into the gene regulatory logic, they relied substantially on the availability and quality of regulation profiling data, such as chromatin immunoprecipitation sequencing (ChIP-seq), DNaseI sequencing (DNaseI-seq), and histone modification. On the other hand, computational de novo reverse engineering of the transcription networks usually uses mathematical and statistical approaches to uncover dependencies between TFs and targets from gene expression datasets (Basso et al., 2005; Marbach et al., 2012; Margolin et al., 2006; Wang et al., 2009b). These methods generated genome-wide surveys of TF-target associations, which are very useful for systematically understanding the context-specific transcriptional regulation programs, especially when the high-throughput experimental methods are not applicable, such as when dealing with clinical samples from limited sources. However, most of these previous efforts did not take into account the information of DNA methylation for de novo reconstruction of the cancer type-specific transcription regulatory networks, despite extensive findings about the interplay between DNA methylation and TFs in determining the gene expression profile. This is partially due to the lack of suitable data and specially designed analysis tools.

Figure 1. Genome-wide Identifications of the TF-Target Gene Regulatory Circuits Modulated by Promoter CpG Methylation Levels

(A) Schematic description of the methodology for search of the TF-target gene circuits that depend on the level of methylation at specific promoter CpG sites. Taking a TF i (TFi), a potential target gene j (Genej) and its DNA copy number (CNVj), as well as a CpG site m within the promoter region of the target gene j (CpGjm) as examples, the diagram shows how the multi-omics data were used to test whether the TF-target gene association depends on promoter CpG methylation. Simulated data entries of true negatives were added into the original datasets and used to estimate false discovery rates at each step of the pipeline. Details of the pipeline are provided in the Method Details section.

(B and C) Examples showing positive (B) and negative (C) correlation patterns between the methylation levels at promoter CpG sites and the gene expression profiles in tumors of BLCA.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

In this study, to systematically assess the potential involvement of DNA methylation in gene transcriptional regulation, we performed an integrative analysis of multi-omics data, including genome-wide mRNA expression, DNA copy number variation (CNV), and CpG methylation data for 21 cancer types that have at least 150 independent tumor samples from The Cancer Genome Atlas (TCGA). Specifically, we developed a mutual information-based algorithmic pipeline to screen for TF-target transcriptional regulation circuits that depend on the methylation levels of specific promoter CpG sites for each cancer type (Figure 1A). Cancer type-specific DNA methylation-dependent transcription regulatory networks (MeTRNs) were assembled from these circuits for 21 types of cancer and then cross validated with ChIP-seq data of 161 TFs. We further showed that the expression dynamics of many cancer-related genes can be largely attributed to these epigenetically modulated transcriptional regulatory circuits in the cancer type-specific MeTRNs. Finally, in multiple cancer types, the functional CpG sites coupled with the TFs in the MeTRNs serve as effective classifiers for prognostically different patient subgroups. We believe that such comprehensive collections of the context-specific transcriptional regulatory circuits and the modulatory CpG sites will be a resource for dissecting the driving forces of gene expression dysregulation in tumors and understanding the transcriptional regulation machinery underlying cancer.

RESULTS

Identification of the DNA Methylation Level-Dependent Transcriptional Regulation Circuits

Promoter DNA methylation at CpG sites has been recognized as a critical modulator of gene transcription by interfering with TF binding (Hu et al., 2013; Jones, 2012; Liu et al., 2012). Using the gene expression and promoter CpG methylation profiles of the tumor samples from 21 major cancers in TCGA (see Table S1 for a summary of the samples) (Liu et al., 2018), we performed a comprehensive survey of the correlations between gene expression and promoter CpG methylation levels across tumors for each of the 21 cancers (statistics of the CpG-gene pairs are supplied in Liu et al., 2018). As expected, many of the promoter CpG-gene pairs showed medium to high levels of correlations between the CpG methylation and gene expression profiles (Liu et al., 2018), and two examples are shown in Figures 1B and 1C.

On the basis of the current understanding of how promoter CpG methylation interferes with the binding of TFs to target genes, we set a goal to systematically identify the methylation level-dependent transcriptional regulation circuits in different cancers, on a genome-wide scale and at the single-CpG site level. Using established tools from information theory (Kraskov et al., 2004; Wang et al., 2009a, 2009b), we developed an analysis pipeline (Figure 1A) that takes advantage of the highly dynamic tumor multi-omics data in TCGA to assess the involvement of specific promoter CpG sites in determining the association between TFs and their potential target genes for each cancer type. Specifically, for each type of cancer, we calculated the conditional mutual information (cMI) between expression profiles of a gene and a candidate TF in the tumor samples, given the methylation profile of a promoter CpG site of this gene, i.e., cMI(TF, Gene | CpG). This quantifies the additional predictive information that the methylation status of this CpG site provides toward the gene expression that is transcriptionally driven by the TF. In other words, the cMI is an assessment of the dependency of a transcriptional regulation on the methylation level of a CpG site. In practice, because the DNA copy number is also a major defining factor of gene expression, we incorporated the CNV data to preclude the effect of DNA copy number when quantifying the modulating function of CpG site methylation on transcriptional regulations. As shown by the analysis pipeline in Figure 1A, the cMI for each possible combination of a gene j, a CpG site m in the promoter region, and a TF i, which forms a triplet (i, j, m), was calculated and tested for statistical significance through two steps. The first step was implemented to search for the candidate TF-target circuits that depend on promoter CpG methylation and/or DNA copy number, and the second step further identified the CpG methylation-dependent TF-target associations from the list narrowed down by the first step. Both steps used the strategy of sample shuffling to estimate the real false discovery rate (FDR) by comparing the calculated cMIs with null distributions of cMIs generated by multiple types of sample permutations.
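As a rough illustration of the quantity described above, the following sketch estimates cMI(TF, Gene | CpG) with a simple equal-frequency binning estimator and compares it with a null distribution obtained by shuffling the TF profile across samples. The binning estimator, the function names, the single permutation scheme, and the synthetic data are all assumptions made for illustration; the pipeline itself builds on the estimators cited above (Kraskov et al., 2004), incorporates CNV, and uses several types of permutation that are described in the Method Details section.

```python
import numpy as np

def discretize(x, n_bins=3):
    """Equal-frequency binning of a 1-D profile into labels 0..n_bins-1."""
    ranks = np.argsort(np.argsort(x))
    return (ranks * n_bins) // len(x)

def entropy(*labels):
    """Plug-in Shannon entropy (nats) of the joint distribution of label arrays."""
    joint = np.stack(labels, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def conditional_mi(tf, gene, cpg, n_bins=3):
    """I(TF; Gene | CpG) = H(TF,CpG) + H(Gene,CpG) - H(TF,Gene,CpG) - H(CpG)."""
    t, g, c = (discretize(v, n_bins) for v in (tf, gene, cpg))
    return entropy(t, c) + entropy(g, c) - entropy(t, g, c) - entropy(c)

def permutation_pvalue(tf, gene, cpg, n_perm=200, seed=0):
    """Empirical p value from a null built by shuffling the TF profile (illustrative scheme)."""
    rng = np.random.default_rng(seed)
    observed = conditional_mi(tf, gene, cpg)
    null = np.array([conditional_mi(rng.permutation(tf), gene, cpg)
                     for _ in range(n_perm)])
    return observed, (1 + np.sum(null >= observed)) / (1 + n_perm)

# Synthetic triplet (hypothetical data): the TF drives the gene only when
# the promoter CpG site is weakly methylated.
rng = np.random.default_rng(1)
n = 300
tf = rng.normal(size=n)
cpg = rng.uniform(size=n)
gene = np.where(cpg < 0.5, tf, 0.0) + 0.3 * rng.normal(size=n)

cmi, p_value = permutation_pvalue(tf, gene, cpg)
print(f"cMI = {cmi:.3f} nats, permutation p = {p_value:.4f}")
```

A binned plug-in estimator is biased upward for small samples, which is one reason a permutation null, rather than a fixed cMI threshold, is the safer way to judge significance in a sketch like this.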

In addition, to further estimate the final FDRs of our pipeline, we added spike-in data entries at the very beginning of the analysis pipeline (Figure 1A). Specifically, negative data of randomly selected genes were simulated by sample permutation and added into the real datasets as inputs of our pipeline. These entries, by definition, are true negatives. We calibrated the cut-offs of the two major steps in our pipeline by controlling the discovery rates of these true negatives. Specifically, the first step recovered fewer than 5% of these true negatives, which entered into the second step. More conservative FDR cut-offs were used for the second step, which eventually removed all the simulated true negatives. See the Method Details section for more details.
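The spike-in idea can be sketched as follows, under stated assumptions: true negatives are created by shuffling the samples of randomly chosen genes, and a score cut-off is then chosen so that fewer than 5% of them pass. The score used here (absolute Pearson correlation with one TF), the helper names, and the simulated matrix are hypothetical stand-ins; the actual pipeline scores triplets by cMI and applies its own cut-offs at each of the two steps.

```python
import numpy as np

def make_spike_ins(expr, n_spike, rng):
    """True negatives: shuffle the samples of randomly selected genes (genes x samples)."""
    rows = rng.choice(expr.shape[0], size=n_spike, replace=False)
    return np.stack([rng.permutation(expr[r]) for r in rows])

def calibrate_cutoff(spike_scores, max_recovery=0.05):
    """Cut-off such that strictly fewer than max_recovery of spike-ins score above it."""
    s = np.sort(np.asarray(spike_scores))
    n_allowed = max(int(np.ceil(max_recovery * len(s) - 1e-9)) - 1, 0)
    return float(s[len(s) - n_allowed - 1])

rng = np.random.default_rng(0)
tf = rng.normal(size=150)                               # hypothetical TF profile
expr = 0.8 * tf + 0.6 * rng.normal(size=(400, 150))     # hypothetical TF-associated genes
spikes = make_spike_ins(expr, n_spike=200, rng=rng)

def score(profiles):
    """Stand-in association score: |Pearson r| between the TF and each profile."""
    return np.abs([np.corrcoef(tf, row)[0, 1] for row in profiles])

cutoff = calibrate_cutoff(score(spikes))
spike_recovery = float(np.mean(score(spikes) > cutoff))
real_recovery = float(np.mean(score(expr) > cutoff))
print(f"cutoff={cutoff:.3f}, spike-ins passing={spike_recovery:.3f}, real passing={real_recovery:.3f}")
```

Because the spike-ins pass through the same code path as the real entries, their recovery rate gives a direct empirical check on how permissive a cut-off is.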

Eventually, for each type of cancer, the genome-wide screen yielded a collection of triplets, of which the statistical significance of the cMI passed the filters shown in Figure 1A (Data S1). These triplets, representing the TF-target transcriptional regulation circuits that depend on the DNA methylation levels of specific CpG sites, were summarized as a network, named as DNA MeTRN, for each of these 21 types of cancer (Data S1). Basic statistics of these networks are summarized in Table S2 and Figure S1A. On average, each network involves more than 1,000 TFs regulating ~5,000 target genes, which are modulated by ~16,000 CpG methylation sites. A web-based search portal has been built for the MeTRNs (http://labyang.com/resources/metrn/table.html), which allows users to search for the MeTRN circuits in one or more cancers based on inputs of TF(s), target gene(s), or promoter CpG site(s) of interest.
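For readers who prefer to work with the triplet lists offline, a minimal sketch of the same kind of lookup is given below. It assumes the circuits have been loaded as (cancer, TF, target, CpG) records; the record layout, the `search` function, and all identifiers in the example are hypothetical placeholders and do not reflect the portal's interface or the format of Data S1.

```python
from typing import Iterable, List, NamedTuple, Optional

class Circuit(NamedTuple):
    cancer: str   # cancer type code
    tf: str       # transcription factor
    target: str   # target gene
    cpg: str      # promoter CpG site of the target gene

def search(circuits: Iterable[Circuit],
           cancers: Optional[Iterable[str]] = None,
           tf: Optional[str] = None,
           target: Optional[str] = None,
           cpg: Optional[str] = None) -> List[Circuit]:
    """Return the circuits that match every filter that was supplied."""
    wanted = set(cancers) if cancers else None
    return [c for c in circuits
            if (wanted is None or c.cancer in wanted)
            and (tf is None or c.tf == tf)
            and (target is None or c.target == target)
            and (cpg is None or c.cpg == cpg)]

# Placeholder records, not real MeTRN entries.
metrn = [
    Circuit("CANCER_1", "TF_A", "GENE_1", "cpg_0001"),
    Circuit("CANCER_1", "TF_B", "GENE_1", "cpg_0002"),
    Circuit("CANCER_2", "TF_A", "GENE_1", "cpg_0001"),
    Circuit("CANCER_2", "TF_A", "GENE_2", "cpg_0003"),
]

hits = search(metrn, cancers=["CANCER_2"], tf="TF_A")
print(hits)
```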

Systematic Cross-Validations and Examples of the MeTRN Circuitry

Given the large scales of the MeTRNs, we used 161 TF-specific ChIP-seq datasets for systematic validations of the transcriptional regulation circuits in MeTRNs. The datasets for 161 TFs in various cell contexts, which are mostly cell lines, were collected from the Encyclopedia of DNA Elements (ENCODE) project (Neph et al., 2012b). Each dot in Figure S1B represents a TF and shows the percentage of the predicted targets in a MeTRN that are supported by the ChIP-seq data of that particular TF. For each cancer type, the validation rates of the predicted targets of approximately 100 TFs range from below 10% to above 90%, with a median of approximately 40% (Figure S1B). The apparently low validation rates for some TFs are not surprising, given that the ChIP-seq datasets were generated in just one or two cell lines, and transcriptional regulation is well recognized for its high context specificity. More importantly, most of the ChIP-seq experiments and the peak-calling procedures were optimized to control for false positives at the cost of high false-negative rates. In fact, even for ChIP-seq datasets of the same TF but from different biological replicates or different cell types, their shared ChIP-seq peaks can be as low as 10% and usually will not be higher than 60% (data not shown), which is also supported by similar observations in the literature (Yang et al., 2014).
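The per-TF validation rate discussed above is simply the fraction of a TF's predicted targets that also appear among its ChIP-seq targets. A minimal sketch, assuming both are available as sets of gene identifiers (the function name and the toy identifiers are hypothetical):

```python
from statistics import median

def validation_rates(predicted, chipseq):
    """Per-TF fraction of predicted targets supported by ChIP-seq.

    TFs without ChIP-seq data, or without predicted targets, are skipped.
    """
    rates = {}
    for tf, targets in predicted.items():
        if tf in chipseq and targets:
            rates[tf] = len(targets & chipseq[tf]) / len(targets)
    return rates

# Placeholder identifiers, not real MeTRN or ENCODE entries.
predicted = {
    "TF_A": {"g1", "g2", "g3", "g4"},
    "TF_B": {"g2", "g5"},
    "TF_C": {"g6"},            # no ChIP-seq data available
}
chipseq = {
    "TF_A": {"g1", "g2", "g9"},
    "TF_B": {"g7"},
}

rates = validation_rates(predicted, chipseq)
print(rates, median(rates.values()))
```

Note that this rate treats ChIP-seq as the reference, so a low value can reflect false negatives or context differences in the ChIP-seq data as much as false predictions.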

In addition, it is worth noting that MeTRN circuitry by definition is not a comprehensive prediction of all the targets for each TF. Instead, MeTRN enriches and captures the TF-target circuits that depend on promoter CpG site(s), while the TF ChIP-seq signals do not differentiate CpG-dependent and independent TF-binding targets. Since the DNA methylome is known for its context-dependency, the CpG-dependent TF-target circuits presumably should be more dependent on the tissue context than other regular TF-target circuits are. Therefore, it is highly possible that the ChIP-seq signals could miss many of the MeTRN targets.

Nevertheless, to further assess involvements of the CpG sites in the transcriptional regulatory circuits in the MeTRNs, for each TF, we compared the ChIP-seq peak signals around the predicted CpG sites of the target genes with the peaks around the other promoter CpG sites of the same genes but not included in the MeTRNs. Analyses of the ChIP-seq data for 161 TFs showed that, in general, the modulatory CpG sites recovered by the MeTRNs for each TF do have stronger ChIP-seq peak signals, i.e., higher probabilities of TF binding, than the other CpG sites that are not in the MeTRNs (Figure S1C). Taken together, the ChIP-seq data of these 161 TFs offered high-throughput cross validations of the MeTRNs and genome-wide assessments of the MeTRNs in recapitulating the transcriptional regulation circuits that depend on DNA methylation levels at specific CpG sites.

Figure 2. Target Genes of 3 TFs, JUNB, STAT1, and JUND, which Are Present in the MeTRNs

(A–C) Target genes of 3 TFs, JUNB (A), STAT1 (B), and JUND (C), which are present in the MeTRNs of at least 3 cancer types. The biological functions and processes enriched in the target gene lists were provided, and the genes annotated to an enrichment term were shown by a line connecting the term and the gene. Target genes supported by TF-specific ChIP-seq data were highlighted with blue outline.

Finally, to show examples of the regulatory circuits encoded by the MeTRNs, we focused on the known cancer-related TFs (more than 200, sorted by median of the target numbers in 21 MeTRNs; Data S2) and looked at their regulatory targets in the MeTRNs. Three TFs (JUNB [OMIM: 165161], STAT1 [OMIM: 600555], and JUND [OMIM: 165162]), which have ChIP-seq data and are among the top 10 TFs with the largest numbers of target genes, were selected as examples and their targets in the MeTRNs of at least 2 cancer types were shown in Figure 2. The target genes supported by TF-specific ChIP-seq data were highlighted in the figures. Next, functional enrichment analyses of these target genes were performed. For example, the target genes of JUNB are generally enriched in such processes as the mitogen-activated protein kinase (MAPK) signaling pathway, NF-kB pathway, inflammatory response, and cancer pathways (Figure 2A). The targets of STAT1 were enriched in processes such as ARF6 trafficking, cell-cell adhesion, and metabolism (Figure 2B). The targets of JUND were generally enriched in development, signaling pathways, metabolism, and angiogenesis (Figure 2C). Most of these TF-target regulatory circuits have not been previously investigated in detail, although many of them are indeed supported by TF-specific ChIP-seq data. These target gene sets of the TFs recovered by the MeTRNs shed light on potential downstream functions of the TFs, which illustrates the value of the MeTRNs as a resource for extracting biological insights and generating testable hypotheses.

Prediction Power of the MeTRNs for Gene Expression Profiles in Cancers

The mRNA expression of a gene is subjected to multiple levels of regulation, including transcriptional regulations at the DNA level and post-transcriptional regulation. As previously discussed, the MeTRNs serve as maps of cancer type-specific transcriptional regulation circuitry that is dependent on the methylation levels of specific promoter CpG sites. To estimate the weight of these regulatory effects in the overall regulation of each particular gene in a given cancer context, we applied regression models for the gene expression profiles. Specifically, as mutual information was used in our pipeline to assess dependency, from which the MeTRNs were inferred, we used a different model to assess the prediction powers of the TF-CpG regulation circuits in MeTRNs to avoid circular definition. Since the ground truth of joint regulation between TF and DNA methylation on target gene expression is still unclear, we chose the linear regression model as a convenient tool for a rough estimation of the prediction powers. The expression profile of each target gene in a cancer type-specific MeTRN was modeled with a linear combination of the methylation levels and expressions of the gene's regulating CpG sites and TFs, respectively, as predicted by the MeTRN (Target~TF+CpG). From these linear regression models, the coefficient of determination (R²) for each gene in the MeTRN was calculated and adjusted for the number of explanatory terms as an estimation of how much the gene expression could be determined by the CpG-involved TF-target transcriptional regulation machinery mapped by the MeTRN of each cancer type (Data S3). Boxplots of these adjusted R² values for all the target genes in each MeTRN (Figure 3A, Target~TF+CpG) showed that the combinations of coupled TFs and CpG sites in each MeTRN have highly variable prediction powers for different genes. Importantly, the overall R² distributions from the combinatory model of the coupled TFs and CpG sites are significantly higher than the R² values from the model of just the CpG sites or just the TFs (Figure S2A), indicating a cooperative expression-determining power of the coupled TFs and CpG sites in the MeTRNs.

Figure 3. Prediction Powers of the MeTRNs for Gene Expression Profiles

(A) Linear regression models with different predictor variables (Target~TF+CpG sites, Target~CNV, and Target~TF+CpG sites + CNV) were used to fit the expression profile of each target gene in the MeTRN for each cancer type. The coefficient of determination (R²) for each gene was calculated from these models with data from a particular cancer type. Boxplots were prepared to show the distributions of the R² values of all the genes with these 3 different models for each cancer type. Whiskers extend to 1.5 times of interquartile range.

(B) A scatterplot example (BLCA) showing the R² values of each gene from the two regression models (Target~TF+CpG sites and Target~CNV). The dots are colored by the R² values of each gene from the combinatory linear regression model of Target~TF+CpG+CNV. Several diagonals were marked by gray lines, on which the x axis value plus the y axis value of the dots are the same. The same scatterplots for the other cancer types are provided in Figure S3.

(C and D) The biological and physiological processes enriched in the top genes that are highly dependent on the TF-CpG circuits in the MeTRNs (R² of Target~TF+CpG > 0.4; C) and the processes enriched in the top genes that are highly dependent on the CNV (R² of Target~CNV > 0.4; D). Saturation of the color indicates the statistical significance (–log10(Pv)) of each term.


Finally, it is worth noting that while a good fit of the linear model (high R²) indicates a strong determination potential, such linear regression models actually generated relatively conservative estimations of the prediction powers, because these models would underestimate the non-linear association patterns. Therefore, in our results, although high R² values do indicate strong linear prediction power, low R² values do not necessarily indicate lack of prediction power. They may be simply due to non-linearity of the association between the target gene and the determining factors.
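The two points above, the adjustment of R² for the number of explanatory terms and the conservativeness of a linear fit under non-linear association, can be illustrated with a small sketch. It uses ordinary least squares on synthetic data; the function name, the simulated profiles, and the effect sizes are assumptions for illustration and are not taken from the study.

```python
import numpy as np

def r2_scores(y, X):
    """Ordinary least squares with intercept; returns (R2, adjusted R2)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)   # penalize extra predictors
    return r2, adj

rng = np.random.default_rng(0)
n = 1000
tf = rng.normal(size=n)      # hypothetical TF expression
cpg = rng.uniform(size=n)    # hypothetical promoter CpG methylation (beta value)
target = tf - 2.0 * cpg + 0.5 * rng.normal(size=n)

_, adj_both = r2_scores(target, np.column_stack([tf, cpg]))  # Target~TF+CpG
_, adj_tf = r2_scores(target, tf)                            # Target~TF
_, adj_cpg = r2_scores(target, cpg)                          # Target~CpG

# A purely non-linear dependence: the TF fully determines the target,
# yet a linear fit explains very little of the variance.
nonlinear_target = tf ** 2 + 0.1 * rng.normal(size=n)
r2_nonlinear, _ = r2_scores(nonlinear_target, tf)

print(f"adj R2: TF+CpG={adj_both:.2f}, TF={adj_tf:.2f}, CpG={adj_cpg:.2f}")
print(f"R2 of linear fit to a quadratic relation: {r2_nonlinear:.3f}")
```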

Next, to compare the prediction power of the combined CpG sites and TFs mapped by the MeTRNs with another major gene expression driver, the DNA copy numbers, we also used linear regression models to estimate the coefficient of determination (R²) of the CNV for each gene (Target~CNV in Figure 3A, and detailed R² values in Data S3). In most of the cancers, the range of the R² values of the model Target~TF+CpG is generally similar to or higher than the results from the model Target~CNV, which suggests that the DNA methylation-dependent transcriptional regulation mapped in the MeTRNs is an important expression regulation machinery that is as potent as the CNV.

All the comparisons above were based on overall distributions of the R² values. For a gene-by-gene comparison between the prediction powers of the DNA methylation-dependent transcriptional regulation and the DNA copy number, we prepared scatterplots showing the R² (Target~TF+CpG) and the R² (Target~CNV) of each gene for each cancer type (bladder urothelial carcinoma [BLCA] in Figure 3B as an example, and all the remaining 20 cancers are shown in Figure S3). Clearly, in each cancer type, some genes appear to heavily rely on the TF-CpG regulation circuits in the MeTRN, while some appear to be primarily determined by the CNV (Figures 3B and S3). In addition, most of the genes fall in the lower-left triangle of the plots, suggesting that genes heavily depending on DNA methylation-modulated transcriptional regulation tend not to be highly associated with the CNV, and vice versa. Therefore, these two types of regulations are complementary to each other for determination of the general mRNA expression program.

Finally, another set of linear regression models that combine all the factors discussed above was used to fit the gene expression profiles in each cancer type (Target~TF+CpG+CNV) (Figure 3A). Interestingly, colors of the dots in Figures 3B and S3, which indicate the R² values from the combinatory model (Target~TF+CpG+CNV), are largely consistent along the diagonals (illustrated by gray lines). This pattern means that on a per-gene basis, the combinatory R² is generally equivalent to a simple summation of the two R² values from the transcription-only model (Target~TF+CpG) and the CNV-only model (Target~CNV). In other words, the two types of factors (TF-CpG combination and CNV) are independent of each other for determining the expression profiles of specific genes.
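The link between additive R² values and independence of the two predictor sets follows from a standard property of least squares, sketched here as a worked identity rather than as part of the original analysis. Let y be the centered expression profile, and let X_1 (TFs and CpG sites) and X_2 (CNV) be centered predictor blocks with projection matrices P_1 and P_2; the notation is introduced only for this illustration. If the two blocks are uncorrelated across samples, the identity is exact, and it holds approximately when they are only weakly correlated:

```latex
% Assumption: X_1^{\top} X_2 = 0 (sample-uncorrelated, centered predictor blocks),
% hence P_1 P_2 = 0 and the projection onto span[X_1, X_2] is P = P_1 + P_2.
\[
R^2_{1+2}
  = \frac{\lVert P y \rVert^2}{\lVert y \rVert^2}
  = \frac{\lVert P_1 y \rVert^2 + \lVert P_2 y \rVert^2}{\lVert y \rVert^2}
  = R^2_{1} + R^2_{2}.
\]
```

When the blocks are correlated, part of the explained variance is shared between them and the combined R² is generally no longer the sum of the two separate values, so colors that stay constant along the diagonals are consistent with little shared explanatory power between the TF-CpG combination and the CNV.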

Genes that are heavily regulated via the methylation-dependent transcriptional machinery (R² from the model Target~TF+CpG greater than 0.4) have different percentages of overlaps across different cancer types (Figure S2B). Interestingly, the biological and physiological processes enriched for these top genes (R² (Target~TF+CpG) greater than 0.4) have relatively large overlaps across different cancer types (Figure 3C), and many of them have been shown to be highly related to tumorigenesis and cancer development, for example, cell adhesion, cytoskeleton organization, development, differentiation, signaling pathways in cancer, and regulation of cell death. Note that the genes with high adjusted R² in the simple model of Target~CpG or Target~TF showed much fewer and different functional enrichments (Figures S2C and S2D). This outcome suggests that the MeTRN circuits of coupled TFs and CpG sites, instead of the CpG sites or TFs alone, are main contributors of gene expression regulation related to some key cancer processes. In contrast, generally fewer and different biological processes are enriched in genes that are dependent on the CNV (R² from the model Target~CNV greater than 0.4) in each cancer (Figure 3D). These processes include some very general terms such as RNA or DNA metabolism, modification, and maturation, and they have limited overlaps across the 21 cancer types (Figure 3D).

Regulation of Cancer-Related Genes through the MeTRNs

We believe that the MeTRNs serve as a rich resource that may shed light on some key gene expression regulation machineries, for example, which regulators could be involved in determining the expression dynamics of the known cancer-related genes in different cancers. Focusing on these cancer-related genes, we summarized their dependencies on the CNV and/or the DNA methylation-dependent transcriptional regulation circuits in the MeTRNs for different cancers (Figure 4A; Data S4). Some of the cancer genes showed consensus across different cancers in their dependencies on the CNV or the MeTRN circuits. Figure 4B lists the top 10 cancer-related genes that are highly dependent on the MeTRN circuits. Note that the R² values of these 10 genes from the linear models with CpG sites only (Target~CpG) or TFs only (Target~TF) are much lower than those from the TF+CpG models in the same cancers (Figure 4B), which again suggests the combinatory prediction power of the coupled TFs and CpG sites in the MeTRNs.

For example, the top gene, FLI1 (OMIM: 193067), is strongly

dependent on the MeTRN in 16 of 21 cancers (Figure 4B; Data

S4). Interestingly, as a TF and a proto-oncogene, FLI1 has

been found to be strongly dysregulated via its promoter DNA

methylation levels in various diseases, including the autoimmune

disease scleroderma (Wang et al., 2006), leishmaniasis (Almeida

et al., 2017), andmultiple cancers, such as gastric and colorectal

cancers (Lin et al., 2015; Sepulveda et al., 2016). Therefore, it is

not surprising that the promoter CpG sites, together with specific

TFs, play a major role in determining the expression of FLI1 in

multiple cancers. Furthermore, the MeTRN circuitry provides a

detailed framework of which CpG sites were coupled with spe-

cific TFs in executing such regulatory function. Figure 4C illus-

trated all the TF-CpG circuits that regulate FLI1 in at least 3 can-

cer types. ZC4H2 (OMIM: 300897) appeared to be the most

frequent TF (in 5 cancers) for FLI1 expression regulation (Fig-

ure 4C). In literature, it is still largely unknown which TFs are

involved in the transcriptional regulation of FLI1 in cancers,

and the TF-CpG combinations that we present here could be

candidates for further investigations. Take stomach adenocarci-

noma (STAD) as an example to study the cancer type-specific

A

B

C D

Figure 4. Prediction Powers of the MeTRNs and CNV for Expression Profiles of the Cancer-Related Genes

(A) Linear regression models with different predictor variables were used to fit the expression profile of each cancer-related gene in the MeTRN in the tumors for

each cancer type. The coefficient of determination (R2) for each gene was calculated from these models (Target~TF+CpG sites, Target~CNV, and

Target~TF+CpG sites + CNV) with data from a particular cancer type. Finally, boxplots were prepared to show the distributions of the R2 values of all the genes

with these 3 different models for each cancer type. Whiskers extend to 1.5 times the interquartile range.

(B) Top 10 genes were shown as examples of the cancer genes that are highly dependent on the MeTRN regulators in multiple types of cancer, as shown by their

R2 values from the linear regression model of Target~TF+CpG sites. The specific R2 values from the model of Target~TF+CpG, Target~CpG sites, or Target~TFs

in particular cancer types were marked with vertical bars, of which the color indicates the type of cancer. Numbers of cancer types in which a gene was found as a

target in the MeTRN networks are shown in parentheses following the gene names.

(C) 5 TFs predicted to target FLI1 in at least 3 types of cancers. 42 out of 46 promoter CpG sites were involved in these transcriptional circuits.

(D) 18 TFs predicted to target FLI1 in STAD. 12 of these 18 TFs were predicted to target FLI1 only in STAD. 36 of the FLI1 promoter CpG sites were involved.

regulation. Recently, FLI1 has been shown to be subjected to

DNA methylation-mediated dysregulation in gastric cancer

(Sepulveda et al., 2016), but the detailed mechanisms of such regulation

are not clear. The MeTRN of STAD has mapped 18 TFs and

36 CpG sites for the transcriptional regulation circuitry of FLI1

(Figure 4D). Among these 18 TFs, EHF (OMIM: 605439) has

been shown to target FLI1 according to the ChIP-seq data. In

brief, the case of FLI1 as an example shows what types of in-

sights into the gene transcriptional regulation program can be

extracted from the MeTRNs as a resource.

Finally, considering the dynamic gene expression determina-

tion powers of the MeTRNs across different cancer contexts,

we evaluated the similarities among the 21 cancer types in de-

pendencies of the cancer genes on the cancer-specific MeTRN


Figure 5. Similarities between Cancers

Based on Prediction Powers of the MeTRNs

and CNV for the Cancer Gene Expressions

(A) Cancer similarity matrices indicating the Pear-

son correlation between the R2 values of the can-

cer genes in each pair of the cancers. The R2

values were calculated from linear regression

models of Target~TF+CpG (upper triangle) or from

the model of Target~CNV (lower triangle).

(B) Scatterplots showing the dependencies of the

cancer genes on the MeTRN regulators (left) or on

the CNV (right), in two cancers as examples, STAD

and LUAD.

circuitry. Specifically, Pearson’s correlation between the R2

values of the cancer genes for each pair of the cancers, from

the model of Target~TF+CpG, was used as a quantification of

the similarity between the two cancers (Figure 5A, upper

triangle). With the same strategy, the cancer similarities were

also calculated based on the R2 values from the model of

Target~CNV (Figure 5A, lower triangle). It appears that, in general,

dependencies of the cancer genes on the cancer-specific

MeTRNs vary substantially across different types of cancers,

whereas their dependencies on the CNV are relatively more

conserved across cancers (Figure 5A). For example, scatterplots

of the R2 values of the cancer genes in two cancers, STAD and

lung adenocarcinoma (LUAD) (Figure 5B), showed a much higher

correlation of the R2 values from Target~CNV than the R2 values

from Target~TF+CpG between the two cancers. Indeed, many

commonly known cancer genes bear copy number alterations

in multiple types of cancers, which could result in similar depen-

dencies of these genes on CNV. In contrast, it has been shown

that different tissues or cancers have highly heterogeneous

DNA methylomes (Kundaje et al., 2015; Schultz et al., 2015; Var-

ley et al., 2013), which could give rise to different levels of depen-

dencies of gene expression on CpG-modulated transcriptional

regulations across cancers. Therefore, our results indicate that

inter-cancer heterogeneity potentially results from highly

dynamic and heterogeneous transcriptional regulation that is

dependent on the context-specific DNA methylomes.
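
To make the comparison above concrete, the following is a minimal Python sketch, not the authors' pipeline, of how a per-gene R2 could be obtained from the linear models Target~TF+CpG and Target~CNV, and how two cancers could then be compared through the Pearson correlation of their per-gene R2 values. The simulated profiles and the function names r_squared and cancer_similarity are assumptions made for illustration only.

```python
import numpy as np


def r_squared(y, predictors):
    """R2 of an ordinary least-squares fit of y on the predictors (with intercept)."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(predictors, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def cancer_similarity(r2_cancer_a, r2_cancer_b):
    """Pearson correlation between per-gene R2 values of two cancers.

    Both inputs must list the same genes in the same order.
    """
    return float(np.corrcoef(r2_cancer_a, r2_cancer_b)[0, 1])


# Simulated (hypothetical) profiles for one target gene across 200 tumors.
rng = np.random.default_rng(0)
n_tumors = 200
tf = rng.normal(size=n_tumors)      # TF expression
cpg = rng.uniform(size=n_tumors)    # promoter CpG methylation (beta value)
cnv = rng.normal(size=n_tumors)     # copy number of the target gene
target = 2.0 * tf - 1.5 * cpg + 0.1 * rng.normal(size=n_tumors)

r2_tf_cpg = r_squared(target, np.column_stack([tf, cpg]))  # Target~TF+CpG
r2_cnv = r_squared(target, cnv)                            # Target~CNV
print(f"R2 (Target~TF+CpG) = {r2_tf_cpg:.3f}, R2 (Target~CNV) = {r2_cnv:.3f}")
```

In this toy setting the target is driven by the TF and the CpG site but not by the copy number, so the first model explains almost all of the variance and the second explains almost none. Applying cancer_similarity to the R2 vectors of every pair of cancers would fill one triangle of a similarity matrix of the kind shown in Figure 5A.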


Classifications of Cancer Subtypes Based on the Regulatory Factors in the MeTRNs

As previously discussed, by mapping

the transcriptional regulation circuits

composed of promoter CpG sites

coupled with specific TFs, the MeTRNs

recapitulated an important up-stream

layer of gene expression regulation.

Here, we further explored the potential of

the contributing factors in the MeTRNs,

i.e., TFs and CpG sites, as classifiers of

cancer patients. Again, we selected TF-

CpG circuits that perform well in deter-

mining the target gene expressions (R2 >

0.5) for each cancer type. These collec-

tions of CpG sites and TFs were then

used for an unsupervised clustering anal-

ysis to classify patients of each cancer type (kidney renal papil-

lary cell carcinoma [KIRP] as an example shown in Figure 6A and

the other 20 in Figure S4). Cancer type-specific Kaplan-Meier

survival curves were then prepared for each of the patient sub-

groups (KIRP in Figure 6B and the other 20 in Figure S5). Ten

of the 21 cancer types, namely, KIRP, brain lower grade glioma

[LGG], pancreatic adenocarcinoma [PAAD], breast invasive car-

cinoma [BRCA], sarcoma [SARC], acute myeloid leukemia

[LAML], kidney renal clear cell carcinoma [KIRC], liver hepatocel-

lular carcinoma [LIHC], skin cutaneous melanoma [SKCM], and

cervical squamous cell carcinoma and endocervical adenocarci-

noma [CESC], showed significantly different survival curves

among different patient subgroups (smallest p value < 0.05) (Fig-

ures 6B and S5), which suggests that the TFs and CpG sites that

have major roles in determining the target gene expression are

potential prognostic biomarkers for these cancers. The other

cancer types did not show significantly different prognoses

associated with the classifications (p > 0.05, Figures S4 and S5),

suggesting that for these cancers, the epigenetically encoded

transcriptional regulation mapped by the MeTRNs may not be

a dominating determinant of cancer aggressiveness.
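
The survival comparison between two patient subgroups can be sketched as follows. This is a minimal pure-Python illustration of a Kaplan-Meier estimate and a two-group log-rank test, assuming right-censored follow-up times; it is not the code used in this study, and the follow-up times, the group labels, and the function names kaplan_meier and logrank_p are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability)."""
    survival, curve = 1.0, []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the chi-square (1 d.f.) p value."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n_a = sum(1 for u in times_a if u >= t)       # at risk in group a
        n = sum(1 for u in times if u >= t)           # at risk overall
        d_a = sum(1 for u, e in zip(times_a, events_a) if e and u == t)
        d = sum(1 for u, e in zip(times, events) if e and u == t)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 1.0
    statistic = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical follow-up in months; event = 1 means death, 0 means censored.
months_good = [60, 62, 65, 70, 72, 75, 80, 85]
died_good = [0, 0, 0, 1, 0, 0, 0, 0]
months_poor = [10, 14, 18, 20, 24, 30, 33, 36]
died_poor = [1, 1, 1, 1, 1, 1, 0, 1]

curve_good = kaplan_meier(months_good, died_good)
p_value = logrank_p(months_good, died_good, months_poor, died_poor)
print(curve_good, p_value)
```

With more than two subgroups, one option consistent with the text above is to run the test on every pair of subgroups and report the smallest p value, although any correction for multiple comparisons is left out of this sketch.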

Previously, the DNA methylome profiles (usually for the most

variable CpG sites) were often used for classification of cancer

subtypes in typical cancer genomics studies (Cancer Genome

Atlas Network, 2012; Cancer Genome Atlas Research Network,

2011, 2012, 2014). In many cases, such cancer subtypes also


Figure 6. TFs and CpG Sites in MeTRNs

Serve as Classifiers of Prognostically

Different Patient Subgroups

(A) Take KIRP as an example. Unsupervised hier-

archical clustering analysis was performed with

the CpG site methylation and TF expression pro-

files of the DNA methylation-dependent tran-

scriptional regulation circuits in the MeTRN that

exhibited strong prediction power for the target

gene expression profiles, as shown from the linear

combinatory model Target~TF+CpG. The patient

subgroups (k1–k4) from the clustering analysis are

marked by different colors. Similar results for the

other 20 cancer types are provided in Figure S4.

(B–D) Kaplan-Meier survival curves showing

comparisons of the overall survival between

different subgroups of KIRP patients identified

based on their MeTRN regulators (B), the top

variable CpG sites (C), or the TFs (D) predicted by

ARACNe. The p value for the statistical signifi-

cance of the largest prognosis difference among

the cancer subtypes was inferred with a log-rank

test.

The survival curves of the other 20 cancer types

are shown in Figures S5–S7.

showed certain levels of prognostic differences. For the purpose

of comparison with the subtype classifications based on the

MeTRN regulators, we followed the canonical strategy and per-

formed tumor-clustering analysis with the promoter CpG sites

that have the most variable methylation profiles for each type

of cancer. The survival curves of these cancer subtypes are

shown in Figure 6C for KIRP and Figure S6 for the other 20 can-

cers, and similarly, the prognostic differences were quantified

through p values. Take KIRP as an example, which exhibited

the most significant prognosis differences. Compared with the

CpG sites with the most variable methylation profiles, the

coupled TFs and CpG sites in the MeTRN indeed resulted in

wider separations of the survival curves and a more significant

p value (Figures 6B and 6C). Such superior classification potential

of the MeTRN regulators to that of the dynamic CpG sites was

observed in many of the other 20 cancer types with some excep-

tions, in which the different patient classification strategies re-

sulted in similar levels of prognostic differences (Figures S5

and S6). In addition, we reconstructed TF-target transcriptional

regulation networks with ARACNe (Basso et al., 2005; Margolin

et al., 2006), which does not take into account the DNA methyl-

ation profiles. We then followed the same pipeline and selected

the TFs with strong prediction powers for target gene expression

profiles. Similarly, we tested the performances of these TFs


alone in defining prognostic cancer sub-

types. As shown in Figures 6D and S7,

in KIRP and many other cancer types,

this strategy of patient classification was

also outperformed by the TFs and CpGs

in combination from the MeTRNs as clas-

sifiers. Therefore, by coupling the regula-

tory TFs and the modulating CpG sites,

the MeTRNs provide an alternative strat-

egy for classifying the cancer subtypes, which is indeed associ-

ated with the prognoses, although further investigations would

be needed to fully elucidate the physiological relevance and

the molecular basis of these observations.

Finally, to further investigate the biological relevance of the

cancer classifications based on the MeTRN regulators, we

took the two subtypes of KIRP, k1 and k3, as examples and

looked into their difference. These two subtypes, classified by

the TFs and CpG sites shown in Figure 6A, have the most distinct

survival curves among all the KIRP subtypes (Figure 6B). The

5-year survival rate for patients in k1 was about 95%, while

most of the patients in the aggressive subtype k3 died within

2 to 3 years. Therefore, it is of great value to elucidate which

transcriptional regulation circuits were altered, and how, in subtype

k3 with the poorest clinical outcome. Differential methylation

analysis of the CpG sites used for the clustering analysis

in Figure 6A showed that a major proportion of the strong and

significant methylation variations were upregulations in k3

versus k1 (Figure 7A). On the other hand, the gene differential

expression analysis of the TFs used for the subtype classifica-

tions showed the opposite trend, i.e., a major proportion of the

significant TF expression variations were downregulations in k3

versus k1 (Figure 7B). The top 781 upregulated CpG sites and

the top 17 downregulated TFs in k3 were selected (Figures 7A



Figure 7. Comparison between Two Prognostically Different KIRP Subtypes Classified by MeTRN Regulators

(A and B) Differential methylation (A) and differential expression (B) analyses of the MeTRN regulators (CpG sites and TFs, respectively) that were used for the

clustering analysis in Figure 6. 781 CpG sites and 17 TFs were deemed strongly up- and downregulated, respectively, in k3 versus k1.

(C) 598 target genes of the CpG sites and TFs identified in (A) and (B) were found in the KIRP MeTRN. GSEA analysis was performed to show enrichment of the 598

target genes in the downregulated genes by comparing k3 versus k1.

(D) Gene functional enrichment analysis of the 598 target genes. The enrichment p value cut-off was set at 0.01.

and 7B), and they were involved in the regulatory circuits of 598

target genes in the KIRP MeTRN (Figure 7C). A gene set enrich-

ment analysis (GSEA) showed that these 598 genes were signif-

icantly enriched in the downregulated genes in k3 versus k1, with

the differential expression profile of all the genes as background

(Figure 7C; Data S5). In other words, downregulation of the TFs

and upregulation of the CpG sites were associated with repression

of their target genes mapped by the MeTRN. Furthermore, these

598 genes were significantly enriched for various key functions or

processes related to cancer, such as cell adhesion, differentia-

tion, vasculogenesis, Wnt signaling, BRD4 complex, and cell

proliferation (Figure 7D).
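
The idea behind the enrichment test described above can be illustrated with a short Python sketch of a running-sum enrichment score. This is the unweighted, Kolmogorov-Smirnov-like variant, chosen here for brevity; it is not necessarily the GSEA configuration used in this study, and the permutation-based significance estimate is omitted. The gene names and fold changes are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set in a ranked list.

    ranked_genes is ordered from most upregulated to most downregulated.
    A negative score means the set is concentrated among downregulated genes.
    """
    hits = set(gene_set) & set(ranked_genes)
    n_hit = len(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, extreme = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in hits else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running  # keep the largest deviation from zero
    return extreme


# Hypothetical log2 fold changes of subtype k3 versus k1.
log_fold_change = {
    "gene_a": 2.1, "gene_b": 1.4, "gene_c": 0.9, "gene_d": 0.3, "gene_e": -0.2,
    "gene_f": -0.8, "gene_g": -1.1, "gene_h": -1.9, "gene_i": -2.4, "gene_j": -3.0,
}
ranked = sorted(log_fold_change, key=log_fold_change.get, reverse=True)
metrn_targets = {"gene_h", "gene_i", "gene_j"}  # hypothetical MeTRN target genes

score = enrichment_score(ranked, metrn_targets)
print(f"enrichment score = {score:.2f}")
```

Because all three hypothetical targets sit at the bottom of the ranking, the running sum reaches its most negative value just before the first of them, which mirrors the enrichment among downregulated genes reported for the 598 target genes.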

In summary, the cancer-specific MeTRNs not only provided

classifiers of cancer subtypes that are physiologically and biologically

relevant, but also mapped the downstream gene

expression changes that resulted from perturbations of

the transcriptional regulation circuitry at the TFs or CpG sites.

Therefore, by coupling the TF-target transcriptional regulatory

circuits with specific promoter CpG sites, the MeTRNs are a


unique resource for understanding the transcriptional regulation

machinery underlying cancer and dissecting the driving forces of

gene expression dysregulation in tumors.

DISCUSSION

Methylation levels of the promoter CpG sites have been recog-

nized as potent regulators of gene expression (Robertson,

2005; Smith and Meissner, 2013), and dysregulation of the pro-

moter CpG methylation has been connected to a wide range of

physiological consequences, including cancer (Chatterjee and

Vinson, 2012; Jones, 2012; Liu et al., 2018). Most previous

studies were focused on specific CpG sites or regions and

have derived detailed epigenetic gene regulation machineries.

A comprehensive survey of the potential functions of the cancer

context-specific DNA methylome, which should have substantial

value for understanding the multi-level gene expression regula-

tion machinery in cancers, is still largely unavailable at the

single-CpG site level.

Our survey of the CpG-gene correlations revealed significant

numbers of CpG sites that are correlated, but not at a very

strong level, with the gene expression. Many of the CpG-

gene pairs showed very interesting patterns as shown in Fig-

ures 1B and 1C. We reason that for a given gene whose TF

(or TFs) is redundantly available, its expression level would

mainly depend on the accessibility of the TF protein to the pro-

moter region. In such cases, if the TF binding depends on

methylation (or demethylation) of a CpG site, then the methyl-

ation level of this CpG site would serve as a rate-limiting factor

of the transcription activation or repression, which would lead

to a strong correlation between DNA methylation and gene

expression. On the other hand, in a similar scenario but with

a limited supply of the TF protein, given a certain fixed level

of DNA methylation, gene transcription would be sensitive to

abundance of the TF. However, eventually the DNA methylation

level determines the upper or lower limit that the gene expres-

sion can reach by controlling the maximum accessibility to the

CpG site by the TF. This explains the ‘‘triangle-shaped’’ corre-

lation patterns between the CpG site methylation and gene

expression levels exemplified in Figures 1B and 1C. This

prompted us to systematically identify such types of methyl-

ation level-dependent transcriptional regulation circuits on a

genome-wide scale in different cancers.
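
The reasoning above can be illustrated with a small simulation. The sketch below assumes a deliberately simple toy model, which is our assumption and not a result of this study: methylation of one CpG site blocks TF binding, so promoter accessibility is (1 - methylation) and expression equals accessibility multiplied by TF abundance. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tumors = 2000
methylation = rng.uniform(0.0, 1.0, n_tumors)  # beta value of one promoter CpG site
tf_level = rng.uniform(0.0, 1.0, n_tumors)     # relative abundance of the TF

# Assumed toy model: accessibility caps the expression that the TF can drive.
expression = (1.0 - methylation) * tf_level

# Upper envelope of expression within four equal-width methylation bins.
bin_index = np.minimum((methylation * 4).astype(int), 3)
bin_max = np.array([expression[bin_index == b].max() for b in range(4)])

pearson_r = float(np.corrcoef(methylation, expression)[0, 1])
print("maximum expression per methylation bin:", np.round(bin_max, 2))
print("Pearson correlation:", round(pearson_r, 2))
```

Under this model the maximum attainable expression falls steadily as methylation increases, while tumors with a low TF supply fill the area underneath. The scatter is therefore triangle-shaped and the overall correlation is only moderate, in line with the patterns exemplified in Figures 1B and 1C.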

This task was done by reanalyzing the multi-omics data in

TCGA, which offers unique advantages for integrative data anal-

ysis, including large sample sizes, broad coverage of cancer

types, parallel multi-omics datasets, consistent procedures for

high-throughput profiling, and the availability of the patient sur-

vival data. Most importantly, the inter-tumoral heterogeneity

within each type of cancer produced largely dynamic datasets,

which made it possible to conduct sophisticatedly designed sta-

tistical analyses for mining of data patterns indicating potential

regulation machineries. Specifically, we designed an analysis

pipeline based on the concept of cMI to quantify the dependency

of each possible TF-gene association on the methylation level of

each promoter CpG site. Estimations of MI and cMI require

dynamic datasets with large sample sizes, and that is one of

the reasons why the cancer type-specific multi-omics data in

TCGA is best suited for our analysis pipeline. Outside of TCGA,

datasets from large cancer cohorts that meet the requirements

of our analysis method are extremely rare. The criteria include

large sample size (>150), independent tumor samples, matched

data of RNA expression, DNA methylation, and DNA copy

numbers from each tumor.

In TCGA, the dataset of BRCA has the largest number of

normal tissue samples, among which only 85 were usable for

our pipeline as these samples have matched data of DNA

methylation, mRNA expression, and DNA copy numbers. Unlike

the tumor tissues, the normal tissues are much less heteroge-

neous. Dynamic ranges of the multi-omics profiles of normal tis-

sues are much smaller. Therefore, due to the small sample size

and limited data variation, inference of mutual information and

reconstruction of MeTRN are less reliable. Indeed, the MeTRN

from normal tissue data of BRCA and the MeTRN from tumor

data are very different (data not shown). However, as discussed

above, although such a difference can be attributed to cancer

specificity of MeTRNs, quality of the normal tissue data (smaller

sample size and limited dynamic range) could be another main

reason for the difference between cancer and normal MeTRNs.

It is worth noting that the multi-omics profiles used in our study

were obtained by TCGA from bulk tumor samples, which are het-

erogeneous mixtures of tumor cells and others, such as stromal

cells and lymphocytes. Previous studies have comprehensively

evaluated purity of the tumor tissue samples profiled by TCGA

(Aran et al., 2015; Zheng et al., 2017). We took the data from

Aran et al. (2015), in which the per-sample purities were inferred

with 4 different methods and the consensus was taken. The majority

of the tumors used in our study have purities of around 0.8,

and a small proportion of the tumor samples fall below 0.6 (for

supporting information, see Figure 5B in Liu et al., 2018). Some

cancer types have slightly lower tumor purities in general, but still

most are higher than 0.6. Therefore, although tumor heterogene-

ity is an obvious issue here, the tumor tissue samples in TCGA

were still dominated by tumor cells of particular types of cancer.

This is the basis for most of the cancer cohort studies based on

TCGA data or other datasets obtained from tumor tissues. As far

as we know, there is not a consensus strategy to fully address

the issue of tumor heterogeneity for large-scale cohort studies,

and therefore, we followed the common practice and relied on

the assumption that the molecular profiles are largely reflective

of the dominating tumor cells.

In the present study, a DNA MeTRN was assembled for each

of the 21 major cancer types in TCGA. By coupling the specific

CpG sites, TFs, and the target gene, the MeTRN is a de novo

reconstruction of the promoter DNA methylation level-depen-

dent transcriptional regulation circuitry in each cancer type,

which we believe provides the basis for further context-specific

studies of the functional DNA methylation at the resolution of single-CpG

sites and particular TFs for specific genes. On the other

hand, the purpose of our study was to capture the promoter CpG

sites that are potentially involved in TF-target regulations, and

our methodology does not presume that the CpG sites being

tested here are the only determining factors. In other words,

our focus on the promoter CpG sites does not preclude other

distal machineries, such as the enhancer CpG sites (or sites in

other potential regulatory regions such as CGI, shores, and

shelves). For these potentially distal regulatory CpG sites, given

their increased distances to TSS, it becomes difficult to precisely

identify the ‘‘target’’ genes that are being regulated by these

sites. In addition, the coverage of these distal CpG sites in the

current data is poor compared to the promoter sites. However,

the current analysis can certainly be repeated for the enhancer

CpG sites if we have precise definitions of the context-specific

enhancer regions for each gene and if the similar large-scale

enhancer CpG site methylation datasets are available.

Our computational analysis pipeline was based on the

concept of cMI, which quantifies the gain of mutual information

between a TF and a target gene due to introduction of the

methylation level of a CpG site in the promoter region. Essen-

tially, the computational analysis itself captured associations

rather than causational regulations. However, our analysis was

limited to each particular gene coupled with the CpG sites in

the promoter region. Therefore, an assumption here is that the

association, if any, between expression of a gene and methyl-

ation level of a promoter CpG site, should be due to a directional


regulation of the gene expression by the promoter CpG site,

which affects accessibility of the promoter to a TF. Although

there could be some exceptions, we believe that such an assumption

should be valid in most cases, and many previous studies

about DNA methylation have been based on the same assumption

as well. Similarly, a strong TF-target gene association does

not necessarily indicate a directional regulation circuit from the

TF to the target gene. Many alternative regulation machineries

could result in such strong association, for example the TF and

the target gene both being regulated by another upstream regu-

lator. Therefore, we kept a TF-target gene pair only if the TF-gene

expression association depends on the methylation level of a

promoter CpG site of the target gene. Such a criterion would

exclude many alternative machineries other than a TF-to-target

gene circuit, because these potential alternative machineries

would not be directly dependent on the promoter CpG site.
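
As an illustration of the cMI concept discussed above, the following Python sketch computes the mutual information between a TF and a target gene, and the same quantity conditioned on the methylation state of a CpG site. It uses a simple plug-in estimator on discretized states, which is swapped in here for brevity and is not the estimator described in the STAR Methods. Reading the "gain" as cMI minus MI is likewise an assumption of this sketch, and the simulated data and function names are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for discrete sequences."""
    n = len(x)
    n_xy, n_x, n_y = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * math.log2(c * n / (n_x[a] * n_y[b]))
               for (a, b), c in n_xy.items())


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in bits for discrete sequences."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz, n_yz, n_z = Counter(zip(x, z)), Counter(zip(y, z)), Counter(z)
    return sum(c / n * math.log2(c * n_z[k] / (n_xz[(a, k)] * n_yz[(b, k)]))
               for (a, b, k), c in n_xyz.items())


# Simulated tumors: the target follows the TF only when the CpG site is
# unmethylated (state 0); when it is methylated (state 1) the target is random.
random.seed(0)
n_tumors = 2000
cpg = [random.randint(0, 1) for _ in range(n_tumors)]
tf = [random.randint(0, 1) for _ in range(n_tumors)]
target = [t if m == 0 else random.randint(0, 1) for t, m in zip(tf, cpg)]

mi = mutual_information(tf, target)
cmi = conditional_mutual_information(tf, target, cpg)
print(f"I(TF;Target) = {mi:.3f} bits, I(TF;Target|CpG) = {cmi:.3f} bits")
```

In this simulation the TF-target association is much stronger once the CpG state is taken into account, which is the kind of methylation-dependent circuit the pipeline is designed to retain. For continuous expression and methylation profiles, the values would first have to be discretized or a nearest-neighbor estimator would be needed, and large sample sizes become important, as noted earlier in the Discussion.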

To showcase the applications of these informative cancer-

specific MeTRN networks in gaining insights into the cancer

type-dependent gene regulatory machinery, we quantitatively

assessed the contributions of the regulatory circuits in the

MeTRNs in gene expression regulation. Our results suggested

that the epigenetically modulated transcriptional regulation

machinery encoded in the MeTRNs is independent from the

DNA copy numbers and is at least equally, if not more, potent

than the DNA copy numbers for determining the gene expres-

sion dynamics in tumors (Figures 3 and S3). Many genes with

no CNV or that are not responsive to their CNVs, including

many known cancer genes, are being controlled via such

epigenetically encoded transcriptional regulation. Cancer-

related biological processes are enriched in these genes,

and subgroups of patients with different methylome patterns

at the CpG sites selected by the MeTRNs showed largely

different prognoses, which suggests potential physiological

consequences of perturbing the transcriptional regulation

scheme at the DNA methylation level. Therefore, MeTRN is a

high-resolution functional survey of the DNA methylation-

dependent transcriptional regulations in cancers. By recapitu-

lating the epigenetic scheme involved in the cancer context-

dependent transcriptional regulation circuitry, our results

serve as a framework and resource for dissecting the driving

force of gene expression regulation related to tumorigenesis

and cancer development.

STAR★METHODS

Detailed methods are provided in the online version of this paper

and include the following:

• KEY RESOURCES TABLE
• CONTACT FOR REAGENT AND RESOURCE SHARING
• METHOD DETAILS
  ◦ Data collection and pre-processing
  ◦ ChIP-seq data collection and analysis
  ◦ Genome-wide identifications of the TF-target regulatory circuits modulated by CpG site methylation
  ◦ Implementation of the pipeline
  ◦ Functional enrichment analysis
  ◦ Patient classification and survival analysis
  ◦ Differential expression and differential methylation analysis
• QUANTIFICATION AND STATISTICAL ANALYSIS
  ◦ Software and statistical analysis
  ◦ Strategy for randomization
  ◦ Data exclusion
• DATA AND SOFTWARE AVAILABILITY

SUPPLEMENTAL INFORMATION

Supplemental Information can be found with this article online at https://doi.org/10.1016/j.celrep.2019.02.084.

ACKNOWLEDGMENTS

The authors wish to acknowledge support from the Computing Platform and

the Gene Sequencing Platform of National Protein Science Facility (Beijing).

This work was supported by the National Key Research and Development Program,

Precision Medicine Project (2016YFC0906001 to X.Y.), the National Natural

Science Foundation of China (31671381, 91540109, and 81472855 to

X.Y.), the Tsinghua University Initiative Scientific Research Program

(2014z21046 to X.Y.), the Tsinghua–Peking Joint Center for Life Sciences,

and the 1000 Talent Program (Youth Category).

AUTHOR CONTRIBUTIONS

Yu Liu, Yang Liu, and X.Y. conceived and designed the study. Yu Liu developed

the algorithmic analysis pipeline for reconstruction, validation, and analysis

of the MeTRNs, with help from Z.X. and J.W. Yang Liu and R.H. performed

the patient classification analysis and prepared survival curves, with help from

W.S., Y.Y., and S.D. X.Y. supervised the whole project. X.Y. wrote the manu-

script with input from Yu Liu, Yang Liu, and R.H.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: March 19, 2018

Revised: October 1, 2018

Accepted: February 21, 2019

Published: March 19, 2019

REFERENCES

Almeida, L., Silva, J.A., Andrade, V.M., Machado, P., Jamieson, S.E., Carvalho,

E.M., Blackwell, J.M., and Castellucci, L.C. (2017). Analysis of expression

of FLI1 and MMP1 in American cutaneous leishmaniasis caused by Leishmania

braziliensis infection. Infect. Genet. Evol. 49, 212–220.

Aran, D., Sirota, M., and Butte, A.J. (2015). Systematic pan-cancer analysis of

tumour purity. Nat. Commun. 6, 8971.

Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla-Favera, R., and Cal-

ifano, A. (2005). Reverse engineering of regulatory networks in human B cells.

Nat. Genet. 37, 382–390.

Blattler, A., and Farnham, P.J. (2013). Cross-talk between site-specific tran-

scription factors and DNA methylation states. J. Biol. Chem. 288, 34287–

34294.

Budden, D.M., Hurley, D.G., Cursons, J., Markham, J.F., Davis, M.J., and

Crampin, E.J. (2014). Predicting expression: the complementary power of histone

modification and transcription factor binding data. Epigenetics Chromatin

7, 36.

Cancer Genome Atlas Network (2012). Comprehensive molecular portraits of

human breast tumours. Nature 490, 61–70.

Cancer Genome Atlas Network (2015). Comprehensive genomic characteriza-

tion of head and neck squamous cell carcinomas. Nature 517, 576–582.



Cancer Genome Atlas Research Network (2011). Integrated genomic analyses

of ovarian carcinoma. Nature 474, 609–615.

Cancer Genome Atlas Research Network (2012). Comprehensive genomic

characterization of squamous cell lung cancers. Nature 489, 519–525.

Cancer Genome Atlas Research Network (2014). Comprehensive molecular

profiling of lung adenocarcinoma. Nature 511, 543–550.

Chatterjee, R., and Vinson, C. (2012). CpG methylation recruits sequence specific

transcription factors essential for tissue specific gene expression. Biochim.

Biophys. Acta 1819, 763–770.

Domcke, S., Bardet, A.F., Adrian Ginno, P., Hartl, D., Burger, L., and

Schübeler, D. (2015). Competition between DNA methylation and transcription

factors determines binding of NRF1. Nature 528, 575–579.

Frenzel, S., and Pompe, B. (2007). Partial mutual information for coupling anal-

ysis of multivariate time series. Phys. Rev. Lett. 99, 204101.

Hu, Z., Killion, P.J., and Iyer, V.R. (2007). Genetic reconstruction of a functional

transcriptional regulatory network. Nat. Genet. 39, 683–687.

Hu, S., Wan, J., Su, Y., Song, Q., Zeng, Y., Nguyen, H.N., Shin, J., Cox, E., Rho,

H.S., Woodard, C., et al. (2013). DNA methylation presents distinct binding

sites for human transcription factors. eLife 2, e00726.

Hunter, J.D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Sci-

ence & Engineering 9, 90–95.

Jiang, P., Freedman, M.L., Liu, J.S., and Liu, X.S. (2015). Inference of transcriptional

regulation in cancers. Proc. Natl. Acad. Sci. USA 112, 7731–7736.

Johnson, D.S., Mortazavi, A., Myers, R.M., and Wold, B. (2007). Genome-wide

mapping of in vivo protein-DNA interactions. Science 316, 1497–1502.

Jones, P.A. (2012). Functions of DNA methylation: islands, start sites, gene

bodies and beyond. Nat. Rev. Genet. 13, 484–492.

Kraskov, A., Stogbauer, H., and Grassberger, P. (2004). Estimating mutual in-

formation. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 69, 066138.

Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi,

A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M.J., et al.; Roadmap Epige-

nomics Consortium (2015). Integrative analysis of 111 reference human epige-

nomes. Nature 518, 317–330.

Li, Y., Liang, M., and Zhang, Z. (2014). Regression analysis of combined gene

expression regulation in acute myeloid leukemia. PLoS Comput. Biol. 10,

e1003908.

Lin, P.C., Lin, J.K., Lin, C.H., Lin, H.H., Yang, S.H., Jiang, J.K., Chen, W.S.,

Chou, C.C., Tsai, S.F., and Chang, S.C. (2015). Clinical Relevance of Plasma

DNA Methylation in Colorectal Cancer Patients Identified by Using a

Genome-Wide High-Resolution Array. Ann. Surg. Oncol. 22 (Suppl 3),

S1419–S1427.

Liu, Y., Toh, H., Sasaki, H., Zhang, X., and Cheng, X. (2012). An atomic model

of Zfp57 recognition of CpG methylation within a specific DNA sequence.

Genes Dev. 26, 2374–2379.

Liu, Y., Huang, R., Liu, Y., Song, W., Wang, Y., Yang, Y., Dong, S., and Yang, X.

(2018). Insights from multidimensional analyses of the pan-cancer DNA methylome

heterogeneity and the uncanonical CpG-gene associations. Int. J. Cancer

143, 2814–2827.

Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho,

D.M., Allison, K.R., Kellis, M., Collins, J.J., and Stolovitzky, G.; DREAM5 Consortium

(2012). Wisdom of crowds for robust gene network inference. Nat.

Methods 9, 796–804.

Margolin, A.A., Wang, K., Lim, W.K., Kustagi, M., Nemenman, I., and Califano,

A. (2006). Reverse engineering cellular networks. Nat. Protoc. 1, 662–671.

McKinney, W. (2010). Data Structures for Statistical Computing in Python. In

Proceedings of the 9th Python Science Conference, pp. 51–56.

Mermel, C.H., Schumacher, S.E., Hill, B., Meyerson, M.L., Beroukhim, R., and

Getz, G. (2011). GISTIC2.0 facilitates sensitive and confident localization of the

targets of focal somatic copy-number alteration in human cancers. Genome

Biol. 12, R41.

Millman, K.J., and Aivazis, M. (2011). Python for Scientists and Engineers.

Computing in Science & Engineering 13, 9–12.

Misra, S., Pamnany, K., and Aluru, S. (2015). Parallel Mutual Information Based

Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

IEEE/ACM Trans. Comput. Biol. Bioinformatics 12, 1008–1020.

Neph, S., Stergachis, A.B., Reynolds, A., Sandstrom, R., Borenstein, E., and

Stamatoyannopoulos, J.A. (2012a). Circuitry and dynamics of human tran-

scription factor regulatory networks. Cell 150, 1274–1286.

Neph, S., Vierstra, J., Stergachis, A.B., Reynolds, A.P., Haugen, E., Vernot, B.,

Thurman, R.E., John, S., Sandstrom, R., Johnson, A.K., et al. (2012b). An

expansive human regulatory lexicon encoded in transcription factor footprints.

Nature 489, 83–90.

Odersky, M., Altherr, P., Cremet, V., Dragos, I., Dubochet, G., Emir, B.,

McDirmid, S., Micheloud, S., Mihaylov, N., Schinz, M., et al. (2004). An Over-

view of the Scala Programming Language (Ecole Polytechnique Federale de

Lausanne).

Oliphant, T.E. (2015). Guide to NumPy, 2nd Edition (CreateSpace Independent

Publishing Platform).

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,

Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn:

Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.

R Core Team (2008). R: A language and environment for statistical computing.

R Foundation for Statistical Computing (Austria: Vienna). https://www.

R-project.org.

Ranzani, V., Rossetti, G., Panzeri, I., Arrigoni, A., Bonnal, R.J., Curti, S.,

Gruarin, P., Provasi, E., Sugliano, E., Marconi, M., et al. (2015). The long inter-

genic noncoding RNA landscape of human lymphocytes highlights the regula-

tion of T cell differentiation by linc-MAF-4. Nat. Immunol. 16, 318–325.

Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G.,

Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., and Sabeti, P.C. (2011). De-

tecting novel associations in large data sets. Science 334, 1518–1524.

Robertson, K.D. (2005). DNA methylation and human disease. Nat. Rev.

Genet. 6, 597–610.

Sathyamurthy, A., Johnson, K.R., Matson, K.J.E., Dobrott, C.I., Li, L., Ryba,

A.R., Bergman, T.B., Kelly, M.C., Kelley, M.W., and Levine, A.J. (2018).

Massively Parallel Single Nucleus Transcriptional Profiling Defines Spinal

Cord Neurons and Their Activity during Behavior. Cell Rep. 22, 2216–2225.

Schultz, M.D., He, Y., Whitaker, J.W., Hariharan, M., Mukamel, E.A., Leung, D.,

Rajagopal, N., Nery, J.R., Urich, M.A., Chen, H., et al. (2015). Human body epi-

genome maps reveal noncanonical DNA methylation variation. Nature 523,

212–216.

Sepulveda, J.L., Gutierrez-Pajares, J.L., Luna, A., Yao, Y., Tobias, J.W.,

Thomas, S., Woo, Y., Giorgi, F., Komissarova, E.V., Califano, A., et al.

(2016). High-definition CpG methylation of novel genes in gastric carcinogen-

esis identified by next-generation sequencing. Mod. Pathol. 29, 182–193.

Smith, Z.D., and Meissner, A. (2013). DNA methylation: roles in mammalian

development. Nat. Rev. Genet. 14, 204–220.

Sopko, R., Huang, D., Preston, N., Chua, G., Papp, B., Kafadar, K., Snyder, M.,

Oliver, S.G., Cyert, M., Hughes, T.R., et al. (2006). Mapping pathways and phe-

notypes by systematic gene overexpression. Mol. Cell 21, 319–330.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gil-

lette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al.

(2005). Gene set enrichment analysis: A knowledge-based approach for inter-

preting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102,

15545–15550.

Therneau, T.M., and Grambsch, P.M. (2000). Modeling Survival Data: Extend-

ing the Cox Model (New York: Springer).

Tripathi, S., Pohl, M.O., Zhou, Y., Rodriguez-Frandsen, A., Wang, G., Stein,

D.A., Moulton, H.M., DeJesus, P., Che, J., Mulder, L.C.F., et al. (2015).

Meta- and Orthogonal Integration of Influenza "OMICs" Data Defines a Role

for UBR4 in Virus Budding. Cell Host Microbe 18, 723–735.

van Rossum, G. (1995). Python Tutorial (the Netherlands: Centre for Mathe-

matics and Computer Science, Amsterdam).





















































































https://www.R-project.org

https://www.R-project.org








































Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A., and Luscombe, N.M. (2009). A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 10, 252–263.

Varley, K.E., Gertz, J., Bowling, K.M., Parker, S.L., Reddy, T.E., Pauli-Behn, F., Cross, M.K., Williams, B.A., Stamatoyannopoulos, J.A., Crawford, G.E., et al. (2013). Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res. 23, 555–567.

Wang, Y., Fan, P.S., and Kahaleh, B. (2006). Association between enhanced type I collagen expression and epigenetic repression of the FLI1 gene in scleroderma fibroblasts. Arthritis Rheum. 54, 2271–2279.

Wang, K., Alvarez, M.J., Bisikirska, B.C., Linding, R., Basso, K., Dalla Favera, R., and Califano, A. (2009a). Dissecting the interface between signaling and transcriptional regulation in human B cells. Pac. Symp. Biocomput. 14, 264–275.

Wang, K., Saito, M., Bisikirska, B.C., Alvarez, M.J., Lim, W.K., Rajbhandari, P., Shen, Q., Nemenman, I., Basso, K., Margolin, A.A., et al. (2009b). Genome-wide identification of post-translational modulators of transcription factor activity in human B cells. Nat. Biotechnol. 27, 829–839.

Wang, S., Sun, H., Ma, J., Zang, C., Wang, C., Wang, J., Tang, Q., Meyer, C.A., Zhang, Y., and Liu, X.S. (2013). Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nat. Protoc. 8, 2502–2515.

Yang, Y., Fear, J., Hu, J., Haecker, I., Zhou, L., Renne, R., Bloom, D., and McIntyre, L.M. (2014). Leveraging biological replicates to improve analysis in ChIP-seq experiments. Comput. Struct. Biotechnol. J. 9, e201401002.

Zheng, X., Zhang, N., Wu, H.J., and Wu, H. (2017). Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol. 18, 17.

Zhu, H., Wang, G., and Qian, J. (2016). Transcription factors as readers and effectors of DNA methylation. Nat. Rev. Genet. 17, 551–565.

STAR+METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited Data

DNA methylation data (level 3) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
Gene expression data (level 3) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
DNA copy number data (level 3) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
Somatic mutation data (level 2) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
Transcription Start Sites genomic annotation (hg19) | UCSC Table Browser | https://genome.ucsc.edu/cgi-bin/hgTables
ChIP-seq peak data | ENCODE Consortium | https://www.encodeproject.org/

Software and Algorithms

GISTIC 2 | Mermel et al., 2011 | https://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=216&p=t
The estimation of the mutual information | Kraskov et al., 2004 | N/A
Conditional mutual information | Frenzel and Pompe, 2007 | N/A
MINDy algorithm | Wang et al., 2009b | N/A
Metascape | Tripathi et al., 2015 | http://metascape.org
R package ‘survival’ | Therneau and Grambsch, 2000 | https://cran.r-project.org/web/packages/survival/index.html
MATLAB | The MathWorks, Inc. | https://www.mathworks.com/products/matlab.html
Python | van Rossum, 1995 | https://www.python.org/
Numpy | Oliphant, 2015 | http://www.numpy.org/
Pandas | McKinney, 2010 | https://pandas.pydata.org/
Matplotlib | Hunter, 2007 | https://matplotlib.org/
Scipy | Millman and Aivazis, 2011 | https://www.scipy.org/
Java 8 SE | Oracle Corporation | https://www.oracle.com/java/technologies/java-se.html
Scala | Odersky et al., 2004 | https://www.scala-lang.org/
R | R Core Team, 2008 | http://www.R-project.org
F# | F# Software Foundation | https://fsharp.org/
Node.js | Node.js Foundation | https://nodejs.org/en/
Scikit-learn | Pedregosa et al., 2011 | http://scikit-learn.org/stable/
GSEA software | Subramanian et al., 2005 | https://www.broadinstitute.org/gsea/

Other

Code used for this publication | This paper | Supplementary file 8
TF list | This paper | Supp file
Cancer-related gene list | NCBI (Entrez) Gene Database | https://www.ncbi.nlm.nih.gov/gene

CONTACT FOR REAGENT AND RESOURCE SHARING

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Xuerui

Yang ([email protected]).

METHOD DETAILS

Data collection and pre-processing

Somatic mutation data of tumors (level 2), and gene expression, DNA copy number, and DNA methylation data (level 3) of tumor and adjacent normal samples from 21 cancer types, were downloaded from TCGA (Table S1). Each cancer type has at least 150 independent tumor samples for which all 3 types of data (gene expression, DNA copy number, and CpG methylation) are available.

Cell Reports 26, 3461–3474.e1–e5, March 19, 2019 e1



Specifically, RNA-seq V2 data (level 3) were used for the gene expression profiles of all 21 cancer types. These are gene-level read counts that TCGA had already processed and normalized across all the samples of each cancer type with the upper-quartile normalization method.

For the DNA copy number data, we used the Affymetrix Genome-Wide Human SNP Array 6.0 segmentation data provided by TCGA (level 3). In detail, the “nocnv.seg” files for each sample were collected from TCGA to capture the somatic CNV. GISTIC 2 (Mermel et al., 2011) was then used with default parameters and the “-savegene” option to recover the copy numbers at the gene level. From the results of GISTIC 2, we used the raw continuous CNV value for each gene for our downstream analyses.

Throughout the present study, only the promoter CpG sites within ±2.5 kb of the TSSs were considered. Genomic annotations of the TSSs were obtained from the UCSC Table Browser (hg19). For genes with multiple TSSs, all the CpG sites falling within ±2.5 kb of any of the TSSs were considered as promoter CpG sites. A small number of CpG sites were found to be close to the TSSs (within ±2.5 kb) of more than one gene, and they were therefore tested for all of the corresponding genes. All the DNA methylation data (beta values) used in the present study were generated by TCGA on the Infinium HumanMethylation450 BeadChip platform, which covered 193,969 CpG sites located in the DNA promoter regions of 23,837 genes.
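The promoter assignment rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `promoter_cpgs`, the dictionary layouts, and the toy coordinates are hypothetical and are not taken from the authors' scripts.

```python
from collections import defaultdict

WINDOW = 2500  # +/- 2.5 kb around each TSS, as stated in the text


def promoter_cpgs(tss_by_gene, cpg_positions, window=WINDOW):
    """Map each gene to the CpG probes lying within `window` bp of any of its TSSs.

    tss_by_gene:   {gene: [(chrom, tss_position), ...]}   (hypothetical layout)
    cpg_positions: {probe_id: (chrom, position)}          (hypothetical layout)
    """
    # Index the probes by chromosome so each TSS is only compared with probes
    # on the same chromosome.
    by_chrom = defaultdict(list)
    for probe, (chrom, pos) in cpg_positions.items():
        by_chrom[chrom].append((pos, probe))

    gene_to_cpgs = {}
    for gene, tss_list in tss_by_gene.items():
        hits = set()
        for chrom, tss in tss_list:
            for pos, probe in by_chrom[chrom]:
                if abs(pos - tss) <= window:
                    hits.add(probe)
        gene_to_cpgs[gene] = sorted(hits)
    return gene_to_cpgs


# Toy example: cg02 falls within 2.5 kb of a TSS of both genes,
# so it is assigned to (and would be tested for) both of them.
tss = {"GENE_A": [("chr1", 10_000), ("chr1", 20_000)], "GENE_B": [("chr1", 14_000)]}
cpgs = {
    "cg01": ("chr1", 8_000),
    "cg02": ("chr1", 12_100),
    "cg03": ("chr1", 30_000),
    "cg04": ("chr2", 10_000),
}
mapping = promoter_cpgs(tss, cpgs)
```

A linear scan per TSS is adequate for a sketch; a sorted index with binary search would be the natural choice at the scale of the 450K array.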

For each cancer type, genes with no or low expression in more than half of the tumor samples (median of normalized read counts < 10) were discarded. The CpG sites with no methylation readout (beta value) in more than a quarter of the tumor samples were removed. If any tumor sample bore somatic mutation(s) at a CpG site, the sample was removed from the methylation profile of this particular CpG site for the whole study. However, such cases are extremely rare (within each cancer type, fewer than 0.5% of the promoter CpG sites are mutated in at least one tumor sample). Statistics of the genes and CpG sites that passed the filters are provided in (Liu et al., 2018).
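As a minimal sketch of the two filters just described, assuming each profile is a plain Python list with `None` marking a missing beta value (the function names and this data layout are assumptions for illustration only):

```python
from statistics import median


def keep_gene(counts, min_median=10):
    """Keep a gene unless the median of its normalized read counts is below the cutoff."""
    return median(counts) >= min_median


def keep_cpg(betas, max_missing_fraction=0.25):
    """Keep a CpG site unless more than a quarter of the samples lack a beta value."""
    missing = sum(1 for b in betas if b is None)
    return missing / len(betas) <= max_missing_fraction
```

Samples carrying a somatic mutation at a CpG site could be handled in the same layout by setting that sample's entry to `None` before calling `keep_cpg`.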

ChIP-seq data collection and analysis

ChIP-seq peak files for 161 different TFs were collected from the ENCODE Consortium (Neph et al., 2012b). We only used data from cancer cell lines in ENCODE, since the current study is focused on the DNA methylome-mediated regulations in cancers. Some TFs were profiled by ChIP-seq in multiple cancer cell lines, but mostly no more than 3. In cases of multiple ChIP-seq files for one TF in one or more cell types, the files of the same TF were merged by taking the largest signal value of the ChIP-seq peaks at regions where the peaks overlap.
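One plausible reading of this merging rule is sketched below: peaks pooled from all files of one TF are merged into their union interval whenever they overlap, and the merged interval keeps the largest signal. The function name, the tuple layout, and the choice to treat touching intervals as overlapping are assumptions, not details from the paper.

```python
def merge_peaks(peaks):
    """Merge overlapping peaks on one chromosome, keeping the largest signal.

    peaks: iterable of (start, end, signal) tuples pooled from all ChIP-seq
    files of one TF (hypothetical layout). Returns a list sorted by start.
    """
    merged = []
    for start, end, signal in sorted(peaks):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it and keep the max signal.
            prev_start, prev_end, prev_signal = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_signal, signal))
        else:
            merged.append((start, end, signal))
    return merged
```

For example, `merge_peaks([(100, 200, 5.0), (150, 250, 8.0), (400, 500, 3.0)])` collapses the first two peaks into one interval carrying the signal 8.0 and leaves the third untouched.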

Genome-wide identifications of the TF-target regulatory circuits modulated by CpG site methylation

The overall pipeline for genome-wide identification of the TF-target regulatory circuits that are modulated by CpG site methylation is illustrated in Figure 1A. The dynamic range of the data is essential for the statistical analysis used in our methodology, so in each cancer type, we removed the promoter CpG sites with relatively stable methylation levels across tumors (standard deviation of the beta values < 0.1), as well as poorly (normalized read count < 10) or stably (coefficient of variation of the read counts < 1.5) expressed genes. Filtering of the promoter CpG sites and the genes has been done and published before (Liu et al., 2018).
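The dynamic-range filters can be illustrated with the standard library alone. The sketch below assumes the population standard deviation and the median as the expression summary; the text does not state which variants were used, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev


def is_variable_cpg(betas, min_sd=0.1):
    """True if the beta values vary enough across tumors (SD >= 0.1)."""
    return pstdev(betas) >= min_sd


def is_variable_gene(counts, min_count=10, min_cv=1.5):
    """True if the gene is expressed (median count >= 10) and variable (CV >= 1.5)."""
    if median(counts) < min_count:
        return False
    # Coefficient of variation = standard deviation / mean.
    return pstdev(counts) / mean(counts) >= min_cv
```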

To evaluate the overall false discovery rates in the later steps, we generated artificial data entries that are true negatives. Specifically, we randomly picked 10% of the genes and sample-shuffled their promoter CpG methylation and CNV profiles. Their gene expression profiles were left intact, without sample permutation. These simulated data entries were spiked into the real datasets of methylation, CNV, and gene expression, accordingly. Theoretically, any positive hit in the later steps involving these artificial CpG or CNV entries would be a false discovery.
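The spike-in construction can be sketched as follows. The sketch assumes one methylation and one CNV profile per gene and applies the same sample permutation to both; whether the authors shuffled the two profiles jointly or independently is not stated, and the function name and `_spike` suffix are hypothetical.

```python
import random


def make_spike_ins(methylation, cnv, fraction=0.10, seed=0):
    """Return sample-shuffled methylation/CNV profiles for a random subset of genes.

    methylation, cnv: {gene: [value per sample]}, samples in the same order.
    Expression profiles are not touched, so any association detected with a
    spike-in entry is spurious by construction.
    """
    rng = random.Random(seed)
    genes = sorted(methylation)
    chosen = rng.sample(genes, max(1, round(fraction * len(genes))))
    spikes = {}
    for gene in chosen:
        order = list(range(len(methylation[gene])))
        rng.shuffle(order)  # random permutation of the sample order
        spikes[gene + "_spike"] = {
            "methylation": [methylation[gene][i] for i in order],
            "cnv": [cnv[gene][i] for i in order],
        }
    return spikes
```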

In the present study, to find specific types of regulation within regulatory relationships that are usually multiplex and non-linear, we used the concepts of mutual information and conditional mutual information (cMI) to assess the associations between multiple factors. We adopted the Kraskov approach for estimation of the mutual information, which has been widely used in multiple areas (Frenzel and Pompe, 2007; Misra et al., 2015; Reshef et al., 2011). Because DNA copy number is also a major defining factor of the gene expression, we designed a 2-step procedure (Figure 1A). The first step collects the TF-target gene associations that are modulated by either CpG methylation, DNA copy number, or both, and the second step filters the TF-target combinations further, in a largely reduced search space, for the transcriptional regulation events that involve CpG methylation.
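For readers unfamiliar with nearest-neighbour estimators, the following pure-Python sketch shows the general shape of a Kraskov-style MI estimate and a Frenzel-Pompe-style cMI estimate. It is a brute-force O(N²) illustration, not the authors' implementation (which was written in MATLAB, C++, and Java); the function names, the choice k = 3, and the use of the max-norm are assumptions made for this example.

```python
EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{m=1}^{n-1} 1/m."""
    return -EULER_GAMMA + sum(1.0 / m for m in range(1, n))


def chebyshev(p, q):
    """Max-norm distance between two points given as tuples."""
    return max(abs(a - b) for a, b in zip(p, q))


def count_within(points, i, eps):
    """Number of other points strictly closer than eps to points[i]."""
    return sum(1 for j, q in enumerate(points) if j != i and chebyshev(points[i], q) < eps)


def kth_neighbour_distance(points, i, k):
    """Max-norm distance from points[i] to its k-th nearest neighbour."""
    return sorted(chebyshev(points[i], q) for j, q in enumerate(points) if j != i)[k - 1]


def ksg_mi(x, y, k=3):
    """Kraskov-style estimate of I(X;Y) in nats for two 1-D samples."""
    n = len(x)
    joint = list(zip(x, y))
    xs, ys = [(v,) for v in x], [(v,) for v in y]
    total = 0.0
    for i in range(n):
        eps = kth_neighbour_distance(joint, i, k)
        total += digamma_int(count_within(xs, i, eps) + 1)
        total += digamma_int(count_within(ys, i, eps) + 1)
    return digamma_int(k) + digamma_int(n) - total / n


def ksg_cmi(x, y, z, k=3):
    """Frenzel-Pompe-style estimate of I(X;Y|Z) in nats; z holds one tuple per sample,
    e.g. (CNV, CpG beta value) of the target gene."""
    n = len(x)
    joint = [(x[i], y[i], *z[i]) for i in range(n)]
    xz = [(x[i], *z[i]) for i in range(n)]
    yz = [(y[i], *z[i]) for i in range(n)]
    zz = [tuple(row) for row in z]
    total = 0.0
    for i in range(n):
        eps = kth_neighbour_distance(joint, i, k)
        total += (digamma_int(count_within(zz, i, eps) + 1)
                  - digamma_int(count_within(xz, i, eps) + 1)
                  - digamma_int(count_within(yz, i, eps) + 1))
    return digamma_int(k) + total / n
```

With a few hundred samples the brute-force neighbour search is fast enough for experimentation; a genome-wide scan would need a spatial index such as a k-d tree.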

We first put together a list of TFs with gene annotation information from the Gene Ontology database, the NCBI (Entrez) Gene database, and several previous publications. More than 1800 genes were finally annotated as TFs or putative TFs. Specifically, in the first step, for any possible combination of a TF $i$ ($TF_i$), a potential target gene $j$ ($Gene_j$) and its DNA copy number ($CNV_j$), as well as a CpG site $m$ within the promoter region of the target gene ($CpG_{jm}$), we calculated the cMI between the TF and the target gene with the CpG methylation and the DNA copy number of the target gene as the condition: $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ (Kraskov et al., 2004). The mutual information (MI) between the TF and the target gene, $\mathrm{MI}(TF_i; Gene_j)$, was also calculated. Because the absolute value of cMI depends on the MI, it is inappropriate to directly compare the cMIs from different TF-gene combinations. Therefore, we adapted the methodology used by the MINDy algorithm (Wang et al., 2009a, 2009b) to derive a P value-estimating function, which evaluates the statistical significance of the additional information brought by the two conditions ($CNV_j$, $CpG_{jm}$) as a whole based on both values of MI and cMI.

Briefly, we first randomized the sample orders of the CpG methylation and CNV profiles, and then we calculated $\mathrm{MI}(TF; Gene)$ and $\mathrm{cMI}(TF; Gene \mid CNV, CpG)$ for 10,000 randomly picked TF-gene pairs. These 10,000 cMI/MI entries calculated from the randomized data were then sorted and evenly allocated into 100 bins according to the MI values. This generated a null distribution of the cMIs in each bin for a small range of the MIs. Next, from the real data, each TF-gene pair can be allocated to one of these 100 bins according to the value of $\mathrm{MI}(TF_i; Gene_j)$, and then the P value of each $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ can be estimated by comparison with the null distribution of the particular bin. Finally, the P value cut-off was set so that only 1% of the cMI entries involving the true negative spike-ins in the original data could pass through (Figure 1A). Eventually, 10%–20% of all the possible combinations $(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ genome-wide passed this first filter, resulting in a large reduction of the search space for the following step.
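The MI-binned null distribution of the first filter can be sketched as below. The (MI, cMI) pairs would come from the randomized data; here they are simply passed in as a list. The function names and the handling of bin boundaries are assumptions made for illustration.

```python
import bisect


def build_null_bins(null_pairs, n_bins=100):
    """Sort (MI, cMI) pairs from randomized data by MI and split them into equal-sized bins.

    Returns (upper_edges, bins): the largest MI in each bin, and the sorted
    null cMI values that fell into that bin.
    """
    ordered = sorted(null_pairs)
    size = len(ordered) // n_bins
    upper_edges, bins = [], []
    for b in range(n_bins):
        # The last bin absorbs any remainder when the count is not divisible by n_bins.
        chunk = ordered[b * size:(b + 1) * size] if b < n_bins - 1 else ordered[b * size:]
        upper_edges.append(chunk[-1][0])
        bins.append(sorted(c for _, c in chunk))
    return upper_edges, bins


def empirical_p(mi, cmi, upper_edges, bins):
    """Fraction of null cMI values, in the bin matching `mi`, that are >= the observed cMI."""
    b = min(bisect.bisect_left(upper_edges, mi), len(bins) - 1)
    null = bins[b]
    n_ge = len(null) - bisect.bisect_left(null, cmi)
    return n_ge / len(null)
```

With 10,000 null pairs and 100 bins, each bin holds 100 null cMI values, so the smallest non-zero P value this scheme can report is 0.01.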

The second step of our pipeline was set to further identify the $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ entries that significantly depend on the CpG methylation (rather than the CNV only). Specifically, the TF-gene-CpG combinations (triplets) remaining after the first filter were challenged by sample shuffling of the DNA methylation profiles. For each triplet, a null distribution was established from the values of $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ calculated with the methylation data randomized 1000 times. Because estimating mutual information at such a scale is computationally expensive, the number of sample permutations was set to 1000 as a balance between computational cost and resolution of FDR estimation. The final P value can therefore be estimated by comparing the real $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ with this null distribution. The final P value cut-off was set to 0.05, i.e., no more than 50 out of the 1000 sample-permuted cMIs are larger than the cMI from the real data, for each triplet (Figure 1A). This process ensures that in the final collection of the TF-gene-CpG combinations (triplets), the DNA methylation profile is truly a determining factor for the $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$. In fact, with this cut-off, almost all of the 1% true negative spike-ins that passed the filter in the first step were discarded, suggesting that the P value cut-off is a reasonable choice.
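The permutation test of the second step has the following general shape. In the study the statistic is the cMI; in this sketch it is any callable with the signature `statistic(tf, gene, cnv, cpg)`, and both the function name and that signature are hypothetical.

```python
import random


def permutation_p_value(statistic, tf, gene, cnv, cpg, n_perm=1000, seed=0):
    """One-sided permutation P value for the dependence of `statistic` on the CpG profile.

    Only the CpG methylation profile is shuffled across samples; the TF, target
    gene, and CNV profiles keep their original sample order.
    """
    rng = random.Random(seed)
    observed = statistic(tf, gene, cnv, cpg)
    shuffled = list(cpg)
    n_larger = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(tf, gene, cnv, shuffled) > observed:
            n_larger += 1
    # A triplet passes the 0.05 cut-off when at most 50 of 1000 permuted values are larger.
    return n_larger / n_perm
```

Shuffling only the methylation profile keeps the TF, gene, and CNV relationships intact, so the null distribution isolates the contribution of the CpG site.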

Finally, we suspected that there could be some indirect TF-target associations in the results. For example, if TF1 regulates TF2, which directly regulates its target gene with a promoter CpG site involved, then it is possible that our pipeline will identify two circuits, TF1-target and TF2-target, both depending on the same CpG site. Here, the TF1-target circuit would be a false positive (an indirect regulation circuit) and should be removed. Therefore, we went through the results from step 2 and removed these potential false positives due to indirect TF-target regulations. Approximately 10% of the previously identified circuits were removed.
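The text does not say how the TF1-to-TF2 relationships were established, so the sketch below simply takes them as an input set of (regulator, regulated TF) pairs. The function name and that input are assumptions; only the pruning logic follows the description above.

```python
def remove_indirect(triplets, tf_regulates):
    """Drop (TF1, target, CpG) when a TF2 regulated by TF1 shares the same (target, CpG).

    triplets:     set of (tf, target, cpg) tuples from step 2.
    tf_regulates: set of (tf1, tf2) pairs meaning "tf1 regulates tf2" (assumed input).
    """
    # Group the TFs by the (target, CpG) pair they were linked to.
    by_site = {}
    for tf, target, cpg in triplets:
        by_site.setdefault((target, cpg), set()).add(tf)

    kept = set()
    for tf, target, cpg in triplets:
        partners = by_site[(target, cpg)] - {tf}
        if any((tf, other) in tf_regulates for other in partners):
            continue  # tf acts on this target through another TF: treat as indirect
        kept.add((tf, target, cpg))
    return kept
```

Note that under this rule two TFs that regulate each other would both be dropped for a shared (target, CpG) pair, which a real implementation might want to handle separately.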

The same procedure was performed with the datasets from each of the 21 cancer types, and the results are summarized as 21

cancer context-specific DNA MeTRNs.

Implementation of the pipeline

The whole pipeline consists of several sequential tasks, including 1) data preparation, 2) MI and cMI calculation, 3) filter 1 for the TF-target-CpG combinations, and 4) filter 2 for the MeTRN triplets. All the computational tasks were implemented with MATLAB, C++, and Java. All the scripts organized as a pipeline can be downloaded from our website (http://labyang.com/data/MeTRN_scripts.rar).

Functional enrichment analysis

For each type of cancer, the genes that are heavily dependent on the DNA methylation-modulated transcriptional circuitry ($R^2$ from the ordinary linear regression model Target ~ TF + CpG greater than 0.4; see the data provided in Data S3), and the genes highly dependent on the CNV ($R^2$ from the model Target ~ CNV greater than 0.4; see the data provided in Data S3) were selected. Gene Ontology (GO), Reactome, and KEGG enrichment analyses were conducted with different gene sets using the online tool Metascape (http://metascape.org). The terms of biological processes that are significantly enriched in the gene sets of each cancer type were collected and arranged according to their enrichment patterns across different cancer types. The list of cancer-related genes (1265 in total) was obtained from the NCBI (Entrez) Gene database.

Patient classification and survival analysis

In each MeTRN, the genes that are strongly regulated by the TF-CpG circuits ($R^2$ from the linear combination model Target ~ TF + CpG greater than 0.5) were selected, and the CpG sites and the TFs taking part in the regulatory circuits of each MeTRN were collected as patient classifiers. The TF expressions (TPM values) were linearly scaled to [0, 1], the same range as the DNA methylation data (beta values). In each cancer type, the methylation profiles of the CpG sites and the expressions of the TFs were used for an unsupervised hierarchical clustering of the patients. We used an unsupervised hierarchical tree cutting strategy to determine the optimal number of clusters k before the survival analysis. Specifically, we used Silhouette-based cutting, which has been frequently used for determination of cluster numbers based on high-throughput omics data (Cancer Genome Atlas Network, 2012, 2015; Cancer Genome Atlas Research Network 2012, 2014; Ranzani et al., 2015; Sathyamurthy et al., 2018). In most of these studies, the numbers of patient subgroups were eventually determined based on the clustering results and with the help of prior knowledge about the cancer subtypes, prognostic differences, and evidence from other sources. It appears that for most of the cancers, more than 3 subtypes have been defined by TCGA. Therefore, we set k to be greater than 3 and used a Silhouette-based approach to determine the optimal number of subtypes. This generalized strategy for all 21 cancers may not yield the best result for some particular cancer types. However, the purpose of the clustering analyses was just to show that coupled TFs and CpG sites can be used to define prognostically different patient subgroups. Cancer type-specific survival analyses were then performed for each of the patient subgroups resulting from the clustering analysis. The Kaplan–Meier estimator and the log-rank test were used for statistical assessments of the survival curves, and correction for multiple testing was done with the two-stage linear step-up procedure when applicable. The R package ‘survival’ was used for comparison of the patient survival rates among different subgroups.
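To make the survival estimate concrete, here is a minimal pure-Python version of the Kaplan-Meier product-limit estimator. It is only an illustration of the technique; the study itself relied on the R package ‘survival’ and the Python package “lifelines”, and the function name and input layout below are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up time per patient.
    events: 1 if the death was observed at that time, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve
```

Censored patients never lower the curve directly; they only shrink the risk set used at later event times.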



Differential expression and differential methylation analysis

A paired Student’s t test was used to assess the mRNA differential expression of the genes in different sample groups. Default arguments for t.test() in R were used, which provided estimates of the statistical significance (P value) of the differential expression. A volcano plot was generated to illustrate the expression fold change and the adjusted P value for each gene.

The Wilcoxon signed-rank test (the paired counterpart of the Mann-Whitney test) was used to assess the statistical significance of the difference in methylation level between different groups of tumor samples, for each CpG site. The exact P value was computed instead of a normal approximation. The medians of the methylation differences (tumor – normal) were also computed. A volcano plot was generated to illustrate the median of the methylation difference and the adjusted P value for each CpG site.

QUANTIFICATION AND STATISTICAL ANALYSIS

Software and statistical analysis

In the following analyses, we mainly used Python 3 (version 3.6), R (version 3.3), and Scala (version 2.11.8) as the primary tools.

Figure 2

Values are extracted directly from the triplet files (Data S1) of 21 cancer types.

Figure 3

A and B. The number of data points for each box is the number of lines in the corresponding file in Data S3. The linear regression is done using the ordinary least squares (OLS) approach (https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) in the Python package “Statsmodels.” Because some target genes are regulated by many TFs and/or CpGs, the OLS method might fail when there are fewer samples than variables (TFs and CpG sites). If a target gene has an invalid status in any of the “Target ~ TF,” “Target ~ CpG,” “Target ~ TF + CpG,” or “Target ~ TF + CpG + CNV” linear regression models, it is excluded from the figure. In the figure, the point in a box represents the mean. Points outside the box and whiskers are outliers.

C. We selected target genes with an adjusted R-squared value of the “Target ~ TF + CpG” (MeTRN) model greater than 0.4, submitted them to Metascape (http://metascape.org/) to perform “Express analysis,” and retrieved the outcome. A Python script was used to reorder the rows and columns of the result matrix; the values were not modified.

Figure 4

A. Only cancer-related genes are shown in this figure. The data is from Figure 3A.

B. We sorted cancer-related target genes with the median of their adjusted R-squared value across 21 cancer types in descending

order and selected 10 genes of interest among the top.

C and D. We directly count the triplet file (Data S1) to get the values.

Figure 5

A. Adjusted R-squared values are described in Figure 3. The similarity between cancers is calculated using the R-squared values of genes present in both cancers.

Figure 6

A. The clustering for rows and columns is done using Euclidean distance and the Ward linkage function. Gene expressions are normalized to [0, 1] by the function $f(x) = (x - A) / (B - A)$, where $A$ and $B$ are the minimum and maximum of the expression values of that gene across samples. The Python package “Scipy” is used to calculate distances and linkages. The clusters are cut at the maximum silhouette value, with the restriction that the number of clusters should be more than 3 but less than 10. The silhouette is calculated using the “sklearn.metrics.silhouette_score” function from the Python package “scikit-learn.”

B, C, D. Kaplan-Meier estimator is used to generate survival function. Log-rank test is used to compare the survival distributions

between two clusters. Cluster size: N(k1) = 175, N(k2) = 55, N(k3) = 10, N(k4) = 32. Python package ‘‘lifelines’’ is used to perform the

survival analysis.
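The min-max scaling and the silhouette-based choice of the cluster number described for Figure 6A can be sketched without third-party packages. The study used “scikit-learn” for the silhouette; the pure-Python functions below are stand-ins with hypothetical names, and `labelings` is assumed to hold one label list per candidate k, as produced by cutting a hierarchical tree.

```python
import math


def min_max(values):
    """Scale values to [0, 1] with f(x) = (x - A) / (B - A), A = min and B = max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs at least 2 clusters)."""
    n = len(points)
    scores = []
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]                              # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in sums if c != own)   # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / n


def choose_k(points, labelings, k_min=4, k_max=9):
    """Pick the cluster count with the highest silhouette among k_min..k_max.

    labelings: {k: [cluster label per sample]}; the default range encodes
    "more than 3 but less than 10" clusters.
    """
    allowed = {k: lab for k, lab in labelings.items() if k_min <= k <= k_max}
    return max(allowed, key=lambda k: silhouette_score(points, allowed[k]))
```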

Figure 7

A. Differential analysis of CpG sites and genes. For CpG sites, the difference is calculated between the median of methylation beta values in samples of cluster k3 (n = 10) and the median of methylation beta values in samples of cluster k1 (n = 175). The fold change of TFs is calculated by dividing the median of gene expression values of cluster k3 by that of cluster k1. The P values here are calculated by the Wilcoxon rank-sum test (Python package Scipy, method: “scipy.stats.ranksums”).

C. GSEA 3.0 from Broad Institute is used to perform gene set enrichment analysis of target genes. We set ‘‘Metric for ranking

genes’’ to ‘‘tTest’’ and ‘‘Collapse dataset to gene symbols’’ to false. All remaining critical arguments for GSEA are untouched.

D. We used Metascape to perform the Gene Ontology term enrichment. The number of genes is shown in C.

Figure S1

A. Violin plot created by the Python package “seaborn.” The dashed lines are the 75%, 50%, and 25% percentiles. The plot is generated with the “cut” argument set to 0 to prevent the violin plot from extending beyond the range of the underlying data.

B. Each point represents a TF. The color of it reflects the number of targets of this TF. The value of the point reflects the percentage

of predicted target genes which are present in ENCODE ChIP-seq data of the corresponding TF.

C. For each TF (point), we perform the Wilcoxon rank-sum test to detect the difference in ChIP-seq signal between CpG sites in the MeTRN and those that are not in the MeTRN. Because the sample size is relatively large and the distribution of the samples is far from known distributions, we chose this non-parametric hypothesis test.


Figure S2

A. Each box in the figure contains the same number of target genes as in Figure 3A. Linear regression models with different predictor variables were used to fit the expression profile of each target gene in the MeTRN for each cancer type. The coefficient of determination ($R^2$) for each gene was calculated from the model of Target ~ TFs + CpG sites, Target ~ TFs, or Target ~ CpG with data from a particular cancer type. Finally, boxplots were prepared to show the distributions of the $R^2$ values of all the genes with these 3 different models for each cancer type.

B. The hypergeometric test is used to calculate the enrichment score. Because different cancers have different sets of target genes, the intersection of all possible target genes of both cancer types is taken first as the population. Then, we treat the target genes in one cancer type as the “successes,” and the number of target genes of the other cancer type as the size of a draw.

C, D. They are generated in the same way as Figure 3C.
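The hypergeometric enrichment test described for Figure S2B can be written with the standard library as an upper-tail probability. The function name and argument order are assumptions for this sketch, and the text does not state which library the authors used for this step.

```python
from math import comb


def hypergeom_sf(overlap, population, successes, draws):
    """P(X >= overlap) for X ~ Hypergeometric(population, successes, draws).

    population: number of genes that could be targets in both cancer types.
    successes:  number of target genes in the first cancer type.
    draws:      number of target genes in the second cancer type.
    overlap:    number of target genes shared by the two cancer types.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total
```

For instance, with a population of 10 genes, 4 targets in one cancer type, 3 in the other, and an overlap of 2, the upper-tail probability is (36 + 4) / 120 = 1/3.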

Figure S3

These figures are drawn in the same way as Figure 3B, except that they show the values of target genes from the MeTRNs of the other 20 cancer types.

Figures S4–S7

The heatmaps of clustering results and survival analysis are generated in the same way as Figure 6.

Strategy for randomization

In this study, we used the uniform distribution to randomize the order of samples. No stratification is applied; all samples of a cancer type are treated equally.

Data exclusion

We removed genes and CpG sites from the dataset if they did not meet specific criteria, which are described in “Data collection and pre-processing” in the Method Details section.

We removed target genes for which the ordinary least squares linear regression could not be fitted in the R-squared value related analysis.

In survival analysis, we excluded all samples (patients) whose clinical data are unavailable.

DATA AND SOFTWARE AVAILABILITY

All the analysis results and datasets in this article are being provided as supplemental data files (Data S1, S2, S3, S4, and S5). The

algorithms used in this article can be downloaded from http://labyang.com/data/MeTRN_scripts.rar. A web-based searchable data

portal of the MeTRNs is available at http://labyang.com/resources/metrn/table.html.

‘‘Triplets’’ in the MeTRN networks of 21 cancers: Data S1

Target genes of the cancer-related TFs in 21 MeTRNs: Data S2

Genes sorted by the coefficient of determination ($R^2$) from the linear combination model (Target ~ TF + CpG) or by the $R^2$ from the model (Target ~ CNV): Data S3

Dependencies of the cancer genes on the CNV or the DNA methylation-modulated transcriptional regulation circuits in the

MeTRNs: Data S4

Differential expression of the 598 target genes, in k3 versus k1 of KIRP, of the selected TFs and CpG sites in the KIRP MeTRN:

Data S5




Cell Reports, Volume 26

Supplemental Information

Dependency of the Cancer-Specific Transcriptional

Regulation Circuitry on the Promoter DNA Methylome

Yu Liu, Yang Liu, Rongyao Huang, Wanlu Song, Jiawei Wang, Zhengtao Xiao, Shengcheng Dong, Yang Yang, and Xuerui Yang


SUPPLEMENTARY FIGURES AND TABLES

Dependency of the cancer-specific transcriptional regulation circuitry on the

promoter DNA methylome

Yu Liu1-4,# , Yang Liu1,3-5,#, Rongyao Huang1,3,4,#, Wanlu Song1-4, Jiawei Wang1,3,4, Zhengtao Xiao1-4,

Shengcheng Dong1,3,4, Yang Yang1,3,4, and Xuerui Yang1-4,*

1MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China

2Tsinghua-Peking Joint Center for Life Sciences, Beijing 100084, China

3Center for Synthetic & Systems Biology, Tsinghua University, Beijing 100084, China

4School of Life Sciences, Tsinghua University, Beijing 100084, China

5Joint Graduate Program of Peking-Tsinghua-National Institute of Biological Science, Tsinghua

University, Beijing 100084, China.

# These authors contributed equally.

* Correspondence should be addressed to X.Y. ([email protected], +86-10-62783943).



SUPPLEMENTARY FIGURES

Figure S1

[Figure S1 image, panels A to C. Panel C axes: x, ratio (log2) of TF-specific ChIP-seq signals on MeTRN CpG over non-MeTRN CpG sites; y, -log10(adjusted Pv).]

Figure S1. Detailed statistics and cross-validations of the 21 cancer type-specific MeTRNs. Related to Figure 1.

(A) Violin plots showing, for each cancer type-specific MeTRN, the distribution of the numbers of target genes of the TFs, without incorporating the information about the CpG sites involved in the TF-target gene circuits, and, with the CpG sites taken into account, the distribution of the numbers of promoter CpG-gene pairs for the TFs. (B) Validation rates of the TF targets predicted by the MeTRNs. Each dot represents a TF and shows the percentage of the predicted targets in a MeTRN that are supported by the ChIP-seq data of that particular TF. The color of each dot represents the number of predicted targets of the particular TF in the MeTRN. (C) Each dot represents a TF and shows the difference between the ChIP-seq peak signals spanning the promoter CpG sites of the target genes predicted by the MeTRNs for that particular TF and the peak signals spanning the other promoter CpG sites of the same groups of genes that are not included in the MeTRNs. The log2 of the ratios (average ChIP-seq peak signals spanning the CpG sites in the MeTRNs over the average signals spanning the sites not in the MeTRNs) and the statistical significance (-log10 of the P-values from a t-test) of the differences are provided on the X-axis and Y-axis, respectively.
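The two per-TF quantities plotted in panels B and C can be illustrated with a minimal sketch. It assumes that the predicted targets and the ChIP-seq-supported genes are available as sets of gene names, and that the ChIP-seq peak signals are plain lists of numbers; the function names and the example values are hypothetical and are not taken from the MeTRN scripts.

```python
import math

def validation_rate(predicted_targets, chip_supported_genes):
    """Percentage of predicted targets of one TF that are supported by its ChIP-seq data."""
    predicted = set(predicted_targets)
    supported = predicted & set(chip_supported_genes)
    return 100.0 * len(supported) / len(predicted)

def log2_signal_ratio(signals_in_metrn, signals_not_in_metrn):
    """log2 of (mean signal on MeTRN CpG sites / mean signal on the other promoter CpG sites)."""
    mean_in = sum(signals_in_metrn) / len(signals_in_metrn)
    mean_out = sum(signals_not_in_metrn) / len(signals_not_in_metrn)
    return math.log2(mean_in / mean_out)

# Hypothetical example for a single TF.
rate = validation_rate({"GENE1", "GENE2", "GENE3", "GENE4"}, {"GENE2", "GENE4", "GENE9"})
ratio = log2_signal_ratio([4.0, 4.0], [1.0, 1.0])
print(rate, ratio)
```

A positive log2 ratio in this sketch corresponds to stronger average ChIP-seq signal over the CpG sites included in the MeTRN than over the other promoter CpG sites of the same genes.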


Figure S2


[Figure S2 image, panels A to D. Panel A legend: Target ~ TF, Target ~ CpG, Target ~ TF + CpG. Panel B scales: overlapped genes; -log10(Pv) of the overlap. Cancer type labels: ESCA, SKCM, THCA, LIHC, LUSC, PCPG, STAD, CESC, SARC, PAAD, LAML, KIRC, HNSC, COAD, LGG, UCEC, KIRP, LUAD, BLCA, BRCA, TGCT. Process labels in panels C and D (color scale: -log10(Pv), 0 to 20): cell-substrate adhesion; small GTPase mediated signal transduction; positive regulation of hydrolase activity; actin filament-based process; cell-cell adhesion; negative regulation of cell differentiation; establishment or maintenance of cell polarity; bone development; digestive tract development; positive regulation of kinase activity; cell morphogenesis involved in differentiation; retina development in camera-type eye; T cell activation; regulation of lymphocyte activation; antigen processing and presentation of peptide antigen via MHC class Ib; lymphocyte homeostasis; cranial nerve morphogenesis; cell fate commitment; morphogenesis of an epithelium; regionalization.]

Figure S2. Prediction powers of the MeTRNs for gene expression profiles and comparison with Target ~ CpG

and Target ~ TF models. Related to Figure 3.

(A) Linear regression models with different predictor variables were used to fit the expression profile of each target

gene in the MeTRN for each cancer type. The coefficient of determination (R2) for each gene was calculated from

the model of Target ~ TFs + CpG sites, Target ~ TFs, or Target ~ CpG with data from a particular cancer type.

Finally, box plots were prepared to show the distributions of the R2 values of all the genes with these 3 different

models for each cancer type. (B) The numbers of genes that are heavily regulated via the methylation-dependent

transcriptional machinery (R2 from the linear combination model Target ~ TF + CpG greater than 0.4, from Data S3)

in each cancer type are shown along the diagonal of the rotated matrix. These genes were then compared between

each pair of cancers; the number (upper triangle) and the P-value (lower triangle) of the overlap are given in the

matrix. (C) The biological and physiological processes enriched in the top genes that are highly dependent on the

CpG sites in the MeTRNs (R2 of Target ~ CpG > 0.4). (D) The biological and physiological processes enriched in

the top genes that are highly dependent on the TFs in the MeTRNs (R2 of Target ~ TF > 0.4). Saturation of the color

indicates the statistical significance (-log10(Pv)) of each term.
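The comparison of the three regression models in panel A can be sketched as follows. This is a minimal illustration with ordinary least squares on synthetic data, assuming one TF and one CpG site per target gene; the helper names (`solve`, `r_squared`) and the simulated values are hypothetical and do not reproduce the actual analysis scripts.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(predictors, y):
    """R2 of a least-squares fit of y on the given predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic tumor samples: TF expression, CpG methylation (0 to 1), target expression.
rng = random.Random(0)
n_samples = 200
tf = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
cpg = [rng.random() for _ in range(n_samples)]
target = [1.5 * t - 2.0 * c + rng.gauss(0.0, 0.5) for t, c in zip(tf, cpg)]

r2_tf = r_squared([tf], target)          # Target ~ TF
r2_cpg = r_squared([cpg], target)        # Target ~ CpG
r2_both = r_squared([tf, cpg], target)   # Target ~ TF + CpG
print(r2_tf, r2_cpg, r2_both)
```

Because the combined model contains both single-predictor models as special cases, its R2 on the same data cannot be lower than either of theirs; the informative quantity is how much it gains over each.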


Figure S3

Figure S3. Scatter plots of the coefficients of determination (R2) from different linear regression models. Related to Figure 3.

Scatter plots, similar to the one shown in Fig. 3B, for each of the other 20 cancer types.

[Figure S3 image. Axis labels: R2 of Target ~ TF + CpG; R2 of Target ~ CNV. Scale: 0 to 1.]

Figure S4

[Figure S4 image. Panel titles, in order: LGG, PAAD, BRCA, SARC, LAML, KIRC, LIHC, SKCM, CESC, LUSC, HNSC, PCPG, THCA, ESCA, UCEC, BLCA, LUAD, TGCT, STAD, COAD. Heat map labels: TFs, CpGs, Tumors. Color scale: methylation or TF level, 0 to 1. Subgroup legend: k1 to k9.]

Figure S4. Cancer type-specific classifications of the patients based on the coupled TFs and CpG sites in the MeTRNs. Related to Figure 6.

For each of the 20 cancer types, unsupervised hierarchical clustering analysis was performed with the TF expression and CpG site methylation profiles of the DNA methylation-modulated transcriptional regulation circuits in the MeTRN that exhibited strong prediction power for the target gene expression profiles, as shown with the linear combination model Target ~ TF + CpG. Silhouette-based cutting was applied to determine the optimal number of subgroups for each cancer type. The patient subgroups resulting from the clustering analysis are marked with different colors. The 20 cancers are sorted by the significance levels (P-values) of the prognostic difference between the subtypes.
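Silhouette-based selection of the number of subgroups can be sketched as below. The caption does not state the distance metric, the linkage rule, or the range of subgroup numbers that was searched, so Euclidean distance, average linkage, and a search over 2 to 9 subgroups are assumptions of this illustration, as are the function names and the synthetic two-dimensional "patients".

```python
import math
import random

def average_linkage(points, k):
    """Agglomerative clustering (average linkage), merged until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(math.dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    scores = []
    for idx, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != idx]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[idx], points[j]) for j in own) / len(own)
        b = min(sum(math.dist(points[idx], points[j]) for j in g) / len(g)
                for other, g in groups.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Synthetic profiles: three well-separated groups of ten samples each.
rng = random.Random(1)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in centers for _ in range(10)]

scores = {k: mean_silhouette(points, average_linkage(points, k)) for k in range(2, 10)}
best_k = max(scores, key=scores.get)
print(best_k)
```

Cutting the same agglomerative tree at each candidate number of clusters and keeping the cut with the highest mean silhouette width is one common reading of "Silhouette-based cutting".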


Figure S5

[Figure S5 image: Kaplan-Meier curves of overall survival (0 to 1) against time in months (0 to 60) for subgroups k1 to k9 in each cancer type. P-values printed on the panels:]

Cancer  Subgroup pair  P-value
LGG     k2, k4         4.4955e-14
PAAD    k1, k3         9.3001e-04
BRCA    k2, k3         1.4740e-03
SARC    k1, k2         2.1034e-03
LAML    k2, k7         3.2541e-03
KIRC    k2, k4         4.7267e-03
LIHC    k1, k4         9.3813e-03
SKCM    k3, k4         1.5047e-02
CESC    k1, k2         2.1601e-02
LUSC    k2, k4         6.5707e-02
HNSC    k1, k3         7.0604e-02
PCPG    k2, k4         1.3760e-01
THCA    k1, k4         1.6704e-01
ESCA    k1, k3         2.0227e-01
UCEC    k2, k4         2.7323e-01
BLCA    k1, k2         2.9161e-01
LUAD    k1, k4         3.1344e-01
TGCT    k2, k3         3.1731e-01
STAD    k3, k4         3.7629e-01
COAD    k1, k2         6.3286e-01

Figure S5. Survival curves of the cancer subtypes identified by the MeTRN regulators. Related to Figure 6.

Cancer type-specific Kaplan–Meier survival curves showing comparisons of overall survival between different

subgroups of patients, which were classified based on the MeTRN regulators and shown in Fig. S4. P-values for the

statistical significance of the largest prognosis difference in each cancer were inferred with log-rank tests. The 20

cancers were organized in the same order as shown in Fig. S4.
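A two-group log-rank test, and the search for the subgroup pair with the largest prognostic difference, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the survival times are made up, and "largest difference" is taken here to mean the pair of subgroups with the smallest log-rank P-value.

```python
import math
from itertools import combinations

def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. events: 1 = death observed, 0 = censored.
    Returns (chi-square statistic with 1 degree of freedom, P-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e}):
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square (1 df) upper tail
    return chi2, p_value

def largest_difference(groups):
    """groups: {name: (times, events)}. Returns (smallest P-value, subgroup pair)."""
    best = None
    for a, b in combinations(sorted(groups), 2):
        _, p = logrank(*groups[a], *groups[b])
        if best is None or p < best[0]:
            best = (p, (a, b))
    return best

# Hypothetical subgroups: survival time in months and event indicator.
groups = {
    "k1": ([2, 3, 4, 5, 6, 7, 8, 9], [1, 1, 1, 1, 1, 1, 1, 1]),
    "k2": ([30, 35, 40, 45, 50, 55, 60, 60], [1, 1, 1, 1, 1, 0, 0, 0]),
    "k3": ([5, 9, 14, 20, 26, 33, 41, 60], [1, 1, 1, 1, 1, 1, 1, 0]),
}
p_best, pair = largest_difference(groups)
print(pair, p_best)
```

The labels printed on the panels, such as p(k2,k4), follow the same idea: one P-value per cancer type, reported for the pair of subgroups that differ most in overall survival.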


Figure S6


[Figure S6 image: Kaplan-Meier curves of overall survival (0 to 1) against time in months (0 to 60) for subgroups k1 to k9. Panel titles, in order: LGG, PAAD, BRCA, SARC, LAML, KIRC, LIHC, SKCM, CESC, LUSC, HNSC, PCPG, THCA, ESCA, UCEC, BLCA, LUAD, TGCT, STAD, COAD. P-value labels in extraction order (their assignment to individual panels is not fully recoverable from the extracted text): p(k1,k3)=1.0353e-11; p(k1,k5)=2.7874e-03; p(k1,k2)=1.6172e-01; p(k2,k4)=3.4149e-04; p(k2,k4)=3.3632e-01; p(k2,k4)=2.1546e-02; p(k2,k4)=7.3213e-02; p(k2,k3)=5.2358e-01; p(k1,k4)=1.8808e-01; p(k2,k4)=3.7358e-02; p(k1,k2)=1.1829e-01; p(k1,k2)=2.8350e-01; p(k1,k4)=2.1362e-01; p(k1,k2)=1.1193e-01; p(k3,k4)=1.4078e-03; p(k1,k2)=7.3507e-02; p(k3,k4)=4.4643e-01; p(k2,k3)=7.0668e-02; p(k1,k2)=5.0251e-01; p(k1,k2)=4.4931e-01.]

Figure S6. Survival curves of the cancer subtypes identified by the top variable CpG sites. Related to Figure 6.

Cancer type-specific Kaplan–Meier survival curves showing comparisons of overall survival between different subgroups of patients, which were classified based on the most highly variable CpG sites. P-values for the statistical significance of the largest prognosis difference in each cancer were inferred with log-rank tests. The 20 cancers were organized in the same order as shown in Fig. S4.


Figure S7


[Figure S7 image: Kaplan-Meier curves of overall survival (0 to 1) against time in months (0 to 60) for subgroups k1 to k9. Panel titles, in order: LGG, PAAD, BRCA, SARC, LAML, KIRC, LIHC, SKCM, CESC, LUSC, HNSC, PCPG, THCA, ESCA, UCEC, BLCA, LUAD, TGCT, STAD, COAD. P-value labels in extraction order (their assignment to individual panels is not fully recoverable from the extracted text): p(k1,k4)=2.5282e-11; p(k1,k4)=3.1144e-04; p(k1,k2)=1.1346e-02; p(k1,k4)=9.3607e-02; p(k1,k2)=1.6055e-02; p(k1,k2)=7.3657e-02; p(k2,k4)=2.2146e-02; p(k1,k4)=3.6606e-02; p(k3,k4)=1.3167e-01; p(k1,k2)=5.5357e-02; p(k3,k4)=2.4150e-01; p(k3,k4)=3.4325e-01; p(k3,k4)=1.5175e-02; p(k3,k4)=9.4658e-02; p(k2,k4)=3.5353e-03; p(k2,k4)=1.0985e-01; p(k3,k4)=1.9856e-01; p(k2,k3)=7.9838e-02; p(k3,k4)=1.2537e-01; p(k1,k2)=6.3928e-01.]

Figure S7. Survival curves of the cancer subtypes identified by TFs. Related to Figure 6.

Cancer type-specific Kaplan–Meier survival curves showing comparisons of overall survival between different subgroups of patients, which were classified based on the expression of TFs with strong prediction power for target gene expression. The TF-target associations were inferred with ARACNE, which does not take the DNA methylome profiles into account. P-values for the statistical significance of the largest prognosis difference in each cancer were inferred with log-rank tests.


SUPPLEMENTARY TABLES

Table S1. Sample numbers of the datasets from 21 cancer types in TCGA. Related to Figure 1.

Tumor samples

Cancer  Type*                                                               Me  mRNA   CNV  Overlap   SNP
BLCA    bladder urothelial carcinoma                                       411   406   407      402   396
BRCA    breast invasive carcinoma                                          737  1099  1086      726   987
CESC    cervical squamous cell carcinoma and endocervical adenocarcinoma   309   306   297      294   194
COAD    colon adenocarcinoma                                               285   286   454      265   268
ESCA    esophageal carcinoma                                               186   185   185      184   184
HNSC    head and neck squamous cell carcinoma                              530   516   524      510   509
KIRC    kidney renal clear cell carcinoma                                  320   532   529      312   417
KIRP    kidney renal papillary cell carcinoma                              276   291   288      272   161
LAML    acute myeloid leukemia                                             194   173   191      163     0
LGG     brain lower grade glioma                                           530   528   527      525   530
LIHC    liver hepatocellular carcinoma                                     379   373   372      366   198
LUAD    lung adenocarcinoma                                                451   513   518      441   543
LUSC    lung squamous cell carcinoma                                       359   501   501      356   177
PAAD    pancreatic adenocarcinoma                                          185   179   185      178   147
PCPG    pheochromocytoma and paraganglioma                                 184   184   166      166   183
SARC    sarcoma                                                            245   261   261      237   259
SKCM    skin cutaneous melanoma                                            465   474   471      456   364
STAD    stomach adenocarcinoma                                             395   415   441      370   379
TGCT    testicular germ cell tumors                                        156   156   150      150   155
THCA    thyroid carcinoma                                                  515   513   512      510   405
UCEC    uterine corpus endometrioid carcinoma                              432   174   540      172   248
Total:                                                                    7544  8065  8605     7055  6704

*Note: 12 out of 33 types of cancer in TCGA were not included in the present study due to limited numbers of

tumor samples with data required for the analyses. These missing cancers are ACC, CHOL, DLBC, KICH, GBM,

MESO, OV, PRAD, READ, THYM, UCS, UVM.


Table S2. Statistics of the 21 cancer context-specific MeTRNs. Related to Figure 1.

Cancer type  Triplets   TFs  CpG sites  Target genes  TF-Target  CpG-Target
BLCA            55163  1283      16701          5029      31773       17053
BRCA            67174  1363      20042          5660      42036       20545
CESC            74993  1301      20869          6068      48104       21481
COAD            47030  1375      10174          4288      43741       10300
ESCA            36594  1184      13372          5242      33587       13628
HNSC            66102  1346      19707          5695      32924       20248
KIRC            39416  1372      13759          5095      32791       14177
KIRP            54481  1321      15511          5252      38913       16021
LAML            25233  1263      10293          4384      23829       10583
LGG             57696  1380      19061          6340      36272       19725
LIHC            64274  1260      17635          5245      40498       18015
LUAD            58763  1341      16782          4990      35213       17181
LUSC            68522  1375      12064          4778      62270       12322
PAAD            32450  1415      13056          4434      23320       13300
PCPG            38298  1367      11721          4899      36445       11900
SARC            85445  1286      22115          6453      55765       22816
SKCM            89500  1291      19585          5611      52883       20162
STAD            78465  1354      24114          6233      47005       24755
TGCT            88692  1415      21927          6434      86890       22457
THCA            52042  1369      15116          5988      48378       15689
UCEC            49578  1297      17154          5230      32666       17584
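One way to read the columns of Table S2 is as counts of distinct elements in a list of (TF, CpG site, target gene) triplets. The sketch below illustrates that reading; it is an assumption about how the statistics relate to each other, not a description of how the table was actually generated, and the identifiers in the example are hypothetical.

```python
def metrn_statistics(triplets):
    """Summarize a list of (tf, cpg, target) triplets in the style of Table S2."""
    unique = set(triplets)
    return {
        "Triplets": len(unique),
        "TFs": len({tf for tf, _, _ in unique}),
        "CpG sites": len({cpg for _, cpg, _ in unique}),
        "Target genes": len({target for _, _, target in unique}),
        "TF-Target": len({(tf, target) for tf, _, target in unique}),
        "CpG-Target": len({(cpg, target) for _, cpg, target in unique}),
    }

# Hypothetical toy network with two TFs, three CpG sites, and two target genes.
example = [
    ("TF_A", "cg001", "GENE1"),
    ("TF_A", "cg002", "GENE1"),
    ("TF_B", "cg002", "GENE1"),
    ("TF_B", "cg003", "GENE2"),
]
stats = metrn_statistics(example)
print(stats)
```

Under this reading, the number of TF-Target pairs can never exceed the number of triplets, which holds for every row of the table.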
