Resource
Dependency of the Cancer-Specific Transcriptional Regulation Circuitry on the Promoter DNA Methylome
Graphical Abstract
[Graphical abstract. Panel labels: DNA methylation; mRNA expression; DNA copy number; MeTRN: DNA Methylation-dependent Transcription Regulatory Networks; 21 Cancers in TCGA]
Highlights
• An analysis pipeline based on information theory for TCGA
cancer multi-omics data
• Genome-wide surveys of promoter CpG sites in modulating
transcription in cancers
• Transcription factors and CpG sites coupled in determining
gene expression dynamics
• Resource for dissecting the gene expression dysregulation
machineries in cancers
Liu et al., 2019, Cell Reports 26, 3461–3474
March 19, 2019 © 2019 The Author(s).
https://doi.org/10.1016/j.celrep.2019.02.084
Authors
Yu Liu, Yang Liu, Rongyao Huang, ...,
Shengcheng Dong, Yang Yang,
Xuerui Yang
In Brief
Using an analysis pipeline based on
information theory and tailored for cancer
multi-omics data in TCGA, Liu et al.
conducted genome-wide surveys of the promoter
DNA methylome in modulating
transcriptional regulation circuits in
cancers. Results serve as a resource for
dissecting gene expression
dysregulation in cancer.
Cell Reports
Resource
Dependency of the Cancer-Specific Transcriptional Regulation Circuitry on the Promoter DNA Methylome
Yu Liu,1,2,3,4,6 Yang Liu,1,3,4,5,6 Rongyao Huang,1,3,4,6 Wanlu Song,1,2,3,4 Jiawei Wang,1,3,4 Zhengtao Xiao,1,2,3,4 Shengcheng Dong,1,3,4 Yang Yang,1,3,4 and Xuerui Yang1,3,4,7,*
1MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
2Tsinghua-Peking Joint Center for Life Sciences, Beijing 100084, China
3Center for Synthetic & Systems Biology, Tsinghua University, Beijing 100084, China
4School of Life Sciences, Tsinghua University, Beijing 100084, China
5Joint Graduate Program of Peking-Tsinghua-National Institute of Biological Science, Tsinghua University, Beijing 100084, China
6These authors contributed equally
7Lead Contact
*Correspondence: [email protected]
https://doi.org/10.1016/j.celrep.2019.02.084
SUMMARY
Dynamic dysregulation of the promoter DNA methylome is a signature of cancer. However, a comprehensive understanding of how the DNA methylome is incorporated in the transcriptional regulation circuitry and involved in regulating the gene expression abnormality in cancers is still missing. We introduce an integrative analysis pipeline based on mutual information theory and tailored for the multi-omics profiling data in The Cancer Genome Atlas (TCGA) to systematically find dependencies of transcriptional regulation circuits on promoter CpG methylation profiles for each of 21 cancer types. By coupling transcription factors with CpG sites, this cancer type-specific transcriptional regulation circuitry recovers a significant layer of expression regulation for many cancer-related genes. The coupled CpG sites and transcription factors also serve as markers for classifications of cancer subtypes with different prognoses, suggesting physiological relevance of such regulation machinery recapitulated here. Our results therefore generate a resource for further studies of the epigenetic scheme in gene expression dysregulations in cancers.
INTRODUCTION
DNA methylation of CpG dinucleotides has been shown to play
critical roles in pluripotency, development, and various diseases
(Robertson, 2005; Smith and Meissner, 2013). Alteration of the
methylation status at promoter CpG sites is associated with
expression dysregulation of many cancer-related genes and is
therefore recognized as one of the major driving factors of tumor
initiation and development (Chatterjee and Vinson, 2012; Jones,
2012). Promoter DNA methylation is generally considered as a
potent epigenetic repressor of gene transcription by blocking
the recruitment of transcription factors (TFs) (Blattler and Farn-
ham, 2013; Domcke et al., 2015; Jones, 2012), while recent
studies have also uncovered many TF binding events that
actually depend on methylated CpG (mCpG) (Hu et al., 2013;
Liu et al., 2012; Zhu et al., 2016). These studies were focused
on detailed machineries of specific genes being regulated by
CpG methylation-sensitive TFs in various contexts. However,
genome-wide coupling of specific functional promoter CpG sites
with the transcriptional regulation circuits from TFs to target
genes has not been systematically revealed yet. Given the large
number of promoter CpG sites, of which the functions are mostly
unknown, systematic inference of the transcriptional regulatory
networks that take into account the potential modulatory func-
tion of the promoter DNA methylome would generate a compre-
hensive view of the epigenetic scheme of gene transcription
regulation and an insightful resource for further mechanistic
studies.
Transcription regulatory networks (TRNs), which are
composed of regulatory circuits from TFs to their target genes,
provide detailed maps of gene expression regulations in specific
cellular contexts (Vaquerizas et al., 2009). Various experimental
(Hu et al., 2007; Johnson et al., 2007; Neph et al., 2012a; Sopko
et al., 2006) and computational (Basso et al., 2005; Marbach
et al., 2012; Margolin et al., 2006; Wang et al., 2009b) methodol-
ogies have been developed to assemble the transcription regu-
latory networks in different contexts. Recently, strategies of inte-
grating multi-source information have also been used for
systematic recovery of the context-specific gene regulatory
repertoire (Budden et al., 2014; Jiang et al., 2015; Li et al.,
2014; Wang et al., 2013). Although previous experimental and
integrative strategies revealed highly valuable and biologically
relevant insights into the gene regulatory logic, they relied sub-
stantially on availability and quality of regulation profiling data,
such as chromatin immunoprecipitation sequencing (ChIP-
seq), DNaseI sequencing (DNaseI-seq), and histone modifica-
tion. On the other hand, computational de novo reverse engi-
neering of the transcription networks usually uses mathematical
and statistical approaches to uncover dependencies between
TFs and targets from gene expression datasets (Basso et al.,
2005; Marbach et al., 2012; Margolin et al., 2006; Wang et al.,
2009b). These methods generated genome-wide surveys of
TF-target associations, which are very useful for systematically
understanding the context-specific transcriptional regulation
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Figure 1. Genome-wide Identifications of the TF-Target Gene Regulatory Circuits Modulated by Promoter CpG Methylation Levels
(A) Schematic description of the methodology for
search of the TF-target gene circuits that depend
on the level of methylation at specific promoter
CpG sites. Taking a TF i (TFi), a potential target
gene j (Genej) and its DNA copy number (CNVj), as
well as a CpG site m within the promoter region of
the target gene j (CpGjm) as examples, the diagram
shows how the multi-omics data were used to test
whether the TF-target gene association depends
on promoter CpG methylation. Simulated data
entries of true negatives were added into the
original datasets and used to estimate false dis-
covery rates at each step of the pipeline. Details of
the pipeline are provided in the Method Details
section.
(B and C) Examples showing positive (B) and
negative (C) correlation patterns between the
methylation levels at promoter CpG sites and the
gene expression profiles in tumors of BLCA.
programs, especially when the high-throughput experimental
methods are not applicable, such as when dealing with clinical
samples from limited sources. However, most of these previous
efforts did not take DNA methylation information into account
for de novo reconstruction of the cancer type-specific transcription
regulatory networks, despite extensive findings about
the interplay between DNA methylation and TFs in determining
the gene expression profile. This is partially due to lack of suit-
able data and specially designed analysis tools.
In this study, to systematically assess the potential involve-
ment of DNA methylation in gene transcriptional regulation, we
performed an integrative analysis of multi-omics data, including
genome-wide mRNA expression, DNA copy number variation
(CNV), and CpG methylation data for 21 cancer types that
have at least 150 independent tumor samples from The Cancer
Genome Atlas (TCGA). Specifically, we developed a mutual in-
formation-based algorithmic pipeline to
screen for TF-target transcriptional regu-
lation circuits that depend on the methyl-
ation levels of specific promoter CpG
sites for each cancer type (Figure 1A).
Cancer type-specific DNA methylation-
dependent transcription regulatory net-
works (MeTRNs) were assembled from
these circuits for 21 types of cancer and
then cross validated with ChIP-seq data
of 161 TFs. We further showed that the
expression dynamics of many cancer-
related genes can be largely attributed
to these epigenetically modulated transcriptional
regulatory circuits in the cancer
type-specific MeTRNs. Finally, in multiple
cancer types, the functional CpG
sites coupled with the TFs in the MeTRNs
serve as effective classifiers for prognos-
tically different patient subgroups. We believe that such compre-
hensive collections of the context-specific transcriptional regu-
latory circuits and the modulatory CpG sites will be a resource
for dissecting the driving forces of gene expression dysregula-
tion in tumors and understanding the transcriptional regulation
machinery underlying cancer.
RESULTS
Identification of the DNA Methylation Level-Dependent Transcriptional Regulation Circuits
Promoter DNA methylation at CpG sites has been recognized as
a critical modulator of gene transcription by interfering with TF
binding (Hu et al., 2013; Jones, 2012; Liu et al., 2012). Using
the gene expression and promoter CpG methylation profiles
of the tumor samples from 21 major cancers in TCGA (see
Table S1 for a summary of the samples) (Liu et al., 2018), we per-
formed a comprehensive survey of the correlations between
gene expression and promoter CpG methylation levels across
tumors for each of the 21 cancers (statistics of the CpG-gene
pairs are supplied in Liu et al., 2018). As expected, many of the
promoter CpG-gene pairs showed medium to high levels of cor-
relations between the CpG methylation and gene expression
profiles (Liu et al., 2018), and two examples are shown in Figures
1B and 1C.
On the basis of our current understanding of the function
of promoter CpG methylation in interfering with the binding of TFs to
target genes, we set a goal to systematically identify the
methylation level-dependent transcriptional regulation circuits
in different cancers, on a genome-wide scale and at the
single-CpG site level. Using well-established tools from information
theory (Kraskov et al., 2004; Wang et al., 2009a, 2009b), we
developed an analysis pipeline (Figure 1A) that takes advantage
of the highly dynamic tumor multi-omics data in TCGA to
assess the involvement of specific promoter CpG sites in deter-
mining the association between TFs and their potential target
genes for each cancer type. Specifically, for each type of can-
cer, we calculated the conditional mutual information (cMI) be-
tween expression profiles of a gene and a candidate TF in the
tumor samples, given the methylation profile of a promoter
CpG site of this gene, i.e., cMI(TF, Gene | CpG). This quantifies
the additional predictive information that the methylation status
of this CpG site provides toward the gene expression that is
transcriptionally driven by the TF. In other words, the cMI is
an assessment of the dependency of a transcriptional regula-
tion on the methylation level of a CpG site. In practice, because
the DNA copy number is also a major defining factor of gene
expression, we incorporated the CNV data to preclude the ef-
fect of DNA copy number when quantifying the modulating
function of CpG site methylation on transcriptional regulations.
As shown by the analysis pipeline in Figure 1A, the cMI for each
possible combination of a gene j, a CpG site m in the promoter
region, and a TF i, which forms a triplet (i, j, m), was calculated
and tested for statistical significance through two steps. The
first step was implemented to search for the candidate TF-
target circuits that depend on promoter CpG methylation
and/or DNA copy number, and the second step further identi-
fied the CpG methylation-dependent TF-target associations
from the list narrowed down by the first step. Both steps used
the strategy of sample shuffling to estimate the real false dis-
covery rate (FDR) by comparing the calculated cMIs with null
distributions of cMIs generated by multiple types of sample
permutations.
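To make the quantity concrete, the following is a minimal sketch of a conditional mutual information test; it is not the pipeline used in this study. It assumes a simple equal-frequency binning plug-in estimator in place of the nearest-neighbor estimator of Kraskov et al. (2004), conditions on the CpG site only (the CNV is omitted), and uses a single permutation scheme (shuffling the TF profile across samples). The function names `rank_bins`, `binned_cmi`, and `permutation_pvalue`, the number of bins, and the number of permutations are illustrative assumptions.

```python
import math
import random
from collections import Counter


def rank_bins(values, n_bins=3):
    """Equal-frequency discretization: assign each value a bin index by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def entropy(labels):
    """Plug-in Shannon entropy (in nats) of a sequence of hashable labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def binned_cmi(tf, gene, cpg, n_bins=3):
    """I(TF; Gene | CpG) = H(TF,CpG) + H(Gene,CpG) - H(TF,Gene,CpG) - H(CpG)."""
    t, g, c = (rank_bins(v, n_bins) for v in (tf, gene, cpg))
    return (entropy(list(zip(t, c))) + entropy(list(zip(g, c)))
            - entropy(list(zip(t, g, c))) - entropy(c))


def permutation_pvalue(tf, gene, cpg, n_perm=200, seed=0):
    """Observed cMI and its empirical p value against a TF-shuffled null."""
    rng = random.Random(seed)
    observed = binned_cmi(tf, gene, cpg)
    shuffled = list(tf)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the TF-sample pairing (assumed null)
        if binned_cmi(shuffled, gene, cpg) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Each argument is a list of per-sample values for one TF, one target gene, and one promoter CpG site. A triplet whose observed cMI rarely falls below the shuffled values would be a candidate CpG-dependent circuit under these assumptions.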
In addition, to further estimate the final FDRs of our pipeline,
we added spike-in data entries at the very beginning of the
analysis pipeline (Figure 1A). Specifically, negative data of
randomly selected genes was simulated by sample permuta-
tion and added into the real datasets as inputs of our pipeline.
These entries, by definition, are true negatives. We calibrated
the cut-offs of the two major steps in our pipeline by control-
ling the discovery rates of these true negatives. Specifically,
the first step recovered fewer than 5% of these true negatives,
which entered into the second step. More conservative FDR
cut-offs were used for the second step, which eventually
removed all the simulated true negatives. See the Method De-
tails section for more details.
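The spike-in calibration described above can be sketched as follows, under stated assumptions. The sketch assumes that each entry has already been reduced to a single score (for example, a cMI value) and that an entry passes a step when its score is strictly greater than the cut-off; the names `make_spike_in`, `calibrate_cutoff`, and `recovered_fraction` are hypothetical and are not taken from the published pipeline.

```python
import math
import random


def make_spike_in(profile, rng):
    """Simulate a true-negative entry by shuffling one profile across samples."""
    shuffled = list(profile)
    rng.shuffle(shuffled)
    return shuffled


def calibrate_cutoff(spike_in_scores, max_recovered=0.05):
    """Cut-off that lets at most `max_recovered` of the spike-ins pass.

    An entry passes when its score is strictly greater than the cut-off.
    """
    ordered = sorted(spike_in_scores, reverse=True)
    allowed = min(math.floor(max_recovered * len(ordered)), len(ordered) - 1)
    return ordered[allowed]


def recovered_fraction(spike_in_scores, cutoff):
    """Fraction of spike-in true negatives that survive the cut-off."""
    return sum(s > cutoff for s in spike_in_scores) / len(spike_in_scores)
```

Calling `calibrate_cutoff(scores, 0.05)` mimics a permissive first step, and `calibrate_cutoff(scores, 0.0)` mimics a second step that removes every simulated true negative.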
Eventually, for each type of cancer, the genome-wide screen
yielded a collection of triplets, of which the statistical signifi-
cance of the cMI passed the filters shown in Figure 1A (Data
S1). These triplets, representing the TF-target transcriptional
regulation circuits that depend on the DNA methylation levels
of specific CpG sites, were summarized as a network, named
as DNA MeTRN, for each of these 21 types of cancer (Data
S1). Basic statistics of these networks are summarized in
Table S2 and Figure S1A. On average, each network involves
more than 1,000 TFs regulating ~5,000 target genes, which
are modulated by ~16,000 CpG methylation sites. A web-based
search portal has been built for the MeTRNs
(http://labyang.com/resources/metrn/table.html), which allows users
to search for the MeTRN circuits in one or more cancers based
on inputs of TF(s), target gene(s), or promoter CpG site(s) of
interest.
Systematic Cross-Validations and Examples of the MeTRN Circuitry
Given the large scales of the MeTRNs, we used 161 TF-specific
ChIP-seq datasets for systematic validations of the transcrip-
tional regulation circuits in MeTRNs. The datasets for 161 TFs
in various cell contexts, which are mostly cell lines, were
collected from the Encyclopedia of DNA Elements (ENCODE)
project (Neph et al., 2012b). Each dot in Figure S1B represents
a TF and shows the percentage of the predicted targets in a
MeTRN that are supported by the ChIP-seq data of that particular
TF. For each cancer type, the validation rates of the predicted
targets of approximately 100 TFs range from lower than 10% to
higher than 90%, with a median of approximately 40%
(Figure S1B). The apparently low validation rates for some TFs are
not surprising, given that the ChIP-seq datasets were generated
just in one or two cell lines, and transcriptional regulation has
been well recognized for its high context specificity. More impor-
tantly, most of the ChIP-seq experiments and the peak-calling
procedures were optimized to control for false positives at the
cost of high false-negative rates. In fact, even for ChIP-seq datasets
of the same TF but from different biological replicates or different
cell types, their shared ChIP-seq peaks can be as low as 10%
and usually will not be higher than 60% (data not shown), which
is also supported by similar observations in literature (Yang et al.,
2014).
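As a minimal illustration of the per-TF validation rate discussed above, the sketch below computes the fraction of predicted targets that also appear in a ChIP-seq target set, together with the median across TFs. It assumes that ChIP-seq support has already been reduced to gene-level target sets; how a peak is assigned to a gene is not specified here, and the function names are hypothetical.

```python
from statistics import median


def validation_rate(predicted_targets, chipseq_targets):
    """Fraction of predicted targets supported by ChIP-seq (None if no targets)."""
    predicted = set(predicted_targets)
    if not predicted:
        return None
    return len(predicted & set(chipseq_targets)) / len(predicted)


def summarize_validation(predicted_by_tf, chipseq_by_tf):
    """Per-TF validation rates and their median, for TFs with ChIP-seq data."""
    rates = {}
    for tf, predicted in predicted_by_tf.items():
        if tf in chipseq_by_tf:
            rate = validation_rate(predicted, chipseq_by_tf[tf])
            if rate is not None:
                rates[tf] = rate
    return rates, (median(rates.values()) if rates else None)
```

Both inputs are dictionaries that map a TF name to a collection of gene identifiers; TFs without ChIP-seq data are skipped rather than counted as unvalidated.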
In addition, it is worth noting that MeTRN circuitry by definition
is not a comprehensive prediction of all the targets for each TF.
Instead, MeTRN enriches and captures the TF-target circuits
that depend on promoter CpG site(s), while the TF ChIP-seq signals
do not differentiate CpG-dependent from CpG-independent
TF-binding targets. Since the DNA methylome is known for its
context dependency, the CpG-dependent TF-target circuits
presumably should be more dependent on the tissue context
than other regular TF-target circuits are. Therefore, it is highly
possible that the ChIP-seq signals could miss many of the
MeTRN targets.
Nevertheless, to further assess involvements of the CpG
sites in the transcriptional regulatory circuits in the MeTRNs,
for each TF, we compared the ChIP-seq peak signals around
Figure 2. Target Genes of 3 TFs, JUNB,
STAT1, and JUND, which Are Present in the
MeTRNs
(A–C) Target genes of 3 TFs, JUNB (A), STAT1 (B),
and JUND (C), which are present in the MeTRNs of
at least 3 cancer types. The biological functions and
processes enriched in the target gene lists were
provided, and the genes annotated to an enrich-
ment term were shown by a line connecting the
term and the gene. Target genes supported by TF-
specific ChIP-seq data were highlighted with blue
outline.
the predicted CpG sites of the target genes with the peaks
around the other promoter CpG sites of the same genes but
not included in the MeTRNs. Analyses of the ChIP-seq data for
161 TFs showed that, in general, the modulatory CpG sites
recovered by the MeTRNs for each TF do have stronger ChIP-
seq peak signals, i.e., higher probabilities of TF binding, than
the other CpG sites that are not in the MeTRNs (Figure S1C).
Taken together, the ChIP-seq data of these 161 TFs offered
high-throughput cross validations of the MeTRNs and
genome-wide assessments of the MeTRNs in recapitulating
the transcriptional regulation circuits that depend on DNA
methylation levels at specific CpG sites.
Finally, to show examples of the regulatory circuits encoded
by the MeTRNs, we focused on the known cancer-related TFs
(more than 200, sorted by median of the target numbers in 21
MeTRNs; Data S2) and looked at their regulatory targets in the
MeTRNs. Three TFs (JUNB [OMIM: 165161], STAT1 [OMIM:
600555], and JUND [OMIM: 165162]), which have ChIP-seq
data and are among the top 10 TFs with the largest numbers
of target genes, were selected as exam-
ples and their targets in the MeTRNs of
at least 2 cancer types were shown in Fig-
ure 2. The target genes supported by TF-
specific ChIP-seq data were highlighted
in the figures. Next, functional enrichment
analyses of these target genes were per-
formed. For example, the target genes of
JUNB are generally enriched in such pro-
cesses as mitogen-activated protein ki-
nase (MAPK) signaling pathway, NF-kB
pathway, inflammatory response, and
cancer pathways (Figure 2A). The targets
of STAT1 were enriched in the processes
such as ARF6 trafficking, cell-cell adhe-
sion, and metabolism (Figure 2B). The tar-
gets of JUND were generally enriched in
development, signaling pathways, meta-
bolism, and angiogenesis (Figure 2C).
Most of these TF-target regulatory circuits
have not been previously investigated in
detail, although many of them are indeed
supported by TF-specific ChIP-seq data.
These target gene sets of the TFs recovered
by the MeTRNs shed light on potential
downstream functions of the TFs, which represents a value
of the MeTRNs as a resource for extracting biological insights
and generating testable hypotheses.
Prediction Power of the MeTRNs for Gene Expression Profiles in Cancers
The mRNA expression of a gene is subject to multiple levels of
regulation, including transcriptional regulations at the DNA level
and post-transcriptional regulation. As previously discussed, the
MeTRNs serve as maps of cancer type-specific transcriptional
regulation circuitry that is dependent on the methylation levels
of specific promoter CpG sites. To estimate the weight of these
regulatory effects in the overall regulation of each particular gene
in a given cancer context, we applied regression models for the
gene expression profiles. Specifically, as mutual information was
used in our pipeline to assess dependency, from which the
MeTRNs were inferred, we used a different model to assess
the prediction powers of the TF-CpG regulation circuits in
MeTRNs to avoid circularity. Since the ground truth of
Figure 3. Prediction Powers of the MeTRNs for Gene Expression Profiles
(A) Linear regression models with different predictor variables (Target~TF+CpG sites, Target~CNV, and Target~TF+CpG sites + CNV) were used to fit the
expression profile of each target gene in the MeTRN for each cancer type. The coefficient of determination (R2) for each gene was calculated from these models
with data from a particular cancer type. Boxplots were prepared to show the distributions of the R2 values of all the genes with these 3 different models for each
cancer type. Whiskers extend to 1.5 times the interquartile range.
(B) A scatterplot example (BLCA) showing the R2 values of each gene from the two regression models (Target~TF+CpG sites and Target~CNV). The dots are
colored by the R2 values of each gene from the combinatory linear regression model of Target~TF+CpG+CNV. Several diagonals were marked by gray lines, on
which the x axis value plus the y axis value of the dots are the same. The same scatterplots for the other cancer types are provided in Figure S3.
(C and D) The biological and physiological processes enriched in the top genes that are highly dependent on the TF-CpG circuits in the MeTRNs (R2 of
Target~TF+CpG > 0.4; C) and the processes enriched in the top genes that are highly dependent on the CNV (R2 of Target~CNV > 0.4; D). Saturation of the color
indicates the statistical significance (–log10(Pv)) of each term.
joint regulation between TF and DNA methylation on target gene
expression is still unclear, we chose a linear regression model as a
convenient tool for a rough estimation of the prediction powers.
The expression profile of each target gene in a cancer type-spe-
cific MeTRN was modeled with a linear combination of the
methylation levels and expressions of the gene’s regulating
CpG sites and TFs, respectively, as predicted by the MeTRN
(Target~TF+CpG). From these linear regression models, the coefficient
of determination (R2) for each gene in the MeTRN was
calculated and adjusted for the number of explanatory terms
as an estimation of how much the gene expression could be
determined by the CpG-involved TF-target transcriptional regu-
lation machinery mapped by the MeTRN of each cancer type
(Data S3). Boxplots of these adjusted R2 values for all the target
genes in each MeTRN (Figure 3A, Target~TF+CpG) showed that
the combinations of coupled TFs and CpG sites in each MeTRN
have highly variable prediction powers for different genes.
Importantly, the overall R2 distributions from the combinatory
model of the coupled TFs and CpG sites are significantly higher
than the R2 values from the model of just the CpG sites or just the
TFs (Figure S2A), indicating a cooperative expression-deter-
mining power of the coupled TFs and CpG sites in the MeTRNs.
Finally, it is worth noting that while a good fit of the linear model
(high R2) indicates a strong determination potential, such linear
regression models actually generated relatively conservative estimations
of the prediction powers, because these models would
underestimate the non-linear association patterns. Therefore, in
our results, although high R2 values do indicate strong linear pre-
diction power, low R2 values do not necessarily indicate lack of
prediction power. They may be simply due to non-linearity of
the association between the target gene and the determining
factors.
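The regression-based estimate described above can be sketched as follows. This is a minimal example that assumes ordinary least squares with an intercept and the standard adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - k - 1) for n samples and k explanatory terms; the exact adjustment used for Data S3 is not spelled out in this section, and the function name `adjusted_r2` is illustrative.

```python
import numpy as np


def adjusted_r2(y, predictors):
    """OLS fit of y on the predictor profiles; returns (R2, adjusted R2).

    `predictors` is a list of 1-D arrays, e.g. TF expression profiles and
    CpG methylation profiles of one target gene across tumor samples.
    """
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y))] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)                  # intercept + predictors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    n, k = len(y), len(predictors)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # penalize extra terms
    return r2, adj
```

For a Target~TF+CpG model, `predictors` would hold the expression profiles of the coupled TFs and the methylation profiles of the coupled CpG sites; for Target~CNV, it would hold the copy number profile alone.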
Next, to compare the prediction power of the combined CpG
sites and TFs mapped by the MeTRNs with another major gene
expression driver, the DNA copy numbers, we also used linear
regression models to estimate the coefficient of determination
(R2) of the CNV for each gene (Target~CNV in Figure 3A, and
detailed R2 values in Data S3). In most of the cancers, the
range of the R2 values of the model Target~TF+CpG is
generally similar to or higher than the results from the model
Target~CNV, which suggests that the DNA methylation-dependent
transcriptional regulation mapped in the MeTRNs is an
important expression regulation machinery that is as potent
as the CNV.
All the comparisons above were based on overall distributions
of the R2 values. For a more specified comparison between the
prediction powers of the DNA methylation-dependent transcrip-
tional regulation and the DNA copy number for each gene, we
prepared scatterplots showing the R2 (Target~TF+CpG) and
the R2 (Target~CNV) of each gene for each cancer type (bladder
urothelial carcinoma [BLCA] in Figure 3B as an example, and all
the remaining 20 cancers are shown in Figure S3). Clearly, in
each cancer type, some genes appear to heavily rely on the
TF-CpG regulation circuits in the MeTRN, while some appear
to be primarily determined by the CNV (Figures 3B and S3). In
addition, most of the genes fall in the lower-left triangle of the
plots, suggesting that genes heavily depending on DNA methyl-
ation-modulated transcriptional regulation tend not to be highly
associated with the CNV, and vice versa. Therefore, these two
types of regulations are complementary to each other for deter-
mination of the general mRNA expression program.
Finally, another set of linear regression models that combine
all the factors discussed above were used to fit the gene expression
profiles in each cancer type (Target~TF+CpG+CNV)
(Figure 3A). Interestingly, colors of the dots in Figures 3B and
S3, which indicate the R2 values from the combinatory model
(Target~TF+CpG+CNV), are largely consistent along the diagonals
(illustrated by gray lines). This pattern means that on a per-gene
basis, the combinatory R2 is generally equivalent to a simple
summation of the two R2’s from the transcription-only model
(Target~TF+CpG) and the CNV-only model (Target~CNV). In
other words, the two types of factors (TF-CpG combination
and CNV) are independent of each other for determining the
expression profiles of specific genes.
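The link between additive R2 values and uncorrelated predictors can be checked numerically. The sketch below uses synthetic data only (not TCGA profiles) and assumes ordinary least squares with an intercept: when two predictor sets are drawn independently, the R2 of the combined model is close to the sum of the two separate R2 values, whereas correlated predictor sets break this additivity. All variable and function names are illustrative.

```python
import numpy as np


def r_squared(y, X):
    """R2 of an ordinary least-squares fit of y on X (intercept included)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())


def additivity_gap(y, block_a, block_b):
    """R2(A+B) minus [R2(A) + R2(B)]; near 0 when the blocks are uncorrelated."""
    combined = r_squared(y, np.column_stack([block_a, block_b]))
    return combined - (r_squared(y, block_a) + r_squared(y, block_b))


rng = np.random.default_rng(0)
n = 5000
tf_cpg = rng.normal(size=n)                          # stand-in for TF+CpG
cnv_indep = rng.normal(size=n)                       # independent of tf_cpg
cnv_corr = 0.8 * tf_cpg + 0.6 * rng.normal(size=n)   # correlated with tf_cpg
noise = rng.normal(size=n)

# Simulated target expression driven by both factors plus noise.
gap_indep = additivity_gap(tf_cpg + cnv_indep + noise, tf_cpg, cnv_indep)
gap_corr = additivity_gap(tf_cpg + cnv_corr + noise, tf_cpg, cnv_corr)
```

In this simulation, `gap_indep` stays close to zero while `gap_corr` is clearly negative, because the shared variance is counted twice when the two separate R2 values are summed. Strictly speaking, additivity of this kind indicates that the two predictor sets are approximately uncorrelated across samples.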
Genes that are heavily regulated via the methylation-dependent
transcriptional machinery (R2 from the model Target~TF+CpG
greater than 0.4) have different percentages of overlaps across
different cancer types (Figure S2B). Interestingly, the biological
and physiological processes enriched for these top genes (R2
(Target~TF+CpG) greater than 0.4) have relatively large overlaps
across different cancer types (Figure 3C), and many of them
have been shown to be highly related to tumorigenesis and cancer
development, for example, cell adhesion, cytoskeleton organization,
development, differentiation, signaling pathways in cancer,
and regulation of cell death. Note that the genes with high adjusted
R2 in the simple model of Target~CpG or Target~TF showed
far fewer and different functional enrichments (Figures S2C
and S2D). This outcome suggests that the MeTRN circuits of
coupled TFs and CpG sites, instead of the CpG sites or TFs alone,
are main contributors of gene expression regulation related to
some key cancer processes. In contrast, generally fewer and
different biological processes are enriched in genes that are
dependent on the CNV (R2 from the model Target~CNV greater
than 0.4) in each cancer (Figure 3D). These processes include
some very general terms such as RNA or DNA metabolism, modification,
and maturation, and they have limited overlaps across the
21 cancer types (Figure 3D).
Regulation of Cancer-Related Genes through the MeTRNs
We believe that the MeTRNs serve as a rich resource that may
shed light on some key gene expression regulation machineries,
for example, which regulators could be involved in determining
the expression dynamics of the known cancer-related
genes in different cancers. Focusing on these cancer-related
genes, we summarized their dependencies on the CNV and/or
the DNA methylation-dependent transcriptional regulation cir-
cuits in the MeTRNs for different cancers (Figure 4A; Data S4).
Some of the cancer genes showed consensus across different
cancers in their dependencies on the CNV or the MeTRN circuits.
Figure 4B lists the top 10 cancer-related genes that are highly
dependent on the MeTRN circuits. Note that the R2 values of
these 10 genes from the linear models with CpG sites only
(Target~CpG) or TFs only (Target~TF) are much lower than
those from the TF+CpG models in the same cancers (Figure 4B),
which again suggests the combinatory prediction power of the
coupled TFs and CpG sites in the MeTRNs.
For example, the top gene, FLI1 (OMIM: 193067), is strongly
dependent on the MeTRN in 16 of 21 cancers (Figure 4B; Data
S4). Interestingly, as a TF and a proto-oncogene, FLI1 has
been found to be strongly dysregulated via its promoter DNA
methylation levels in various diseases, including the autoimmune
disease scleroderma (Wang et al., 2006), leishmaniasis (Almeida
et al., 2017), and multiple cancers, such as gastric and colorectal
cancers (Lin et al., 2015; Sepulveda et al., 2016). Therefore, it is
not surprising that the promoter CpG sites, together with specific
TFs, play a major role in determining the expression of FLI1 in
multiple cancers. Furthermore, the MeTRN circuitry provides a
detailed framework of which CpG sites were coupled with spe-
cific TFs in executing such regulatory function. Figure 4C illus-
trated all the TF-CpG circuits that regulate FLI1 in at least 3 can-
cer types. ZC4H2 (OMIM: 300897) appeared to be the most
frequent TF (in 5 cancers) for FLI1 expression regulation
(Figure 4C). In the literature, it is still largely unknown which TFs are
involved in the transcriptional regulation of FLI1 in cancers,
and the TF-CpG combinations that we present here could be
candidates for further investigations. Take stomach adenocarci-
noma (STAD) as an example to study the cancer type-specific
Figure 4. Prediction Powers of the MeTRNs and CNV for Expression Profiles of the Cancer-Related Genes
(A) Linear regression models with different predictor variables were used to fit the expression profile of each cancer-related gene in the MeTRN in the tumors for
each cancer type. The coefficient of determination (R2) for each gene was calculated from these models (Target~TF+CpG sites, Target~CNV, and
Target~TF+CpG sites + CNV) with data from a particular cancer type. Finally, boxplots were prepared to show the distributions of the R2 values of all the genes
with these 3 different models for each cancer type. Whiskers extend to 1.5 times the interquartile range.
(B) Top 10 genes were shown as examples of the cancer genes that are highly dependent on the MeTRN regulators in multiple types of cancer, as shown by their
R2 values from the linear regression model of Target~TF+CpG sites. The specific R2 values from the model of Target~TF+CpG, Target~CpG sites, or Target~TFs
in particular cancer types were marked with vertical bars, of which the color indicates the type of cancer. Numbers of cancer types in which a gene was found as a
target in the MeTRN networks are shown in parentheses following the gene names.
(C) 5 TFs predicted to target FLI1 in at least 3 types of cancers. 42 out of 46 promoter CpG sites were involved in these transcriptional circuits.
(D) 18 TFs predicted to target FLI1 in STAD. 12 of these 18 TFs were predicted to target FLI1 only in STAD. 36 of the FLI1 promoter CpG sites were involved.
regulation. Recently, FLI1 has been shown to be subjected to
DNA methylation-mediated dysregulation in gastric cancer (Sepulveda
et al., 2016), but the detailed mechanisms of such regulation
are not clear. The MeTRN of STAD has mapped 18 TFs and
36 CpG sites for the transcriptional regulation circuitry of FLI1
(Figure 4D). Among these 18 TFs, EHF (OMIM: 605439) has
been shown to target FLI1 according to the ChIP-seq data. In
brief, the case of FLI1 as an example shows what types of in-
sights into the gene transcriptional regulation program can be
extracted from the MeTRNs as a resource.
Finally, considering the dynamic gene expression determina-
tion powers of the MeTRNs across different cancer contexts,
we evaluated the similarities among the 21 cancer types in de-
pendencies of the cancer genes on the cancer-specific MeTRN
Figure 5. Similarities between Cancers
Based on Prediction Powers of the MeTRNs
and CNV for the Cancer Gene Expressions
(A) Cancer similarity matrices indicating the Pear-
son correlation between the R2 values of the can-
cer genes in each pair of the cancers. The R2
values were calculated from linear regression
models of Target~TF+CpG (upper triangle) or from
the model of Target~CNV (lower triangle).
(B) Scatterplots showing the dependencies of the
cancer genes on the MeTRN regulators (left) or on
the CNV (right), in two cancers as examples, STAD
and LUAD.
circuitry. Specifically, Pearson’s correlation between the R2
values of the cancer genes for each pair of the cancers, from
the model of Target~TF+CpG, was used as a quantification of
the similarity between the two cancers (Figure 5A, upper
triangle). With the same strategy, the cancer similarities were
also calculated based on the R2 values from the model of
Target~CNV (Figure 5A, lower triangle). It appears that, in general,
dependencies of the cancer genes on the cancer-specific
MeTRNs vary substantially across different types of cancers,
whereas their dependencies on the CNV are relatively more
conserved across cancers (Figure 5A). For example, scatterplots
of the R2 values of the cancer genes in two cancers, STAD and
lung adenocarcinoma (LUAD) (Figure 5B), showed a much higher
correlation of the R2 values from Target~CNV than the R2 values
from Target~TF+CpG between the two cancers. Indeed, many
commonly known cancer genes bear copy number alterations
in multiple types of cancers, which could result in similar depen-
dencies of these genes on CNV. In contrast, it has been shown
that different tissues or cancers have highly heterogeneous
DNA methylomes (Kundaje et al., 2015; Schultz et al., 2015; Var-
ley et al., 2013), which could give rise to different levels of depen-
dencies of gene expression on CpG-modulated transcriptional
regulations across cancers. Therefore, our results indicate that
the inter-cancer heterogeneity potentially results from highly
dynamic and heterogeneous transcriptional regulation that
depends on the context-specific DNA methylomes.
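The cancer similarity measure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the nested-dictionary layout, the function names `pearson` and `cancer_similarity`, the restriction to genes present in both cancers, and the toy R2 values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cancer_similarity(r2_by_cancer):
    """Pairwise Pearson correlation of per-gene R2 values between cancers.

    r2_by_cancer maps cancer name -> {gene: R2}. Only genes present in
    both cancers of a pair are compared (an assumption of this sketch).
    """
    cancers = sorted(r2_by_cancer)
    sim = {}
    for i, a in enumerate(cancers):
        for b in cancers[i + 1:]:
            shared = sorted(set(r2_by_cancer[a]) & set(r2_by_cancer[b]))
            xs = [r2_by_cancer[a][g] for g in shared]
            ys = [r2_by_cancer[b][g] for g in shared]
            sim[(a, b)] = pearson(xs, ys)
    return sim

# Made-up R2 values, for illustration only.
toy = {
    "STAD": {"G1": 0.10, "G2": 0.40, "G3": 0.70, "G4": 0.20},
    "LUAD": {"G1": 0.15, "G2": 0.45, "G3": 0.75, "G4": 0.25},
    "KIRP": {"G1": 0.70, "G2": 0.40, "G3": 0.10, "G4": 0.60},
}
similarity = cancer_similarity(toy)
```

Running the same function once on R2 values from Target~TF+CpG and once on R2 values from Target~CNV would give the two triangles of a matrix like the one in Figure 5A.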
Classifications of Cancer Subtypes Based on the Regulatory Factors in the MeTRNs
As previously discussed, by mapping
the transcriptional regulation circuits
composed of promoter CpG sites
coupled with specific TFs, the MeTRNs
recapitulated an important up-stream
layer of gene expression regulation.
Here, we further explored the potential of
the contributing factors in the MeTRNs,
i.e., TFs and CpG sites, as classifiers of
cancer patients. Again, we selected TF-
CpG circuits that perform well in deter-
mining the target gene expressions (R2 >
0.5) for each cancer type. These collec-
tions of CpG sites and TFs were then
used for an unsupervised clustering anal-
ysis to classify patients of each cancer type (kidney renal papil-
lary cell carcinoma [KIRP] as an example shown in Figure 6A and
the other 20 in Figure S4). Cancer type-specific Kaplan-Meier
survival curves were then prepared for each of the patient sub-
groups (KIRP in Figure 6B and the other 20 in Figure S5). Ten
of the 21 cancer types, namely, KIRP, brain lower grade glioma
[LGG], pancreatic adenocarcinoma [PAAD], breast invasive car-
cinoma [BRCA], sarcoma [SARC], acute myeloid leukemia
[LAML], kidney renal clear cell carcinoma [KIRC], liver hepatocel-
lular carcinoma [LIHC], skin cutaneous melanoma [SKCM], and
cervical squamous cell carcinoma and endocervical adenocarci-
noma [CESC], showed significantly different survival curves
among different patient subgroups (smallest p value < 0.05) (Fig-
ures 6B and S5), which suggests that the TFs and CpG sites that
have major roles in determining the target gene expression are
potential prognostic biomarkers for these cancers. The other
cancer types did not show significantly different prognoses
associated with the classifications (p > 0.05, Figures S4 and S5),
suggesting that for these cancers, the epigenetically encoded
transcriptional regulation mapped by the MeTRNs may not be
a dominating determinant of cancer aggressiveness.
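To make the survival comparison concrete, the following is a minimal sketch of a two-group log-rank test, the test named in the Figure 6 legend. It assumes that patient subgroups have already been obtained from the clustering step and compares only one pair of subgroups. The function name `logrank_p`, the toy follow-up times, and the event coding (1 for death, 0 for censored) are assumptions of this sketch and are not taken from the paper.

```python
import math

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (death) or 0 (censored)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        # Patients still at risk in each group just before time t.
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)
        n_b = sum(1 for s, _, g in data if s >= t and g == 1)
        # Deaths observed exactly at time t.
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in data if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        exp_a += d * n_a / n
        if n > 1:
            var += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if var == 0:
        return 1.0
    chi2 = (obs_a - exp_a) ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Toy follow-up times in months (hypothetical, for illustration only).
good_t, good_e = [60, 62, 65, 70, 72, 80], [0, 0, 1, 0, 0, 0]
poor_t, poor_e = [5, 8, 12, 15, 20, 30], [1, 1, 1, 1, 1, 0]

p_separated = logrank_p(good_t, good_e, poor_t, poor_e)
p_identical = logrank_p(poor_t, poor_e, poor_t, poor_e)
```

With more than two subgroups, one option is to apply the test to each pair and report the smallest p value, which matches the "smallest p value" wording above; whether the authors proceeded exactly this way is not stated in this section.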
Previously, the DNA methylome profiles (usually for the most
variable CpG sites) were often used for classification of cancer
subtypes in typical cancer genomics studies (Cancer Genome
Atlas Network, 2012; Cancer Genome Atlas Research Network,
2011, 2012, 2014). In many cases, such cancer subtypes also
Figure 6. TFs and CpG Sites in MeTRNs
Serve as Classifiers of Prognostically
Different Patient Subgroups
(A) Take KIRP as an example. Unsupervised hier-
archical clustering analysis was performed with
the CpG site methylation and TF expression pro-
files of the DNA methylation-dependent tran-
scriptional regulation circuits in the MeTRN that
exhibited strong prediction power for the target
gene expression profiles, as shown from the linear
combinatory model Target~TF+CpG. The patient
subgroups (k1–k4) from the clustering analysis are
marked by different colors. Similar results for the
other 20 cancer types are provided in Figure S4.
(B–D) Kaplan-Meier survival curves showing
comparisons of the overall survival between
different subgroups of KIRP patients identified
based on their MeTRN regulators (B), the top
variable CpG sites (C), or the TFs (D) predicted by
ARACNe. The p value for the statistical signifi-
cance of the largest prognosis difference among
the cancer subtypes was inferred with a log-rank
test.
The survival curves of the other 20 cancer types
are shown in Figures S5–S7.
showed certain levels of prognostic differences. For the purpose
of comparison with the subtype classifications based on the
MeTRN regulators, we followed the canonical strategy and per-
formed tumor-clustering analysis with the promoter CpG sites
that have the most variable methylation profiles for each type
of cancer. The survival curves of these cancer subtypes are
shown in Figure 6C for KIRP and Figure S6 for the other 20 can-
cers, and similarly, the prognostic differences were quantified
through p values. Take KIRP as an example, which exhibited
the most significant prognosis differences. Compared with the
CpG sites with the most variable methylation profiles, the
coupled TFs and CpG sites in the MeTRN indeed resulted in
wider separations of the survival curves and a more significant
p value (Figures 6B and 6C). Such superior classification potential
of the MeTRN regulators to that of the dynamic CpG sites was
observed in many of the other 20 cancer types with some excep-
tions, in which the different patient classification strategies re-
sulted in similar levels of prognostic differences (Figures S5
and S6). In addition, we reconstructed TF-target transcriptional
regulation networks with ARACNe (Basso et al., 2005; Margolin
et al., 2006), which does not take into account the DNA methyl-
ation profiles. We then followed the same pipeline and selected
the TFs with strong prediction powers for target gene expression
profiles. Similarly, we tested the performances of these TFs
alone in defining prognostic cancer sub-
types. As shown in Figures 6D and S7,
in KIRP and many other cancer types,
this strategy of patient classification was
also outperformed by the TFs and CpGs
in combination from the MeTRNs as clas-
sifiers. Therefore, by coupling the regula-
tory TFs and the modulating CpG sites,
the MeTRNs provide an alternative strat-
egy for classifying the cancer subtypes, which is indeed associ-
ated with the prognoses, although further investigations would
be needed to fully elucidate the physiological relevance and
the molecular basis of these observations.
Finally, to further investigate the biological relevance of the
cancer classifications based on the MeTRN regulators, we
took the two subtypes of KIRP, k1 and k3, as examples and
looked into their difference. These two subtypes, classified by
the TFs and CpG sites shown in Figure 6A, have the most distinct
survival curves among all the KIRP subtypes (Figure 6B). The
5-year survival rate for patients in k1 was about 95%, while
most of the patients in the aggressive subtype k3 died within
2 to 3 years. Therefore, it is of great value to elucidate which
and how transcriptional regulation circuits were altered in sub-
type k3 with the poorest clinical outcome. Differential methyl-
ation analysis of the CpG sites used for the clustering analysis
in Figure 6A showed that a major proportion of the strong and
significant methylation variations were upregulations in k3
versus k1 (Figure 7A). On the other hand, the gene differential
expression analysis of the TFs used for the subtype classifica-
tions showed the opposite trend, i.e., a major proportion of the
significant TF expression variations were downregulations in k3
versus k1 (Figure 7B). The top 781 upregulated CpG sites and
the top 17 downregulated TFs in k3 were selected (Figures 7A
Figure 7. Comparison between Two Prognostically Different KIRP Subtypes Classified by MeTRN Regulators
(A and B) Differential methylation (A) and differential expression (B) analyses of the MeTRN regulators (CpG sites and TFs, respectively) that were used for the
clustering analysis in Figure 6. 781 CpG sites and 17 TFs were deemed strongly up- and downregulated, respectively, in k3 versus k1.
(C) 598 target genes of the CpG sites and TFs identified in (A) and (B) were found in the KIRP MeTRN. GSEA analysis was performed to show enrichment of the 598
target genes in the downregulated genes by comparing k3 versus k1.
(D) Gene functional enrichment analysis of the 598 target genes. The enrichment p value cut-off was set at 0.01.
and 7B), and they were involved in the regulatory circuits of 598
target genes in the KIRP MeTRN (Figure 7C). A gene set enrich-
ment analysis (GSEA) showed that these 598 genes were signif-
icantly enriched in the downregulated genes in k3 versus k1, with
the differential expression profile of all the genes as background
(Figure 7C; Data S5). In other words, downregulation of the TFs
and upregulation of the CpG sites were associated with repression
of their target genes mapped by the MeTRN. Furthermore, these
598 genes were significantly enriched for various key functions or
processes related to cancer, such as cell adhesion, differentia-
tion, vasculogenesis, Wnt signaling, BRD4 complex, and cell
proliferation (Figure 7D).
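The idea behind the enrichment analysis can be illustrated with a simplified running-sum statistic. The sketch below implements only the unweighted, Kolmogorov-Smirnov-like enrichment score; GSEA as published (Subramanian et al., 2005) additionally weights hits by the ranking metric and assesses significance by permutation, neither of which is shown here. The function name `enrichment_score` and the ten-gene toy ranking are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    ranked_genes: genes ordered from most upregulated to most
    downregulated (for example, k3 versus k1).
    Returns the running-sum value with the largest absolute deviation
    from zero; a negative value means the set sits near the bottom.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover part, not all or none, of the list")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme

# Hypothetical ranking of ten genes, g1 most upregulated.
ranked = [f"g{i}" for i in range(1, 11)]
es_down = enrichment_score(ranked, {"g8", "g9", "g10"})
es_up = enrichment_score(ranked, {"g1", "g2", "g3"})
```

In this toy ranking, a gene set concentrated among the downregulated genes yields a strongly negative score, which is the direction reported for the 598 target genes.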
In summary, the cancer-specific MeTRNs not only provided
classifiers of cancer subtypes that are physiologically and biologically
relevant, but also mapped the down-stream gene
expression changes that resulted from perturbations of
the transcriptional regulation circuitry at the TFs or CpG sites.
Therefore, by coupling the TF-target transcriptional regulatory
circuits with specific promoter CpG sites, the MeTRNs are a
unique resource for understanding the transcriptional regulation
machinery underlying cancer and dissecting the driving forces of
gene expression dysregulation in tumors.
DISCUSSION
Methylation levels of the promoter CpG sites have been recog-
nized as potent regulators of gene expression (Robertson,
2005; Smith and Meissner, 2013), and dysregulation of the pro-
moter CpG methylation has been connected to a wide range of
physiological consequences, including cancer (Chatterjee and
Vinson, 2012; Jones, 2012; Liu et al., 2018). Most previous
studies were focused on specific CpG sites or regions and
have derived detailed epigenetic gene regulation machineries.
A comprehensive survey of the potential functions of the cancer
context-specific DNA methylome, which should have substantial
value for understanding the multi-level gene expression regula-
tion machinery in cancers, is still largely unavailable at the
single-CpG site level.
Our survey of the CpG-gene correlations revealed significant
numbers of CpG sites that are correlated, but not at a very
strong level, with the gene expression. Many of the CpG-
gene pairs showed very interesting patterns as shown in Fig-
ures 1B and 1C. We reason that for a given gene whose TF
(or TFs) is redundantly available, its expression level would
mainly depend on the accessibility of the TF protein to the pro-
moter region. In such cases, if the TF binding depends on
methylation (or demethylation) of a CpG site, then the methyl-
ation level of this CpG site would serve as a rate-limiting factor
of the transcription activation or repression, which would lead
to a strong correlation between DNA methylation and gene
expression. On the other hand, in a similar scenario but with
a limited supply of the TF protein, given a certain fixed level
of DNA methylation, gene transcription would be sensitive to
abundance of the TF. However, eventually the DNA methylation
level determines the upper or lower limit that the gene expres-
sion can reach by controlling the maximum accessibility to the
CpG site by the TF. This explains the ‘‘triangle-shaped’’ corre-
lation patterns between the CpG site methylation and gene
expression levels exemplified in Figures 1B and 1C. This
prompted us to systematically identify such types of methyl-
ation level-dependent transcriptional regulation circuits on a
genome-wide scale in different cancers.
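The reasoning behind the "triangle-shaped" pattern can be reproduced with a toy simulation. The model below is an assumption made purely for illustration and is not the authors' model: expression is taken as the smaller of the TF abundance and an accessibility ceiling equal to one minus the methylation level, which corresponds to the case where methylation blocks TF binding. The names `simulate` and `pearson` are hypothetical.

```python
import math
import random

def simulate(n=2000, seed=0):
    """Draw (methylation, expression) pairs under a ceiling model."""
    rng = random.Random(seed)
    meth, expr = [], []
    for _ in range(n):
        m = rng.random()       # promoter CpG methylation level in [0, 1]
        tf = rng.random()      # TF abundance, arbitrary units in [0, 1]
        ceiling = 1.0 - m      # assumed: methylation caps promoter accessibility
        meth.append(m)
        expr.append(min(tf, ceiling))
    return meth, expr

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

meth, expr = simulate()
r = pearson(meth, expr)
```

Every simulated point lies on or below the line expression = 1 - methylation, so a scatterplot is triangular, and the overall correlation is negative but only moderate because expression still varies with TF abundance underneath the ceiling.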
This task was done by reanalyzing the multi-omics data in
TCGA, which offers unique advantages for integrative data anal-
ysis, including large sample sizes, broad coverage of cancer
types, parallel multi-omics datasets, consistent procedures for
high-throughput profiling, and the availability of the patient sur-
vival data. Most importantly, the inter-tumoral heterogeneity
within each type of cancer produced largely dynamic datasets,
which made it possible to conduct sophisticatedly designed sta-
tistical analyses for mining of data patterns indicating potential
regulation machineries. Specifically, we designed an analysis
pipeline based on the concept of cMI to quantify the dependency
of each possible TF-gene association on the methylation level of
each promoter CpG site. Estimations of MI and cMI require
dynamic datasets with large sample sizes, and that is one of
the reasons why the cancer type-specific multi-omics data in
TCGA is best suited for our analysis pipeline. Outside of TCGA,
datasets from large cancer cohorts that meet the requirements
of our analysis method are extremely rare. The criteria include
large sample size (>150), independent tumor samples, matched
data of RNA expression, DNA methylation, and DNA copy
numbers from each tumor.
In TCGA, the dataset of BRCA has the largest number of
normal tissue samples, among which only 85 were usable for
our pipeline as these samples have matched data of DNA
methylation, mRNA expression, and DNA copy numbers. Unlike
the tumor tissues, the normal tissues are much less heteroge-
neous. Dynamic ranges of the multi-omics profiles of normal tis-
sues are much smaller. Therefore, due to the small sample size
and limited data variation, inference of mutual information and
reconstruction of MeTRN are less reliable. Indeed, the MeTRN
from normal tissue data of BRCA and the MeTRN from tumor
data are very different (data not shown). However, as discussed
above, although such a difference can be attributed to cancer
specificity of MeTRNs, quality of the normal tissue data (smaller
sample size and limited dynamic range) could be another main
reason for the difference between cancer and normal MeTRNs.
It is worth noting that the multi-omics profiles used in our study
were obtained by TCGA from bulk tumor samples, which are het-
erogeneous mixtures of tumor cells and others, such as stromal
cells and lymphocytes. Previous studies have comprehensively
evaluated purity of the tumor tissue samples profiled by TCGA
(Aran et al., 2015; Zheng et al., 2017). We took the data from
Aran et al. (2015), in which the per-sample purities were inferred
with 4 different methods and the consensus was taken. The majority
of the tumors used in our study have purities of around 0.8,
and a small proportion of the tumor samples fall below 0.6 (for
supporting information, see Figure 5B in Liu et al., 2018). Some
cancer types have slightly lower tumor purities in general, but still
most are higher than 0.6. Therefore, although tumor heterogene-
ity is an obvious issue here, the tumor tissue samples in TCGA
were still dominated by tumor cells of particular types of cancer.
This is the basis for most of the cancer cohort studies based on
TCGA data or other datasets obtained from tumor tissues. As far
as we know, there is not a consensus strategy to fully address
the issue of tumor heterogeneity for large-scale cohort studies,
and therefore, we followed the common practice and relied on
the assumption that the molecular profiles are largely reflective
of the dominating tumor cells.
In the present study, a DNA MeTRN was assembled for each
of the 21 major cancer types in TCGA. By coupling the specific
CpG sites, TFs, and the target gene, the MeTRN is a de novo
reconstruction of the promoter DNA methylation level-depen-
dent transcriptional regulation circuitry in each cancer type,
which we believe provides the basis for further context-specific
studies of the functional DNA methylation at the resolution of
single-CpG sites and particular TFs for specific genes. On the other
hand, the purpose of our study was to capture the promoter CpG
sites that are potentially involved in TF-target regulations, and
our methodology does not presume that the CpG sites being
tested here are the only determining factors. In other words,
our focus on the promoter CpG sites does not preclude other
distal machineries, such as the enhancer CpG sites (or sites in
other potential regulatory regions such as CGI, shores, and
shelves). For these potentially distal regulatory CpG sites, given
their increased distances to TSS, it becomes difficult to precisely
identify the ‘‘target’’ genes that are being regulated by these
sites. In addition, the coverage of these distal CpG sites in the
current data is poor compared to the promoter sites. However,
the current analysis can certainly be repeated for the enhancer
CpG sites if we have precise definitions of the context-specific
enhancer regions for each gene and if the similar large-scale
enhancer CpG site methylation datasets are available.
Our computational analysis pipeline was based on the
concept of cMI, which quantifies the gain of mutual information
between a TF and a target gene due to introduction of the
methylation level of a CpG site in the promoter region. Essen-
tially, the computational analysis itself captured associations
rather than causational regulations. However, our analysis was
limited to each particular gene coupled with the CpG sites in
the promoter region. Therefore, an assumption here is that the
association, if any, between expression of a gene and methyl-
ation level of a promoter CpG site, should be due to a directional
regulation of the gene expression by the promoter CpG site,
which affects accessibility of the promoter to a TF. Although
there could be some exceptions, we believe that such an assumption
should be valid in most cases, and many previous studies
about DNA methylation have been based on the same assump-
tion as well. Similarly, a strong TF-target gene association does
not necessarily indicate a directional regulation circuit from the
TF to the target gene. Many alternative regulation machineries
could result in such strong association, for example the TF and
the target gene both being regulated by another upstream regu-
lator. Therefore, we kept a TF-target gene pair only if the TF-gene
expression association depends on the methylation level of a
promoter CpG site of the target gene. Such a criterion would
exclude many alternative machineries other than a TF-to-target
gene circuit, because these potential alternative machineries
would not be directly dependent on the promoter CpG site.
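One possible reading of this cMI-based criterion can be sketched with a plug-in estimator on discretized data. This is only an illustration: the estimator actually used in the pipeline is not described in this section, real expression and methylation values are continuous, and the function names and the eight-sample toy dataset below are hypothetical. Here, MI is I(TF; target), cMI is I(TF; target | CpG), and the "gain" is taken as their difference.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        c / n * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in bits for discrete samples."""
    n = len(xs)
    pxyz = Counter(zip(xs, ys, zs))
    pxz, pyz, pz = Counter(zip(xs, zs)), Counter(zip(ys, zs)), Counter(zs)
    return sum(
        c / n * math.log2(c * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
        for (x, y, z), c in pxyz.items()
    )

# Toy data: the target follows the TF only when the CpG site is
# unmethylated (meth == 0); when methylated, they are unrelated.
tf     = [0, 0, 1, 1, 0, 0, 1, 1]
target = [0, 0, 1, 1, 0, 1, 0, 1]
meth   = [0, 0, 0, 0, 1, 1, 1, 1]

mi = mutual_information(tf, target)
cmi = conditional_mutual_information(tf, target, meth)
gain = cmi - mi
```

In this toy example the TF-target association is much clearer once the methylation state is taken into account, which is the kind of dependency the criterion above is meant to capture.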
To showcase the applications of these informative cancer-
specific MeTRN networks in gaining insights into the cancer
type-dependent gene regulatory machinery, we quantitatively
assessed the contributions of the regulatory circuits in the
MeTRNs in gene expression regulation. Our results suggested
that the epigenetically modulated transcriptional regulation
machinery encoded in the MeTRNs is independent from the
DNA copy numbers and is at least equally, if not more, potent
than the DNA copy numbers for determining the gene expres-
sion dynamics in tumors (Figures 3 and S3). Many genes with
no CNV or that are not responsive to their CNVs, including
many known cancer genes, are being controlled via such
epigenetically encoded transcriptional regulation. Cancer-
related biological processes are enriched in these genes,
and subgroups of patients with different methylome patterns
at the CpG sites selected by the MeTRNs showed largely
different prognoses, which suggests potential physiological
consequences of perturbing the transcriptional regulation
scheme at the DNA methylation level. Therefore, MeTRN is a
high-resolution functional survey of the DNA methylation-
dependent transcriptional regulations in cancers. By recapitu-
lating the epigenetic scheme involved in the cancer context-
dependent transcriptional regulation circuitry, our results
serve as a framework and resource for dissecting the driving
force of gene expression regulation related to tumorigenesis
and cancer development.
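The comparison of prediction powers summarized above rests on the three linear models named in the Figure 4 legend (Target~TF+CpG, Target~CNV, and Target~TF+CpG+CNV). The sketch below shows how R2 could be computed for each model with ordinary least squares. It uses only the Python standard library to stay self-contained; the helper names `solve` and `r_squared` and all data values are hypothetical, and the toy target is constructed without noise, so the numbers carry no biological meaning.

```python
def solve(a, b):
    """Solve a square linear system by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(predictors, y):
    """R2 of a least-squares fit of y on the predictors plus an intercept."""
    n = len(y)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-tumor values for one target gene.
tf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # TF expression
cpg = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2]      # CpG methylation level
cnv = [0.0, 1.0, 0.0, -1.0, 1.0, 0.0]     # copy number change
target = [1.0 + 2.0 * t - 3.0 * m + 0.5 * c for t, m, c in zip(tf, cpg, cnv)]

r2_regulators = r_squared([tf, cpg], target)      # Target~TF+CpG
r2_cnv = r_squared([cnv], target)                 # Target~CNV
r2_combined = r_squared([tf, cpg, cnv], target)   # Target~TF+CpG+CNV
```

Because the models are nested, the combined model can never have a lower R2 than either partial model on the same data; comparing the three values per gene is what a plot like Figure 4A summarizes.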
STAR★METHODS
Detailed methods are provided in the online version of this paper
and include the following:
• KEY RESOURCES TABLE
• CONTACT FOR REAGENT AND RESOURCE SHARING
• METHOD DETAILS
  ○ Data collection and pre-processing
  ○ ChIP-seq data collection and analysis
  ○ Genome-wide identifications of the TF-target regulatory circuits modulated by CpG site methylation
  ○ Implementation of the pipeline
  ○ Functional enrichment analysis
  ○ Patient classification and survival analysis
  ○ Differential expression and differential methylation analysis
• QUANTIFICATION AND STATISTICAL ANALYSIS
  ○ Software and statistical analysis
  ○ Strategy for randomization
  ○ Data exclusion
• DATA AND SOFTWARE AVAILABILITY
SUPPLEMENTAL INFORMATION
Supplemental Information can be found with this article online at
https://doi.org/10.1016/j.celrep.2019.02.084.
ACKNOWLEDGMENTS
The authors wish to acknowledge support from the Computing Platform and
the Gene Sequencing Platform of National Protein Science Facility (Beijing).
This work was supported by the National Key Research and Development Program,
Precision Medicine Project (2016YFC0906001 to X.Y.), the National Natural
Science Foundation of China (31671381, 91540109, and 81472855 to
X.Y.), the Tsinghua University Initiative Scientific Research Program
(2014z21046 to X.Y.), the Tsinghua–Peking Joint Center for Life Sciences,
and the 1000 Talent Program (Youth Category).
AUTHOR CONTRIBUTIONS
Yu Liu, Yang Liu, and X.Y. conceived and designed the study. Yu Liu developed
the algorithmic analysis pipeline for reconstruction, validation, and analysis
of the MeTRNs, with help from Z.X. and J.W. Yang Liu and R.H. performed
the patient classification analysis and prepared survival curves, with help from
W.S., Y.Y., and S.D. X.Y. supervised the whole project. X.Y. wrote the manu-
script with input from Yu Liu, Yang Liu, and R.H.
DECLARATION OF INTERESTS
The authors declare no competing interests.
Received: March 19, 2018
Revised: October 1, 2018
Accepted: February 21, 2019
Published: March 19, 2019
REFERENCES
Almeida, L., Silva, J.A., Andrade, V.M., Machado, P., Jamieson, S.E.,
Carvalho, E.M., Blackwell, J.M., and Castellucci, L.C. (2017). Analysis of
expression of FLI1 and MMP1 in American cutaneous leishmaniasis caused by
Leishmania braziliensis infection. Infect. Genet. Evol. 49, 212–220.
Aran, D., Sirota, M., and Butte, A.J. (2015). Systematic pan-cancer analysis of
tumour purity. Nat. Commun. 6, 8971.
Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla-Favera, R., and Cal-
ifano, A. (2005). Reverse engineering of regulatory networks in human B cells.
Nat. Genet. 37, 382–390.
Blattler, A., and Farnham, P.J. (2013). Cross-talk between site-specific tran-
scription factors and DNA methylation states. J. Biol. Chem. 288, 34287–
34294.
Budden, D.M., Hurley, D.G., Cursons, J., Markham, J.F., Davis, M.J., and
Crampin, E.J. (2014). Predicting expression: the complementary power of
histone modification and transcription factor binding data. Epigenetics Chromatin
7, 36.
Cancer Genome Atlas Network (2012). Comprehensive molecular portraits of
human breast tumours. Nature 490, 61–70.
Cancer Genome Atlas Network (2015). Comprehensive genomic characteriza-
tion of head and neck squamous cell carcinomas. Nature 517, 576–582.
Cancer Genome Atlas Research Network (2011). Integrated genomic analyses
of ovarian carcinoma. Nature 474, 609–615.
Cancer Genome Atlas Research Network (2012). Comprehensive genomic
characterization of squamous cell lung cancers. Nature 489, 519–525.
Cancer Genome Atlas Research Network (2014). Comprehensive molecular
profiling of lung adenocarcinoma. Nature 511, 543–550.
Chatterjee, R., and Vinson, C. (2012). CpG methylation recruits sequence
specific transcription factors essential for tissue specific gene expression.
Biochim. Biophys. Acta 1819, 763–770.
Domcke, S., Bardet, A.F., Adrian Ginno, P., Hartl, D., Burger, L., and
Schübeler, D. (2015). Competition between DNA methylation and transcription
factors determines binding of NRF1. Nature 528, 575–579.
Frenzel, S., and Pompe, B. (2007). Partial mutual information for coupling anal-
ysis of multivariate time series. Phys. Rev. Lett. 99, 204101.
Hu, Z., Killion, P.J., and Iyer, V.R. (2007). Genetic reconstruction of a functional
transcriptional regulatory network. Nat. Genet. 39, 683–687.
Hu, S., Wan, J., Su, Y., Song, Q., Zeng, Y., Nguyen, H.N., Shin, J., Cox, E., Rho,
H.S., Woodard, C., et al. (2013). DNA methylation presents distinct binding
sites for human transcription factors. eLife 2, e00726.
Hunter, J.D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Sci-
ence & Engineering 9, 90–95.
Jiang, P., Freedman, M.L., Liu, J.S., and Liu, X.S. (2015). Inference of
transcriptional regulation in cancers. Proc. Natl. Acad. Sci. USA 112, 7731–7736.
Johnson, D.S., Mortazavi, A., Myers, R.M., and Wold, B. (2007). Genome-wide
mapping of in vivo protein-DNA interactions. Science 316, 1497–1502.
Jones, P.A. (2012). Functions of DNA methylation: islands, start sites, gene
bodies and beyond. Nat. Rev. Genet. 13, 484–492.
Kraskov, A., Stogbauer, H., and Grassberger, P. (2004). Estimating mutual in-
formation. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 69, 066138.
Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi,
A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M.J., et al.; Roadmap Epige-
nomics Consortium (2015). Integrative analysis of 111 reference human epige-
nomes. Nature 518, 317–330.
Li, Y., Liang, M., and Zhang, Z. (2014). Regression analysis of combined gene
expression regulation in acute myeloid leukemia. PLoS Comput. Biol. 10,
e1003908.
Lin, P.C., Lin, J.K., Lin, C.H., Lin, H.H., Yang, S.H., Jiang, J.K., Chen, W.S.,
Chou, C.C., Tsai, S.F., and Chang, S.C. (2015). Clinical Relevance of Plasma
DNA Methylation in Colorectal Cancer Patients Identified by Using a
Genome-Wide High-Resolution Array. Ann. Surg. Oncol. 22 (Suppl 3),
S1419–S1427.
Liu, Y., Toh, H., Sasaki, H., Zhang, X., and Cheng, X. (2012). An atomic model
of Zfp57 recognition of CpG methylation within a specific DNA sequence.
Genes Dev. 26, 2374–2379.
Liu, Y., Huang, R., Liu, Y., Song, W., Wang, Y., Yang, Y., Dong, S., and Yang, X.
(2018). Insights from multidimensional analyses of the pan-cancer DNA meth-
ylome heterogeneity and the uncanonical CpG-gene associations. Int. J. Can-
cer 143, 2814–2827.
Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho,
D.M., Allison, K.R., Kellis, M., Collins, J.J., and Stolovitzky, G.; DREAM5 Con-
sortium (2012). Wisdom of crowds for robust gene network inference. Nat.
Methods 9, 796–804.
Margolin, A.A., Wang, K., Lim, W.K., Kustagi, M., Nemenman, I., and Califano,
A. (2006). Reverse engineering cellular networks. Nat. Protoc. 1, 662–671.
McKinney, W. (2010). Data Structures for Statistical Computing in Python. In
Proceedings of the 9th Python Science Conference, pp. 51–56.
Mermel, C.H., Schumacher, S.E., Hill, B., Meyerson, M.L., Beroukhim, R., and
Getz, G. (2011). GISTIC2.0 facilitates sensitive and confident localization of the
targets of focal somatic copy-number alteration in human cancers. Genome
Biol. 12, R41.
Millman, K.J., and Aivazis, M. (2011). Python for Scientists and Engineers.
Computing in Science & Engineering 13, 9–12.
Misra, S., Pamnany, K., and Aluru, S. (2015). Parallel Mutual Information Based
Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.
IEEE/ACM Trans. Comput. Biol. Bioinformatics 12, 1008–1020.
Neph, S., Stergachis, A.B., Reynolds, A., Sandstrom, R., Borenstein, E., and
Stamatoyannopoulos, J.A. (2012a). Circuitry and dynamics of human tran-
scription factor regulatory networks. Cell 150, 1274–1286.
Neph, S., Vierstra, J., Stergachis, A.B., Reynolds, A.P., Haugen, E., Vernot, B.,
Thurman, R.E., John, S., Sandstrom, R., Johnson, A.K., et al. (2012b). An
expansive human regulatory lexicon encoded in transcription factor footprints.
Nature 489, 83–90.
Odersky, M., Altherr, P., Cremet, V., Dragos, I., Dubochet, G., Emir, B.,
McDirmid, S., Micheloud, S., Mihaylov, N., Schinz, M., et al. (2004). An Over-
view of the Scala Programming Language (Ecole Polytechnique Federale de
Lausanne).
Oliphant, T.E. (2015). Guide to NumPy, 2nd Edition (CreateSpace Independent
Publishing Platform).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn:
Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
R Core Team (2008). R: A language and environment for statistical computing.
R Foundation for Statistical Computing (Austria: Vienna).
https://www.R-project.org.
Ranzani, V., Rossetti, G., Panzeri, I., Arrigoni, A., Bonnal, R.J., Curti, S.,
Gruarin, P., Provasi, E., Sugliano, E., Marconi, M., et al. (2015). The long inter-
genic noncoding RNA landscape of human lymphocytes highlights the regula-
tion of T cell differentiation by linc-MAF-4. Nat. Immunol. 16, 318–325.
Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G.,
Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., and Sabeti, P.C. (2011). De-
tecting novel associations in large data sets. Science 334, 1518–1524.
Robertson, K.D. (2005). DNA methylation and human disease. Nat. Rev.
Genet. 6, 597–610.
Sathyamurthy, A., Johnson, K.R., Matson, K.J.E., Dobrott, C.I., Li, L., Ryba,
A.R., Bergman, T.B., Kelly, M.C., Kelley, M.W., and Levine, A.J. (2018).
Massively Parallel Single Nucleus Transcriptional Profiling Defines Spinal
Cord Neurons and Their Activity during Behavior. Cell Rep. 22, 2216–2225.
Schultz, M.D., He, Y., Whitaker, J.W., Hariharan, M., Mukamel, E.A., Leung, D.,
Rajagopal, N., Nery, J.R., Urich, M.A., Chen, H., et al. (2015). Human body epi-
genome maps reveal noncanonical DNA methylation variation. Nature 523,
212–216.
Sepulveda, J.L., Gutierrez-Pajares, J.L., Luna, A., Yao, Y., Tobias, J.W.,
Thomas, S., Woo, Y., Giorgi, F., Komissarova, E.V., Califano, A., et al.
(2016). High-definition CpG methylation of novel genes in gastric carcinogen-
esis identified by next-generation sequencing. Mod. Pathol. 29, 182–193.
Smith, Z.D., and Meissner, A. (2013). DNA methylation: roles in mammalian
development. Nat. Rev. Genet. 14, 204–220.
Sopko, R., Huang, D., Preston, N., Chua, G., Papp, B., Kafadar, K., Snyder, M.,
Oliver, S.G., Cyert, M., Hughes, T.R., et al. (2006). Mapping pathways and phe-
notypes by systematic gene overexpression. Mol. Cell 21, 319–330.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gil-
lette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al.
(2005). Gene set enrichment analysis: A knowledge-based approach for inter-
preting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102,
15545–15550.
Therneau, T.M., and Grambsch, P.M. (2000). Modeling Survival Data: Extend-
ing the Cox Model (New York: Springer).
Tripathi, S., Pohl, M.O., Zhou, Y., Rodriguez-Frandsen, A., Wang, G., Stein,
D.A., Moulton, H.M., DeJesus, P., Che, J., Mulder, L.C.F., et al. (2015).
Meta- and Orthogonal Integration of Influenza "OMICs" Data Defines a Role
for UBR4 in Virus Budding. Cell Host Microbe 18, 723–735.
van Rossum, G. (1995). Python Tutorial (the Netherlands: Centre for Mathe-
matics and Computer Science, Amsterdam).
Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A., and Luscombe, N.M.
(2009). A census of human transcription factors: function, expression and evo-
lution. Nat. Rev. Genet. 10, 252–263.
Varley, K.E., Gertz, J., Bowling, K.M., Parker, S.L., Reddy, T.E., Pauli-Behn, F.,
Cross, M.K., Williams, B.A., Stamatoyannopoulos, J.A., Crawford, G.E., et al.
(2013). Dynamic DNA methylation across diverse human cell lines and tissues.
Genome Res. 23, 555–567.
Wang, Y., Fan, P.S., and Kahaleh, B. (2006). Association between enhanced
type I collagen expression and epigenetic repression of the FLI1 gene in
scleroderma fibroblasts. Arthritis Rheum. 54, 2271–2279.
Wang, K., Alvarez, M.J., Bisikirska, B.C., Linding, R., Basso, K., Dalla Favera,
R., and Califano, A. (2009a). Dissecting the interface between signaling and
transcriptional regulation in human B cells. Pac. Symp. Biocomput. 14, 264–275.
Wang, K., Saito, M., Bisikirska, B.C., Alvarez, M.J., Lim, W.K., Rajbhandari, P.,
Shen, Q., Nemenman, I., Basso, K., Margolin, A.A., et al. (2009b). Genome-
wide identification of post-translational modulators of transcription factor ac-
tivity in human B cells. Nat. Biotechnol. 27, 829–839.
Wang, S., Sun, H., Ma, J., Zang, C., Wang, C., Wang, J., Tang, Q., Meyer, C.A.,
Zhang, Y., and Liu, X.S. (2013). Target analysis by integration of transcriptome
and ChIP-seq data with BETA. Nat. Protoc. 8, 2502–2515.
Yang, Y., Fear, J., Hu, J., Haecker, I., Zhou, L., Renne, R., Bloom, D., and McIntyre,
L.M. (2014). Leveraging biological replicates to improve analysis in ChIP-seq
experiments. Comput. Struct. Biotechnol. J. 9, e201401002.
Zheng, X., Zhang, N., Wu, H.J., and Wu, H. (2017). Estimating and accounting
for tumor purity in the analysis of DNA methylation data from cancer studies.
Genome Biol. 18, 17.
Zhu, H., Wang, G., and Qian, J. (2016). Transcription factors as readers and ef-
fectors of DNA methylation. Nat. Rev. Genet. 17, 551–565.
STAR★METHODS
KEY RESOURCES TABLE
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited Data
DNA methylation data (level 3) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
Gene expression data (level 3) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
DNA copy number data (level 3) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
Somatic mutation data (level 2) | TCGA | https://tcga-data.nci.nih.gov/docs/publications/tcga
Transcription Start Sites genomic annotation (hg19) | UCSC Table Browser | https://genome.ucsc.edu/cgi-bin/hgTables
ChIP-seq peak data | ENCODE Consortium | https://www.encodeproject.org/
Software and Algorithms
GISTIC 2 | Mermel et al., 2011 | https://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=216&p=t
The estimation of the mutual information | Kraskov et al., 2004 | N/A
Conditional mutual information | Frenzel and Pompe, 2007 | N/A
MINDy algorithm | Wang et al., 2009b | N/A
Metascape | Tripathi et al., 2015 | http://metascape.org
R package 'survival' | Therneau and Grambsch, 2000 | https://cran.r-project.org/web/packages/survival/index.html
MATLAB | The MathWorks, Inc. | https://www.mathworks.com/products/matlab.html
Python | van Rossum, 1995 | https://www.python.org/
Numpy | Oliphant, 2015 | http://www.numpy.org/
Pandas | McKinney, 2010 | https://pandas.pydata.org/
Matplotlib | Hunter, 2007 | https://matplotlib.org/
Scipy | Millman and Aivazis, 2011 | https://www.scipy.org/
Java 8 SE | Oracle Corporation | https://www.oracle.com/java/technologies/java-se.html
Scala | Odersky et al., 2004 | https://www.scala-lang.org/
R | R Core Team, 2008 | http://www.R-project.org
F# | F# Software Foundation | https://fsharp.org/
Node.js | Node.js Foundation | https://nodejs.org/en/
Scikit-learn | Pedregosa et al., 2011 | http://scikit-learn.org/stable/
GSEA software | Subramanian et al., 2005 | https://www.broadinstitute.org/gsea/
Other
Code used for this publication | This paper | Supplementary file 8
TF list | This paper | Supp file
Cancer-related gene list | NCBI (Entrez) Gene Database | https://www.ncbi.nlm.nih.gov/gene
CONTACT FOR REAGENT AND RESOURCE SHARING
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Xuerui
Yang ([email protected]).
METHOD DETAILS
Data collection and pre-processing
Somatic mutation data of tumors (level 2), and gene expression, DNA copy number, and DNA methylation data (level 3) of tumor and
adjacent normal samples from 21 cancer types were downloaded from TCGA (Table S1). Each cancer type has at least 150 independent
tumor samples for which all 3 types of data (gene expression, DNA copy number, and CpG methylation) are available.
Specifically, RNA-seq V2 data (level 3) were used for the gene expression profiles of all 21 cancer types. These are gene-level read
counts that had already been processed and normalized by TCGA across all the samples of each cancer type with the upper-quartile
normalization method.
For the DNA copy number data, we used the Affymetrix Genome-Wide Human SNP Array 6.0 segmentation data provided by
TCGA (level 3). In detail, the "nocnv.seg" files for each sample were collected from TCGA to capture the somatic CNV. GISTIC 2
(Mermel et al., 2011) was then used with default parameters and the "-savegene" option to recover the copy numbers at the
gene level. From the results of GISTIC 2, we used the raw continuous CNV value of each gene for our downstream analyses.
Throughout the present study, only the promoter CpG sites within ±2.5 kb of the TSSs were considered. Genomic annotations
of the TSSs were obtained from the UCSC Table Browser (hg19). For genes with multiple TSSs, all the CpG sites falling within
±2.5 kb of any of the TSSs were considered promoter CpG sites. A small number of CpG sites were close to the TSSs
(within ±2.5 kb) of more than one gene and were therefore tested for each of these genes. All the DNA methylation
data (beta values) used in the present study were generated by TCGA on the Infinium HumanMethylation450 BeadChip
platform, which covered 193,969 CpG sites located in the promoter regions of 23,837 genes.
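The promoter assignment rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary-based input layout, and the toy coordinates are hypothetical and are not taken from the published scripts.

```python
from collections import defaultdict

WINDOW = 2500  # +/- 2.5 kb around each TSS, as stated in the text


def assign_promoter_cpgs(tss_by_gene, cpg_positions, window=WINDOW):
    """Map each gene to the CpG sites lying within `window` bp of any of its TSSs.

    tss_by_gene:   {gene: [(chrom, tss_position), ...]}  (hypothetical layout)
    cpg_positions: {cpg_id: (chrom, position)}           (hypothetical layout)
    """
    gene_to_cpgs = defaultdict(set)
    for gene, tss_list in tss_by_gene.items():
        for chrom, tss in tss_list:
            for cpg_id, (cpg_chrom, pos) in cpg_positions.items():
                if cpg_chrom == chrom and abs(pos - tss) <= window:
                    gene_to_cpgs[gene].add(cpg_id)
    return {gene: sorted(cpgs) for gene, cpgs in gene_to_cpgs.items()}


# Toy coordinates, invented for illustration only.
tss_by_gene = {
    "GENE_A": [("chr1", 10_000), ("chr1", 20_000)],  # two alternative TSSs
    "GENE_B": [("chr1", 13_000)],
}
cpg_positions = {
    "cg01": ("chr1", 11_500),  # near the first TSS of GENE_A and the TSS of GENE_B
    "cg02": ("chr1", 21_000),  # near the second TSS of GENE_A
    "cg03": ("chr1", 30_000),  # not near any TSS
    "cg04": ("chr2", 10_000),  # matching position but different chromosome
}
promoter_cpgs = assign_promoter_cpgs(tss_by_gene, cpg_positions)
```

In this toy case cg01 is assigned to both genes, which mirrors the statement that a CpG site close to the TSSs of more than one gene is tested for each of them.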
For each cancer type, genes with no or low expression in more than half of the tumor samples (median of normalized read
counts < 10) were discarded. The CpG sites with no methylation readout (beta value) in more than a quarter of the tumor samples
were removed. If a tumor sample bears somatic mutation(s) at a CpG site, that sample was removed from the methylation profile
of this particular CpG site for the whole study. Such cases are extremely rare, however (within each cancer type,
fewer than 0.5% of the promoter CpG sites are mutated in at least one tumor sample). Statistics of the genes and CpG sites that
passed the filters are provided in Liu et al. (2018).
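As a minimal sketch of the two filters just described, assuming a genes-by-samples matrix of normalized read counts and a CpG-by-samples matrix of beta values with NaN marking a missing readout (both layouts and the function names are assumptions made for this example):

```python
import numpy as np


def expressed_gene_mask(expr, min_median=10.0):
    """True for genes whose median normalized read count is at least `min_median`."""
    return np.median(expr, axis=1) >= min_median


def measured_cpg_mask(beta, max_missing=0.25):
    """True for CpG sites missing a beta value in at most `max_missing` of the samples."""
    return np.isnan(beta).mean(axis=1) <= max_missing


# Toy matrices (rows: genes or CpG sites; columns: tumor samples).
expr = np.array([[0, 5, 20, 30],
                 [50, 60, 8, 100],
                 [9, 9, 9, 9]], dtype=float)
beta = np.array([[0.1, np.nan, 0.3, 0.4],
                 [np.nan, np.nan, 0.2, 0.5]])

keep_genes = expressed_gene_mask(expr)  # medians are 12.5, 55, and 9
keep_cpgs = measured_cpg_mask(beta)     # missing fractions are 0.25 and 0.5
```

The per-sample removal of mutated CpG sites could be handled in the same layout by setting the affected entries to NaN before applying the second mask.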
ChIP-seq data collection and analysis
ChIP-seq peak files for 161 different TFs were collected from the ENCODE Consortium (Neph et al., 2012b). We only used data from
cancer cell lines in ENCODE, since the current study focuses on the DNA methylome-mediated regulations in cancers.
Some TFs were profiled by ChIP-seq in multiple cancer cell lines, but mostly in no more than 3. In cases of multiple ChIP-seq files
for one TF in one or more cell types, the files of the same TF were merged by taking the largest signal value of the ChIP-seq peaks
at regions where the peaks overlap.
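The merging rule for multiple peak files of one TF can be illustrated as follows. The sketch assumes half-open, BED-style intervals and assumes that overlapping peaks are replaced by their union carrying the largest signal; the text specifies only the signal rule, so the handling of coordinates is a guess. The function name and toy peaks are hypothetical.

```python
def merge_peaks(peaks):
    """Merge overlapping peaks, keeping the largest signal value.

    peaks: iterable of (chrom, start, end, signal) tuples pooled from all
           files of one TF; intervals are treated as half-open [start, end).
    """
    merged = []
    for chrom, start, end, signal in sorted(peaks):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            prev_chrom, prev_start, prev_end, prev_signal = merged[-1]
            # Overlap: extend the previous interval and keep the larger signal.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end),
                          max(prev_signal, signal))
        else:
            merged.append((chrom, start, end, signal))
    return merged


# Toy peaks from two hypothetical files of the same TF.
file_a = [("chr1", 100, 200, 5.0), ("chr1", 500, 600, 2.0)]
file_b = [("chr1", 150, 250, 8.0), ("chr2", 100, 200, 3.0)]
merged_peaks = merge_peaks(file_a + file_b)
```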
Genome-wide identifications of the TF-target regulatory circuits modulated by CpG site methylation
The overall pipeline for genome-wide identification of the TF-target regulatory circuits that are modulated by CpG site methylation is
illustrated in Figure 1A. The dynamic range of the data is essential for the statistical analysis used in our methodology, so in each
cancer type, we removed the promoter CpG sites with relatively stable methylation levels across tumors (standard deviation of
the beta values < 0.1), as well as poorly (normalized read count < 10) or stably (coefficient of variation of the read counts < 1.5)
expressed genes. Filtering of the promoter CpG sites and the genes has been done and published before (Liu et al., 2018).
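The variability filters can be sketched as below. Two details are assumptions, because the text does not state them: the sample standard deviation (ddof = 1) is used, and "poorly expressed" is read as a median normalized read count below 10, in line with the pre-processing filter. Function names and toy values are hypothetical.

```python
import numpy as np


def variable_cpg_mask(beta, min_sd=0.1):
    """True for CpG sites whose beta values vary enough across tumors."""
    return np.nanstd(beta, axis=1, ddof=1) >= min_sd


def variable_gene_mask(expr, min_cv=1.5, min_count=10.0):
    """True for genes that are both expressed and variable across tumors."""
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1, ddof=1)
    # Coefficient of variation = sd / mean (left at 0 where the mean is 0).
    cv = np.divide(sd, mean, out=np.zeros_like(mean, dtype=float), where=mean > 0)
    expressed = np.median(expr, axis=1) >= min_count
    return expressed & (cv >= min_cv)


beta = np.array([[0.10, 0.50, 0.90, 0.30],    # variable methylation
                 [0.50, 0.52, 0.48, 0.51]])   # stable methylation
expr = np.array([[10, 12, 11, 400],           # expressed and variable
                 [100, 110, 90, 105],         # expressed but stable
                 [0, 0, 1, 500]], dtype=float)  # median below 10

keep_cpgs = variable_cpg_mask(beta)
keep_genes = variable_gene_mask(expr)
```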
To evaluate the overall false discovery rates in the later steps, we generated artificial data entries that are true negatives.
Specifically, we randomly picked 10% of the genes and sample-shuffled their promoter CpG methylation and CNV profiles. Their gene
expression profiles were left intact, without sample permutation. These simulated data entries were spiked into the real datasets of
methylation, CNV, and gene expression, accordingly. Theoretically, any positive hit in the later steps involving these artificial CpG
or CNV entries would be a false discovery.
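A minimal sketch of how such true-negative spike-ins could be generated is given below. For simplicity it assumes one CpG profile per gene and shuffles the methylation and CNV profiles independently of each other; the text does not say whether the same permutation was applied to both, so this is an assumption, as are the function name and the simulated input.

```python
import numpy as np


def make_spike_ins(methylation, cnv, fraction=0.10, seed=0):
    """Return sample-shuffled copies of the methylation and CNV rows of randomly picked genes.

    methylation, cnv: genes x samples arrays (one CpG per gene in this sketch).
    The expression profiles of the picked genes are left untouched by the caller.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = methylation.shape
    n_pick = max(1, int(round(fraction * n_genes)))
    picked = rng.choice(n_genes, size=n_pick, replace=False)
    spiked_meth = np.empty((n_pick, n_samples))
    spiked_cnv = np.empty((n_pick, n_samples))
    for row, gene in enumerate(picked):
        # Uniform random permutation of the sample order breaks any real association.
        spiked_meth[row] = rng.permutation(methylation[gene])
        spiked_cnv[row] = rng.permutation(cnv[gene])
    return picked, spiked_meth, spiked_cnv


rng = np.random.default_rng(42)
methylation = rng.uniform(size=(20, 30))  # simulated beta values
cnv = rng.normal(size=(20, 30))           # simulated copy-number values
picked, spiked_meth, spiked_cnv = make_spike_ins(methylation, cnv)
```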
In the present study, to find specific types of regulation in the usually multiplex and non-linear regulatory relationships, we used
the concepts of mutual information and conditional mutual information (cMI) to assess the associations between multiple factors.
We adopted the Kraskov approach for estimation of the mutual information, which has been widely used in multiple areas
(Frenzel and Pompe, 2007; Misra et al., 2015; Reshef et al., 2011). Because DNA copy number is also a major defining factor of
gene expression, we designed a 2-step procedure (Figure 1A). The first step collects the TF-target gene associations that are
modulated by CpG methylation, DNA copy number, or both, and the second step further filters the TF-target combinations, in a
greatly reduced search space, for the transcriptional regulation events that involve CpG methylation.
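For readers unfamiliar with the Kraskov approach, the following sketch implements the first k-nearest-neighbor estimator of Kraskov et al. (2004) with SciPy. It is not the published MATLAB/C++/Java implementation; the neighbor number k = 3, the function names, and the simulated data are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma


def _count_within(points, radii):
    """For each point, count the other points strictly closer than its radius (max-norm)."""
    tree = cKDTree(points)
    # Shrink each radius by one floating-point step so that points lying
    # exactly at the radius are excluded.
    inner = np.nextafter(radii, 0.0)
    counts = [len(tree.query_ball_point(pt, r, p=np.inf)) - 1  # -1 drops the point itself
              for pt, r in zip(points, inner)]
    return np.array(counts)


def ksg_mutual_information(x, y, k=3):
    """Estimate I(X; Y) in nats for continuous samples x and y."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # k + 1 neighbors are requested because the nearest "neighbor" is the point itself.
    distances, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = distances[:, -1]
    n_x = _count_within(x, eps)
    n_y = _count_within(y, eps)
    return digamma(k) + digamma(n) - np.mean(digamma(n_x + 1) + digamma(n_y + 1))


rng = np.random.default_rng(0)
n = 1000
tf = rng.normal(size=n)
gene_dependent = 0.8 * tf + 0.6 * rng.normal(size=n)  # correlation of 0.8 with tf
gene_independent = rng.normal(size=n)
mi_dependent = ksg_mutual_information(tf, gene_dependent)
mi_independent = ksg_mutual_information(tf, gene_independent)
```

For the correlated pair, the analytical value for bivariate Gaussian data is -0.5 ln(1 - 0.8^2), about 0.51 nats, so the estimate should fall near that value, while the independent pair should give an estimate near zero.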
We first compiled a list of TFs using gene annotation information from the Gene Ontology database, the NCBI (Entrez) Gene database,
and several previous publications. More than 1800 genes were finally annotated as TFs or putative TFs. Specifically, in the first step,
for any possible combination of a TF $i$ ($TF_i$), a potential target gene $j$ ($Gene_j$) and its DNA copy number ($CNV_j$), as well as
a CpG site $m$ within the promoter region of the target gene ($CpG_{jm}$), we calculated the cMI between the TF and the target gene
with the CpG methylation and the DNA copy number of the target gene as the condition:
$\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ (Kraskov et al., 2004). The mutual information (MI) between the TF and the target
gene, $\mathrm{MI}(TF_i; Gene_j)$, was also calculated. Because the absolute value of cMI depends on the MI, it is inappropriate to
directly compare the cMIs from different TF-gene combinations. Therefore, we adapted the methodology used by the MINDy algorithm
(Wang et al., 2009a, 2009b) to derive a P value-estimating function, which evaluates the statistical significance of the additional
information brought by the two conditions ($CNV_j$, $CpG_{jm}$) as a whole based on both values of MI and cMI.
Briefly, we first randomized the sample orders of the CpG methylation and CNV profiles, and then we calculated
$\mathrm{MI}(TF; Gene)$ and $\mathrm{cMI}(TF; Gene \mid CNV, CpG)$ for 10,000 randomly picked TF-gene pairs. These 10,000 cMI/MI
entries calculated from the randomized data were then sorted and evenly allocated into 100 bins according to the MI values. This
generated a null distribution of the cMIs in each bin for a small range of the MIs. Next, from the real data, each TF-gene pair can
be allocated to one of these
100 bins according to the value of $\mathrm{MI}(TF_i; Gene_j)$, and then the P value of each
$\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ can be estimated by comparison with the null distribution of the particular bin.
Finally, the P value cut-off was set so that only 1% of the cMI entries involving the true negative spike-ins in the original data
could pass through (Figure 1A). Eventually, 10%–20% of all the possible combinations $(TF_i; Gene_j \mid CNV_j, CpG_{jm})$
genome-wide passed this first filter, resulting in a large reduction of the search space for the following step.
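The MI-binned null distribution of the first filter can be sketched as follows. The text does not specify how the P value is read from each bin, so this example assumes a plain empirical P value with a pseudo-count of one; the function names and the simulated null values are hypothetical.

```python
import numpy as np


def build_binned_null(null_mi, null_cmi, n_bins=100):
    """Sort the randomized (MI, cMI) pairs by MI and split them evenly into bins."""
    order = np.argsort(null_mi)
    mi_chunks = np.array_split(np.asarray(null_mi)[order], n_bins)
    cmi_chunks = np.array_split(np.asarray(null_cmi)[order], n_bins)
    upper_edges = np.array([chunk[-1] for chunk in mi_chunks])  # largest MI in each bin
    return upper_edges, [np.sort(chunk) for chunk in cmi_chunks]


def binned_p_value(mi, cmi, upper_edges, cmi_bins):
    """Empirical P value of `cmi` against the null cMIs of the bin matching `mi`."""
    idx = int(min(np.searchsorted(upper_edges, mi, side="left"), len(cmi_bins) - 1))
    null = cmi_bins[idx]
    n_at_least = len(null) - np.searchsorted(null, cmi, side="left")
    return (n_at_least + 1) / (len(null) + 1)  # pseudo-count avoids P = 0


# Simulated null values standing in for the 10,000 randomized TF-gene pairs.
rng = np.random.default_rng(0)
null_mi = rng.uniform(0.0, 1.0, size=10_000)
null_cmi = null_mi + rng.exponential(0.05, size=10_000)
upper_edges, cmi_bins = build_binned_null(null_mi, null_cmi)

p_extreme = binned_p_value(0.5, 10.0, upper_edges, cmi_bins)   # far above every null cMI
p_ordinary = binned_p_value(0.5, -1.0, upper_edges, cmi_bins)  # below every null cMI
```

With 100 null entries per bin, the smallest attainable P value in this sketch is 1/101, which is why the cut-off in the text is calibrated on the spike-ins rather than fixed in advance.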
The second step of our pipeline was set to further identify the $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ entries that
significantly depend on the CpG methylation (rather than on the CNV only). Specifically, the TF-gene-CpG combinations (triplets)
that passed the first filter were challenged by sample shuffling of the DNA methylation profiles. For each triplet, a null
distribution was established from the values of $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ calculated with the methylation
data randomized 1000 times. Because estimating mutual information at such a scale is computationally expensive, the number of
sample permutations was set to 1000 as a balance between computational cost and resolution of the FDR estimation. The final
P value can therefore be estimated by comparing the real $\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$ with this null
distribution. The final P value cut-off was set to 0.05, i.e., no more than 50 out of the 1000 sample-permuted cMIs are larger
than the cMI from the real data, for each triplet (Figure 1A). This process ensures that in the final collection of the
TF-gene-CpG combinations (triplets), the DNA methylation profile is truly a determining factor for the
$\mathrm{cMI}(TF_i; Gene_j \mid CNV_j, CpG_{jm})$. In fact, with this cut-off, almost all of the 1% true negative spike-ins that
passed the filter in the first step were discarded, suggesting that the P value cut-off is a reasonable choice.
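The second filter can be illustrated with a small, self-contained example. The cMI estimator below follows the k-nearest-neighbor form associated with Frenzel and Pompe (2007), written with dense distance matrices for brevity; it assumes continuous data without ties. The toy model, in which methylation flips the sign of the TF effect and CNV plays no role, is invented only to give a clear signal, and 200 permutations are used instead of 1000 to keep the example fast.

```python
import numpy as np
from scipy.special import digamma


def _chebyshev(a):
    """Pairwise max-norm distances between the rows of a (samples x features)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    return np.abs(a[:, None, :] - a[None, :, :]).max(axis=2)


def conditional_mi(x, y, z, k=3):
    """k-nearest-neighbor estimate of I(X; Y | Z) in nats."""
    dx, dy, dz = _chebyshev(x), _chebyshev(y), _chebyshev(z)
    dxz, dyz = np.maximum(dx, dz), np.maximum(dy, dz)
    dxyz = np.maximum(dxz, dy)
    # Column 0 after sorting is the zero self-distance, so column k is the k-th neighbor.
    eps = np.sort(dxyz, axis=1)[:, k][:, None]
    # Count neighbors strictly inside eps; "- 1" removes the point itself.
    n_xz = (dxz < eps).sum(axis=1) - 1
    n_yz = (dyz < eps).sum(axis=1) - 1
    n_z = (dz < eps).sum(axis=1) - 1
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1))


def methylation_permutation_test(tf, gene, cnv, cpg, n_perm=1000, k=3, seed=0):
    """Compare the observed cMI with cMIs obtained after shuffling the CpG profile."""
    rng = np.random.default_rng(seed)
    observed = conditional_mi(tf, gene, np.column_stack([cnv, cpg]), k=k)
    null = np.array([
        conditional_mi(tf, gene, np.column_stack([cnv, rng.permutation(cpg)]), k=k)
        for _ in range(n_perm)
    ])
    # Fraction of sample-permuted cMIs that are larger than the observed cMI.
    return observed, null, float(np.mean(null > observed))


rng = np.random.default_rng(1)
n = 300
tf = rng.normal(size=n)
cnv = rng.normal(size=n)  # nuisance variable in this toy model
methylated = rng.random(n) < 0.5
cpg = np.where(methylated, 0.9, 0.1) + rng.normal(scale=0.03, size=n)
gene = np.where(methylated, -tf, tf) + rng.normal(scale=0.1, size=n)

observed, null, p_value = methylation_permutation_test(tf, gene, cnv, cpg, n_perm=200)
```

A triplet would pass the cut-off described in the text when `p_value` is at most 0.05.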
Finally, we suspected that there could be some indirect TF-target associations in the results. For example, if TF1
regulates TF2, which directly regulates its target gene with a promoter CpG site involved, then it is possible that our pipeline
will identify two circuits, TF1-target and TF2-target, both depending on the same CpG site. Here, the TF1-target circuit would be
a false positive (an indirect regulation circuit) and should be removed. Therefore, we went through the results from step 2 and
removed these potential false positives due to indirect TF-target regulations. Approximately 10% of the previously identified
circuits were removed.
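One possible reading of this pruning rule is sketched below. The text does not state how "TF1 regulates TF2" was established, so the sketch assumes that it is read from the triplets themselves, i.e., TF2 appears as a target of TF1 in some triplet. The function name and the toy triplets are hypothetical.

```python
from collections import defaultdict


def remove_indirect_circuits(triplets):
    """Drop (TF1, target, CpG) when another TF2 shares the same target and CpG and TF1 regulates TF2.

    triplets: iterable of (tf, target, cpg) tuples from the second filter.
    """
    triplets = set(triplets)
    tf_regulates = {(tf, target) for tf, target, _ in triplets}
    tfs_by_target_cpg = defaultdict(set)
    for tf, target, cpg in triplets:
        tfs_by_target_cpg[(target, cpg)].add(tf)

    kept = set()
    for tf, target, cpg in triplets:
        other_tfs = tfs_by_target_cpg[(target, cpg)] - {tf}
        is_indirect = any((tf, other) in tf_regulates for other in other_tfs)
        if not is_indirect:
            kept.add((tf, target, cpg))
    return kept


toy_triplets = [
    ("TF1", "TF2", "cg10"),    # TF1 regulates TF2
    ("TF2", "GENE", "cg01"),   # direct circuit
    ("TF1", "GENE", "cg01"),   # indirect circuit through TF2, to be removed
    ("TF3", "GENE", "cg01"),   # unrelated TF on the same CpG, kept
    ("TF1", "GENE2", "cg02"),  # different target, kept
]
direct_circuits = remove_indirect_circuits(toy_triplets)
```

Under this reading, two TFs that regulate each other and share a target CpG would both be removed, so a production version would need an explicit tie-breaking rule.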
The same procedure was performed with the datasets from each of the 21 cancer types, and the results are summarized as 21
cancer context-specific DNA MeTRNs.
Implementation of the pipeline
The whole pipeline consists of several sequential tasks: 1) data preparation, 2) MI and cMI calculation, 3) filter 1 for the
TF-target-CpG combinations, and 4) filter 2 for the MeTRN triplets. All the computational tasks were implemented in MATLAB, C++,
and Java. All the scripts, organized as a pipeline, can be downloaded from our website (http://labyang.com/data/MeTRN_scripts.rar).
Functional enrichment analysis
For each type of cancer, the genes that are heavily dependent on the DNA methylation-modulated transcriptional circuitry (R² from
the ordinary linear regression model Target ~ TF + CpG greater than 0.4; see the data provided in Data S3), and the genes highly
dependent on the CNV (R² from the model Target ~ CNV greater than 0.4; see the data provided in Data S3) were selected. Gene
Ontology (GO), Reactome, and KEGG enrichment analyses were conducted with different gene sets using the online tool Metascape
(http://metascape.org). The terms of biological processes that are significantly enriched in the gene sets of each cancer type were
collected and arranged according to their enrichment patterns across different cancer types. The list of cancer-related genes (1265 in
total) was obtained from the NCBI (Entrez) Gene database.
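The R² values used for gene selection come from ordinary least-squares fits such as Target ~ TF + CpG. The sketch below computes R² and adjusted R² with NumPy rather than with the Statsmodels package named elsewhere in this paper, and returns None when there are not more samples than fitted coefficients; the function name and the simulated data are assumptions.

```python
import numpy as np


def ols_r2(target, predictors):
    """Fit target ~ intercept + predictors by least squares; return (R2, adjusted R2) or None."""
    y = np.asarray(target, dtype=float)
    design = np.column_stack([np.ones(len(y)), predictors])
    n_samples, n_coef = design.shape  # n_coef counts the intercept
    if n_samples <= n_coef:
        return None  # too few samples for the number of TFs and CpG sites
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adjusted = 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_coef)
    return r2, adjusted


rng = np.random.default_rng(0)
n = 200
tf = rng.normal(size=n)
cpg = rng.uniform(size=n)
target = 2.0 * tf - 3.0 * cpg + rng.normal(scale=0.5, size=n)

r2_tf, _ = ols_r2(target, tf)                              # Target ~ TF
r2_cpg, _ = ols_r2(target, cpg)                            # Target ~ CpG
r2_both, adj_both = ols_r2(target, np.column_stack([tf, cpg]))  # Target ~ TF + CpG
invalid = ols_r2(target[:5], rng.normal(size=(5, 10)))     # more variables than samples
```

In this simulated case the combined model explains more variance than either single-predictor model, and a gene with `r2_both` above 0.4 would be selected under the criterion stated above.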
Patient classification and survival analysis
In each MeTRN, the genes that are strongly regulated by the TF-CpG circuits (R² from the linear combination model Target ~ TF +
CpG greater than 0.5) were selected, and the CpG sites and the TFs taking part in the regulatory circuits of each MeTRN were
collected as patient classifiers. The TF expression values (TPM) were linearly scaled to [0, 1], the same range as the DNA methylation
data (beta values). In each cancer type, the methylation profiles of the CpG sites and the expression levels of the TFs were used for
an unsupervised hierarchical clustering of the patients. We used an unsupervised hierarchical tree cutting strategy to determine
the optimal number of clusters k before the survival analysis. Specifically, we used the method of Silhouette-based cutting, which
has been frequently used for determination of cluster numbers based on high-throughput omics data (Cancer Genome Atlas
Network, 2012, 2015; Cancer Genome Atlas Research Network, 2012, 2014; Ranzani et al., 2015; Sathyamurthy et al., 2018). In
most of these studies, the numbers of patient subgroups were eventually determined based on the clustering results and with the
help of prior knowledge about the cancer subtypes, prognostic differences, and evidence from other sources. For most of the
cancers, more than 3 subtypes have been defined by TCGA. Therefore, we set k to be greater than 3 and used a
Silhouette-based approach to determine the optimal number of subtypes. This generalized strategy for all 21 cancers may not yield
the best result for some particular cancer types. However, the purpose of the clustering analyses was only to show that coupled TFs
and CpG sites can be used to define prognostically different patient subgroups. Cancer type-specific survival analyses were then
performed for each of the patient subgroups resulting from the clustering analysis. The Kaplan–Meier estimator and the log-rank
test were used for statistical assessments of the survival curves, and correction for multiple testing was done with the two-stage
linear step-up procedure when applicable. The R package 'survival' was used for comparison of the patient survival rates
among different subgroups.
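The scaling and Silhouette-based tree cutting can be sketched with SciPy and scikit-learn, the packages named for Figure 6. The sketch assumes Ward linkage on Euclidean distances and a search over 4 to 9 clusters (more than 3 and fewer than 10); the function names and the simulated patients are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score


def min_max_scale(values):
    """Scale each feature (column) linearly to [0, 1]: f(x) = (x - A) / (B - A)."""
    low, high = values.min(axis=0), values.max(axis=0)
    return (values - low) / (high - low)


def cut_by_silhouette(features, k_min=4, k_max=9):
    """Cut a Ward tree at the cluster number with the largest mean silhouette."""
    tree = linkage(features, method="ward", metric="euclidean")
    best_k, best_score, best_labels = None, -np.inf, None
    for k in range(k_min, k_max + 1):
        labels = fcluster(tree, t=k, criterion="maxclust")
        score = silhouette_score(features, labels, metric="euclidean")
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_score, best_labels


# Simulated patients: 5 well-separated groups of 20, described by 5 features.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(5)
raw = np.vstack([center + rng.normal(scale=0.3, size=(20, 5)) for center in centers])
features = min_max_scale(raw)  # rows: patients; columns: TFs or CpG sites
best_k, best_score, labels = cut_by_silhouette(features)
```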
Differential expression and differential methylation analysis
Paired Student's t test was used to assess the mRNA differential expression of the genes in different sample groups. Default
arguments for t.test() in R were used, which provided estimates of the statistical significance (P value) of the differential expression.
A volcano plot was generated to illustrate the expression fold change and the adjusted P value for each gene.
The Wilcoxon signed-rank test (the paired counterpart of the Mann-Whitney test) was used to assess the statistical significance of
the difference in methylation level between different groups of tumor samples, for each CpG site. The exact P value was computed
instead of a normal approximation. The medians of the methylation differences (tumor – normal) were also computed. A volcano plot
was generated to illustrate the median of the methylation difference and the adjusted P value for each CpG site.
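For illustration, the two paired tests can be run in Python with SciPy as shown below. This is a stand-in for the R calls described above, not a reproduction of them, and the paired tumor and normal values are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_pairs = 10

# Simulated paired measurements for one gene and one CpG site.
normal_expr = rng.normal(8.0, 1.0, size=n_pairs)
tumor_expr = normal_expr + 1.0 + rng.normal(0.0, 0.2, size=n_pairs)
normal_beta = rng.uniform(0.2, 0.6, size=n_pairs)
tumor_beta = normal_beta + 0.2 + rng.uniform(-0.05, 0.05, size=n_pairs)

# Paired t test on expression (tumor versus matched normal).
t_stat, t_p = stats.ttest_rel(tumor_expr, normal_expr)

# Wilcoxon signed-rank test on the paired methylation values.
w_stat, w_p = stats.wilcoxon(tumor_beta, normal_beta)

# Median of the methylation differences (tumor - normal).
median_diff = np.median(tumor_beta - normal_beta)
```

The P values from many genes or CpG sites would still need a multiple-testing adjustment before being drawn in a volcano plot.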
QUANTIFICATION AND STATISTICAL ANALYSIS
Software and statistical analysis
For the following analyses, we mainly used Python 3 (version 3.6), R (version 3.3), and Scala (version 2.11.8).
Figure 2
Values are extracted directly from the triplet files (Data S1) of 21 cancer types.
Figure 3
A and B. The number of data points for each box is the number of lines in the corresponding file in Data S3. The linear regression was
done using the ordinary least-squares (OLS) approach (https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) in the
Python package "Statsmodels." Because some target genes are regulated by many TFs and/or CpG sites, the OLS fit can fail
when there are fewer samples than variables (TFs and CpG sites). If a target gene has an invalid status in any of the "Target ~ TF,"
"Target ~ CpG," "Target ~ TF + CpG," or "Target ~ TF + CpG + CNV" linear regression models, it is excluded from the figure. In the
figure, the point in a box represents the mean. Points outside the box and whiskers are outliers.
C. We selected target genes with an adjusted R-squared value of the "Target ~ TF + CpG" (MeTRN) model greater than 0.4, submitted
them to Metascape (http://metascape.org/) to perform "Express analysis," and retrieved the outcome. A Python script was used to
reorganize the rows and columns of the result matrix; the values were not modified.
Figure 4
A. Only cancer-related genes are shown in this figure. The data is from Figure 3A.
B. We sorted cancer-related target genes with the median of their adjusted R-squared value across 21 cancer types in descending
order and selected 10 genes of interest among the top.
C and D. We directly count the triplet file (Data S1) to get the values.
Figure 5
A. Adjusted R-squared values are described in Figure 3. The similarity between cancers is calculated using the R-squared values of
the genes present in both cancers.
Figure 6
A. The clustering of rows and columns is done using Euclidean distance and the Ward linkage function. Gene expression values are
normalized to [0, 1] by the function f(x) = (x - A) / (B - A), where A and B are the minimum and maximum of the expression values of
that gene across samples. The Python package "Scipy" is used to calculate distances and linkages. The clusters are cut at the maximum
silhouette value, with the restriction that the number of clusters must be greater than 3 and less than 10. The silhouette is
calculated using the "sklearn.metrics.silhouette_score" function from the Python package "scikit-learn."
B, C, D. Kaplan-Meier estimator is used to generate survival function. Log-rank test is used to compare the survival distributions
between two clusters. Cluster size: N(k1) = 175, N(k2) = 55, N(k3) = 10, N(k4) = 32. Python package ‘‘lifelines’’ is used to perform the
survival analysis.
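The two survival statistics named for panels B, C, and D can be written out directly. The sketch below uses NumPy and SciPy instead of the "lifelines" package used for the figure, so it is an independent illustration with hypothetical function names and toy follow-up times, not the code behind the published curves.

```python
import numpy as np
from scipy.stats import chi2


def kaplan_meier(times, events):
    """Return the distinct event times and the Kaplan-Meier survival estimate at each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    survival, current = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & events)
        current *= 1.0 - deaths / at_risk
        survival.append(current)
    return event_times, np.array(survival)


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the chi-square statistic and its P value."""
    times = np.concatenate([times_a, times_b]).astype(float)
    events = np.concatenate([events_a, events_b]).astype(bool)
    in_a = np.concatenate([np.ones(len(times_a), bool), np.zeros(len(times_b), bool)])
    observed_minus_expected, variance = 0.0, 0.0
    for t in np.unique(times[events]):
        at_risk = times >= t
        n, n_a = at_risk.sum(), (at_risk & in_a).sum()
        d = ((times == t) & events).sum()
        d_a = ((times == t) & events & in_a).sum()
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:  # the variance term is zero when a single subject remains
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    statistic = observed_minus_expected ** 2 / variance
    return statistic, chi2.sf(statistic, df=1)


# Toy follow-up data: event = 1 means death observed, 0 means censored.
event_times, survival = kaplan_meier([5, 8, 8, 12, 15, 20], [1, 1, 0, 1, 0, 1])

early = np.arange(1, 11)   # group with short survival, all events observed
late = np.arange(11, 21)   # group with long survival, all events observed
all_events = np.ones(10, dtype=int)
stat_diff, p_diff = logrank_test(early, all_events, late, all_events)
stat_same, p_same = logrank_test(early, all_events, early, all_events)
```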
Figure 7
A. Differential analysis of CpG sites and genes. For CpG sites, the difference is calculated as the median of the methylation
beta values in the samples of cluster k3 (n = 10) minus the median of the methylation beta values in the samples of cluster k1
(n = 175). The fold change of TFs is calculated by dividing the median of the gene expression values of cluster k3 by that of
cluster k1. The P values here are calculated by the Wilcoxon rank-sum test (Python package Scipy, method: "scipy.stats.ranksums").
C. GSEA 3.0 from Broad Institute is used to perform gene set enrichment analysis of target genes. We set ‘‘Metric for ranking
genes’’ to ‘‘tTest’’ and ‘‘Collapse dataset to gene symbols’’ to false. All remaining critical arguments for GSEA are untouched.
D. We used Metascape to perform the Gene Ontology term enrichment. The number of genes is shown in C.
Figure S1
A. Violin plots were created with the Python package "seaborn." The dashed lines are the 75th, 50th, and 25th percentiles. The plot
is generated with the "cut" argument set to 0 so that the violins do not extend beyond the range of the underlying data.
B. Each point represents a TF. The color of it reflects the number of targets of this TF. The value of the point reflects the percentage
of predicted target genes which are present in ENCODE ChIP-seq data of the corresponding TF.
C. For each TF (point), we performed the Wilcoxon rank-sum test to detect the difference in ChIP-seq signal between CpG sites in
the MeTRN and those not in the MeTRN. Because the sample size is relatively large and the samples do not follow a known
distribution, we chose this non-parametric hypothesis test to detect the difference.
Figure S2
A. Each box in the figure contains the same number of target genes as in Figure 3A. Linear regression models with different predictor
variables were used to fit the expression profile of each target gene in the MeTRN for each cancer type. The coefficient of
determination (R²) for each gene was calculated from the model of Target ~ TFs + CpG sites, Target ~ TFs, or Target ~ CpG with data
from a particular cancer type. Finally, boxplots were prepared to show the distributions of the R² values of all the genes with these
3 different models for each cancer type.
B. The hypergeometric test is used to calculate the enrichment score. Because different cancers have different sets of target genes,
the intersection of all possible target genes of both cancer types is first taken as the population. Then, we treat the target genes
in one cancer type as the "successes" and the number of target genes in the other cancer type as the size of the draw.
C, D. They are generated in the same way as Figure 3C.
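The overlap test described for panel B can be expressed with scipy.stats.hypergeom. In this sketch the population, the two target-gene sets, and the function name are hypothetical; the P value is the probability of an overlap at least as large as the one observed.

```python
from scipy.stats import hypergeom


def overlap_p_value(targets_a, targets_b, population):
    """Hypergeometric P value for the overlap of two target-gene sets.

    population: genes that could be targets in both cancer types.
    targets_a:  treated as the "successes" in the population.
    targets_b:  its size is the size of the draw.
    """
    population = set(population)
    a = set(targets_a) & population
    b = set(targets_b) & population
    overlap = len(a & b)
    # sf(k - 1) gives P(X >= k) for the hypergeometric distribution.
    p_value = hypergeom.sf(overlap - 1, len(population), len(a), len(b))
    return overlap, p_value


population = [f"gene_{i}" for i in range(1000)]
targets_cancer1 = [f"gene_{i}" for i in range(0, 100)]
targets_cancer2 = [f"gene_{i}" for i in range(70, 150)]   # shares 30 genes with cancer 1
targets_cancer3 = [f"gene_{i}" for i in range(500, 580)]  # shares none

overlap_12, p_12 = overlap_p_value(targets_cancer1, targets_cancer2, population)
overlap_13, p_13 = overlap_p_value(targets_cancer1, targets_cancer3, population)
```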
Figure S3
These figures are drawn in the same way as Figure 3B except that they show the values of target genes from the MeTRNs of other 20
cancer types.
Figures S4–S7
The heatmaps of clustering results and survival analysis are generated in the same way as Figure 6.
Strategy for randomization
In this study, we used the uniform distribution to randomize the order of samples. No stratification was applied; all samples
of a cancer type were treated equally.
Data exclusion
We removed genes and CpG sites from the dataset if they did not meet specific criteria, which are described in "Data collection
and pre-processing" in the Method Details section.
We removed target genes for which the ordinary least-squares linear regression could not be fitted in the R-squared value-related
analyses.
In the survival analysis, we excluded all samples (patients) whose clinical data were unavailable.
DATA AND SOFTWARE AVAILABILITY
All the analysis results and datasets in this article are being provided as supplemental data files (Data S1, S2, S3, S4, and S5). The
algorithms used in this article can be downloaded from http://labyang.com/data/MeTRN_scripts.rar. A web-based searchable data
portal of the MeTRNs is available at http://labyang.com/resources/metrn/table.html.
‘‘Triplets’’ in the MeTRN networks of 21 cancers: Data S1
Target genes of the cancer-related TFs in 21 MeTRNs: Data S2
Genes sorted by the coefficient of determination (R²) from the linear combination model (Target ~ TF + CpG) or by the R² from the
model (Target ~ CNV): Data S3
Dependencies of the cancer genes on the CNV or the DNA methylation-modulated transcriptional regulation circuits in the
MeTRNs: Data S4
Differential expression of the 598 target genes, in k3 versus k1 of KIRP, of the selected TFs and CpG sites in the KIRP MeTRN:
Data S5
Cell Reports, Volume 26
Supplemental Information
Dependency of the Cancer-Specific Transcriptional
Regulation Circuitry on the Promoter DNA Methylome
Yu Liu, Yang Liu, Rongyao Huang, Wanlu Song, Jiawei Wang, Zhengtao Xiao, Shengcheng Dong, Yang Yang, and Xuerui Yang
SUPPLEMENTARY FIGURES AND TABLES
Dependency of the cancer-specific transcriptional regulation circuitry on the
promoter DNA methylome
Yu Liu1-4,# , Yang Liu1,3-5,#, Rongyao Huang1,3,4,#, Wanlu Song1-4, Jiawei Wang1,3,4, Zhengtao Xiao1-4,
Shengcheng Dong1,3,4, Yang Yang1,3,4, and Xuerui Yang1-4,*
1MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
2Tsinghua-Peking Joint Center for Life Sciences, Beijing 100084, China
3Center for Synthetic & Systems Biology, Tsinghua University, Beijing 100084, China
4School of Life Sciences, Tsinghua University, Beijing 100084, China
5Joint Graduate Program of Peking-Tsinghua-National Institute of Biological Science, Tsinghua
University, Beijing 100084, China.
# These authors contributed equally.
* Correspondence should be addressed to X.Y. ([email protected], +86-10-62783943).
SUPPLEMENTARY FIGURES
Figure S1
Figure S1. Detailed statistics and cross-validations of the 21 cancer type-specific MeTRNs. Related to Figure
1.
(A) Violin plots, for each cancer type-specific MeTRN, showing distributions of the numbers of target genes of the
TFs. The information about the CpG sites involved in the TF-target gene circuits was not incorporated. Taking the
CpG sites into account, Violin plots show distributions of the numbers of promoter CpG-gene pairs for the TFs. (B)
Validation rates of the TF targets predicted by the MeTRNs. Each dot represents a TF and shows the percentage of
the predicted targets in a MeTRN that are supported by the ChIP-seq data of that particular TF. The color of each
dot represents the number of predicted targets of the particular TF in the MeTRN. (C) Each dot represents a TF and
shows the difference in the ChIP-seq peak signals spanning the promoter CpG sites of the target genes predicted by
the MeTRNs for that particular TF, compared to the peak signals spanning the other promoter CpG sites of the same
groups of genes but not included in the MeTRNs. Log2 of the ratios (average ChIP-seq peak signals spanning the
CpG sites in the MeTRNs over the average signals spanning the sites not in the MeTRNs) and the statistical
significance (-log10 of the P-values with a t-test) of the differences are provided on the X-axis and Y-axis,
respectively.
Figure S2
[Figure S2 image. Panel A: box plots for the models Target ~ TF, Target ~ CpG, and Target ~ TF + CpG. Panel B: matrix of the numbers
of overlapped genes and the -log10(Pv) of the overlap between cancer types. Panels C and D: heat maps of enriched terms across cancer
types, colored by -log10(Pv) (scale marks 0, 3, 6, 10, 20). Row labels recovered from the heat maps: cell-substrate adhesion; small
GTPase mediated signal transduction; positive regulation of hydrolase activity; actin filament-based process; cell-cell adhesion;
negative regulation of cell differentiation; establishment or maintenance of cell polarity; bone development; digestive tract
development; positive regulation of kinase activity; cell morphogenesis involved in differentiation; retina development in
camera-type eye; T cell activation; regulation of lymphocyte activation; antigen processing and presentation of peptide antigen via
MHC class Ib; lymphocyte homeostasis; cranial nerve morphogenesis; cell fate commitment; morphogenesis of an epithelium;
regionalization.]
Figure S2. Prediction powers of the MeTRNs for gene expression profiles and comparison with Target ~ CpG
and Target ~ TF models. Related to Figure 3.
(A) Linear regression models with different predictor variables were used to fit the expression profile of each target
gene in the MeTRN for each cancer type. The coefficient of determination (R2) for each gene was calculated from
the model of Target ~ TFs + CpG sites, Target ~ TFs, or Target ~ CpG with data from a particular cancer type.
Finally, box plots were prepared to show the distributions of the R2 values of all the genes with these 3 different
models for each cancer type. (B) The numbers of genes that are heavily regulated via the methylation-dependent
transcriptional machinery (R2 from the linear combination model Target ~ TF + CpG greater than 0.4, from Data S3)
in each cancer type are shown along the diagonal of the rotated matrix. These genes were then compared between
each pair of cancers; the number (upper triangle) and the P-value (lower triangle) of the overlap are given in the
matrix. (C) The biological and physiological processes enriched in the top genes that are highly dependent on the
CpG sites in the MeTRNs (R2 of Target ~ CpG > 0.4). (D) The biological and physiological processes enriched in
the top genes that are highly dependent on the TFs in the MeTRNs (R2 of Target ~ TF > 0.4). Saturation of the color
indicates the statistical significance (-log10(Pv)) of each term.
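The model comparison in panel (A) can be illustrated with a small sketch. It fits three ordinary least-squares models to one synthetic target gene and reports R2 for each. The data, the variable names (`tf`, `cpg`, `target`), and the helper functions are assumptions made for illustration; this is not the authors' pipeline, and it uses a single TF and a single CpG site for brevity.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(predictors, target):
    """R2 of an OLS fit: target ~ intercept + predictors."""
    n = len(target)
    design = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * y for row, y in zip(design, target)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(c * x for c, x in zip(beta, row)) for row in design]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot

# Synthetic example (hypothetical values): one TF, one CpG site, one target.
random.seed(0)
n_tumors = 200
tf = [random.gauss(0, 1) for _ in range(n_tumors)]    # TF expression
cpg = [random.random() for _ in range(n_tumors)]      # methylation beta value
target = [1.5 * t - 2.0 * c + random.gauss(0, 0.5) for t, c in zip(tf, cpg)]

r2_tf = r_squared([tf], target)          # Target ~ TF
r2_cpg = r_squared([cpg], target)        # Target ~ CpG
r2_both = r_squared([tf, cpg], target)   # Target ~ TF + CpG
print(f"R2 TF={r2_tf:.3f}  CpG={r2_cpg:.3f}  TF+CpG={r2_both:.3f}")
```

Because the single-predictor models are nested in the combined model, the combined R2 can never be lower than either of them on the same data; the size of the gain is what the box plots compare.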
Figure S3. Scatter plots of the coefficients of determination (R2) from different linear regression models.
Related to Figure 3.
Scatter plots, similar to the one shown in Fig. 3B, for each of the other 20 cancer types.
[Figure S3 (image). Scatter plots. Axes: R2 of Target~CNV versus R2 of Target~TF+CpG, scale 0 to 1.]
[Figure S4 (image). Clustered heat maps for 20 cancer types, in order: LGG, PAAD, BRCA, SARC, LAML, KIRC, LIHC, SKCM, CESC, LUSC, HNSC, PCPG, THCA, ESCA, UCEC, BLCA, LUAD, TGCT, STAD, COAD. Rows: TFs and CpGs; columns: tumors. Color scale: methylation or TF level, 0 to 1. Subgroup legend: k1 to k9.]
Figure S4. Cancer type-specific classifications of the patients based on the coupled TFs and CpG sites in the
MeTRNs. Related to Figure 6.
For each of the 20 cancer types, unsupervised hierarchical clustering was performed on the TF expression
and CpG site methylation profiles of the DNA methylation-modulated transcriptional regulation circuits in the
MeTRN that exhibited strong prediction power for the target gene expression profiles, as shown by the linear
combination model Target ~ TF + CpG. Silhouette-based cutting was applied to determine the
optimal number of subgroups for each cancer type. The patient subgroups resulting from the clustering analysis are
marked with different colors. The 20 cancers are sorted by the significance levels (P-values) of the prognostic
difference between the subtypes.
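The clustering step described above can be sketched as follows. The linkage rule (average linkage), the distance metric (Euclidean), and the toy patient profiles are assumptions, since the caption specifies none of them; the function names are hypothetical. The sketch builds the hierarchy, scores every cut from 2 to `max_k` clusters by mean silhouette width, and keeps the best cut.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, max_k):
    """Average-linkage clustering; returns ({k: clusters}, distance matrix)."""
    dist = [[euclidean(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]
    partitions = {}
    while True:
        if len(clusters) <= max_k:
            partitions[len(clusters)] = [c[:] for c in clusters]
        if len(clusters) == 1:
            break
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return partitions, dist

def mean_silhouette(clusters, dist):
    scores = []
    for ci, members in enumerate(clusters):
        for p in members:
            if len(members) == 1:
                scores.append(0.0)   # usual convention for singletons
                continue
            a = sum(dist[p][q] for q in members if q != p) / (len(members) - 1)
            b = min(sum(dist[p][q] for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def best_cut(points, max_k):
    """Pick the number of clusters (>= 2) with the highest mean silhouette."""
    partitions, dist = agglomerate(points, max_k)
    scored = {k: mean_silhouette(c, dist)
              for k, c in partitions.items() if k >= 2}
    best_k = max(scored, key=scored.get)
    return best_k, partitions[best_k], scored

# Toy profiles (hypothetical): each row is a patient, (TF level, CpG level).
patients = [
    (0.10, 0.90), (0.15, 0.85), (0.12, 0.88), (0.08, 0.92),
    (0.50, 0.50), (0.55, 0.45), (0.52, 0.48), (0.48, 0.53),
    (0.90, 0.10), (0.85, 0.15), (0.88, 0.12), (0.93, 0.09),
]
best_k, subgroups, scores = best_cut(patients, max_k=5)
print(best_k, sorted(sorted(g) for g in subgroups))
```

Scoring each candidate cut of one hierarchy, rather than re-clustering for every k, keeps the subgroups nested, which matches the idea of cutting a single dendrogram.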
[Figure S5 (image). Kaplan–Meier panels for 20 cancer types. X-axis: Month (0 to 60); y-axis: Overall Survival (0 to 1); subgroup legend: k1 to k9.
Log-rank P-values annotated in the panels:
LGG p(k2,k4)=4.4955e-14; PAAD p(k1,k3)=9.3001e-04; BRCA p(k2,k3)=1.4740e-03; SARC p(k1,k2)=2.1034e-03;
LAML p(k2,k7)=3.2541e-03; KIRC p(k2,k4)=4.7267e-03; LIHC p(k1,k4)=9.3813e-03; SKCM p(k3,k4)=1.5047e-02;
CESC p(k1,k2)=2.1601e-02; LUSC p(k2,k4)=6.5707e-02; HNSC p(k1,k3)=7.0604e-02; PCPG p(k2,k4)=1.3760e-01;
THCA p(k1,k4)=1.6704e-01; ESCA p(k1,k3)=2.0227e-01; UCEC p(k2,k4)=2.7323e-01; BLCA p(k1,k2)=2.9161e-01;
LUAD p(k1,k4)=3.1344e-01; TGCT p(k2,k3)=3.1731e-01; STAD p(k3,k4)=3.7629e-01; COAD p(k1,k2)=6.3286e-01.]
Figure S5. Survival curves of the cancer subtypes identified by the MeTRN regulators. Related to Figure 6.
Cancer type-specific Kaplan–Meier survival curves showing comparisons of overall survival between different
subgroups of patients, which were classified based on the MeTRN regulators and shown in Fig. S4. P-values for the
statistical significance of the largest prognosis difference in each cancer were inferred with log-rank tests. The 20
cancers were organized in the same order as shown in Fig. S4.
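The "largest prognosis difference" reported per cancer can be read as the most significant pairwise log-rank comparison among the subgroups. A minimal sketch of that reading follows, using a standard two-group log-rank test with a chi-square approximation (1 degree of freedom). The survival data and the names `logrank_test` and `most_divergent_pair` are hypothetical, and the sketch makes no claim about the software the authors used.

```python
import math
from itertools import combinations

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (death) or 0 (censored)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for x in times_a if x >= t)
        at_risk_b = sum(1 for x in times_b if x >= t)
        deaths_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        deaths_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        observed_a += deaths_a
        expected_a += d * at_risk_a / n
        if n > 1:
            # Hypergeometric variance of the deaths in group A at time t.
            variance += (d * (at_risk_a / n) * (at_risk_b / n)
                         * (n - d) / (n - 1))
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def most_divergent_pair(groups):
    """Return (p_value, (name_a, name_b)) for the most significant pair."""
    results = []
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        _, p = logrank_test(a["months"], a["event"], b["months"], b["event"])
        results.append((p, (name_a, name_b)))
    return min(results)

# Hypothetical follow-up data (months, event flag) for three subgroups.
groups = {
    "k1": {"months": [5, 8, 12, 20, 30, 45], "event": [1, 1, 1, 1, 0, 1]},
    "k2": {"months": [40, 50, 55, 60, 60, 60], "event": [1, 0, 0, 0, 0, 0]},
    "k3": {"months": [10, 25, 35, 50, 60, 60], "event": [1, 1, 0, 1, 0, 0]},
}
p_min, pair = most_divergent_pair(groups)
print(f"p({pair[0]},{pair[1]})={p_min:.4e}")
```

Taking the minimum over all pairs inflates significance when there are many subgroups, so a value obtained this way is best treated as a ranking score rather than a corrected P-value.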
[Figure S6 (image). Kaplan–Meier panels for 20 cancer types: LGG, PAAD, BRCA, SARC, LAML, KIRC, LIHC, SKCM, CESC, LUSC, HNSC, PCPG, THCA, ESCA, UCEC, BLCA, LUAD, TGCT, STAD, COAD. X-axis: Month (0 to 60); y-axis: Overall Survival (0 to 1); subgroup legend: k1 to k9.
Log-rank P-value annotations in source order, each with the panel label(s) that follow it:
p(k1,k3)=1.0353e-11: LGG PAAD BRCA; p(k1,k5)=2.7874e-03: SARC; p(k1,k2)=1.6172e-01: LAML; p(k2,k4)=3.4149e-04: KIRC;
p(k2,k4)=3.3632e-01: LIHC SKCM; p(k2,k4)=2.1546e-02: CESC; p(k2,k4)=7.3213e-02: LUSC HNSC; p(k2,k3)=5.2358e-01: PCPG;
p(k1,k4)=1.8808e-01: THCA ESCA UCEC; p(k2,k4)=3.7358e-02: BLCA; p(k1,k2)=1.1829e-01: LUAD; p(k1,k2)=2.8350e-01: TGCT;
p(k1,k4)=2.1362e-01: STAD; p(k1,k2)=1.1193e-01: COAD.
Annotations without an adjacent panel label: p(k3,k4)=1.4078e-03; p(k1,k2)=7.3507e-02; p(k3,k4)=4.4643e-01; p(k2,k3)=7.0668e-02; p(k1,k2)=5.0251e-01; p(k1,k2)=4.4931e-01.]
Figure S6. Survival curves of the cancer subtypes identified by the top variable CpG sites. Related to Figure 6.
Cancer type-specific Kaplan–Meier survival curves showing comparisons of overall survival between different
subgroups of patients, which were classified based on the most highly variable CpG sites. P-values for the statistical
significance of the largest prognosis difference in each cancer were inferred with log-rank tests. The 20 cancers were
organized in the same order as shown in Fig. S4.
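Selecting "the most highly variable CpG sites" can be sketched as a ranking by variance across tumors. The caption does not state which variability measure or cutoff was used, so the variance criterion, the `n_top` parameter, and the probe identifiers below are all assumptions for illustration.

```python
from statistics import pvariance

def top_variable_sites(methylation, n_top):
    """Rank CpG sites by variance of their beta values; keep the top n_top.

    methylation maps a site identifier to its beta values across tumors.
    """
    ranked = sorted(methylation,
                    key=lambda site: pvariance(methylation[site]),
                    reverse=True)
    return ranked[:n_top]

# Hypothetical beta values for three sites measured in four tumors.
methylation = {
    "cg_a": [0.10, 0.90, 0.20, 0.80],
    "cg_b": [0.05, 0.95, 0.05, 0.95],
    "cg_c": [0.50, 0.52, 0.48, 0.50],
}
selected = top_variable_sites(methylation, n_top=2)
print(selected)
```

The selected sites would then serve as the feature set for the same clustering and survival comparison applied to the MeTRN regulators.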
[Figure S7 (image). Kaplan–Meier panels for 20 cancer types: LGG, PAAD, BRCA, SARC, LAML, KIRC, LIHC, SKCM, CESC, LUSC, HNSC, PCPG, THCA, ESCA, UCEC, BLCA, LUAD, TGCT, STAD, COAD. X-axis: Month (0 to 60); y-axis: Overall Survival (0 to 1); subgroup legend: k1 to k9.
Log-rank P-value annotations in source order, each with the panel label(s) that follow it:
p(k1,k4)=2.5282e-11: LGG; p(k1,k4)=3.1144e-04: PAAD; p(k1,k2)=1.1346e-02: BRCA SARC LAML KIRC; p(k1,k4)=9.3607e-02: LIHC;
p(k1,k2)=1.6055e-02: SKCM; p(k1,k2)=7.3657e-02: CESC; p(k2,k4)=2.2146e-02: LUSC; p(k1,k4)=3.6606e-02: HNSC;
p(k3,k4)=1.3167e-01: PCPG; p(k1,k2)=5.5357e-02: THCA ESCA UCEC BLCA LUAD; p(k3,k4)=2.4150e-01: TGCT; p(k3,k4)=3.4325e-01: STAD COAD.
Annotations without an adjacent panel label: p(k3,k4)=1.5175e-02; p(k3,k4)=9.4658e-02; p(k2,k4)=3.5353e-03; p(k2,k4)=1.0985e-01; p(k3,k4)=1.9856e-01; p(k2,k3)=7.9838e-02; p(k3,k4)=1.2537e-01; p(k1,k2)=6.3928e-01.]
Figure S7. Survival curves of the cancer subtypes identified by TFs. Related to Figure 6.
Cancer type-specific Kaplan–Meier survival curves showing comparisons of overall survival between different
subgroups of patients, which were classified based on expressions of TFs with strong prediction powers for target
gene expression. The TF-target associations were inferred with ARACNE, which does not take into account the
DNA methylome profiles. P-values for the statistical significance of the largest prognosis difference in each cancer
were inferred with log-rank tests.
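ARACNE is generally described as scoring gene pairs by mutual information (MI) and then pruning likely indirect links with the data processing inequality (DPI). The sketch below illustrates only the DPI step on a hand-written MI table. The gene names, MI values, and the `tolerance` parameter are hypothetical, and this is a simplified illustration rather than the ARACNE implementation used in the study.

```python
from itertools import combinations

def dpi_prune(mi, tolerance=0.0):
    """Drop the weakest edge of every fully connected gene triplet.

    mi maps frozenset({gene_a, gene_b}) to a mutual information score.
    All triplets are inspected on the original edge set, and the marked
    edges are removed only at the end.
    """
    genes = sorted({g for pair in mi for g in pair})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in mi for e in edges):
            continue
        weakest = min(edges, key=lambda e: mi[e])
        strongest_others = min(mi[e] for e in edges if e != weakest)
        if mi[weakest] < strongest_others * (1.0 - tolerance):
            to_remove.add(weakest)
    return {e: v for e, v in mi.items() if e not in to_remove}

# Hypothetical MI scores: TF1 acts on TARGET only through TF2.
mi_scores = {
    frozenset(("TF1", "TF2")): 0.8,
    frozenset(("TF2", "TARGET")): 0.7,
    frozenset(("TF1", "TARGET")): 0.3,
}
network = dpi_prune(mi_scores)
print(sorted(tuple(sorted(e)) for e in network))
```

In this toy case the TF1 to TARGET edge is dropped as indirect, leaving only TF2 as a regulator of TARGET. No methylation information enters the calculation, which is the contrast the caption draws with the MeTRNs.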
SUPPLEMENTARY TABLES
Table S1. Sample numbers of the datasets from 21 cancer types in TCGA. Related to Figure 1.
                                                                        Tumor samples
Cancer Type*                                                            Me    mRNA  CNV   Overlap  SNP
BLCA  bladder urothelial carcinoma                                      411   406   407   402      396
BRCA  breast invasive carcinoma                                         737   1099  1086  726      987
CESC  cervical squamous cell carcinoma and endocervical adenocarcinoma  309   306   297   294      194
COAD  colon adenocarcinoma                                              285   286   454   265      268
ESCA  esophageal carcinoma                                              186   185   185   184      184
HNSC  head and neck squamous cell carcinoma                             530   516   524   510      509
KIRC  kidney renal clear cell carcinoma                                 320   532   529   312      417
KIRP  kidney renal papillary cell carcinoma                             276   291   288   272      161
LAML  acute myeloid leukemia                                            194   173   191   163      0
LGG   brain lower grade glioma                                          530   528   527   525      530
LIHC  liver hepatocellular carcinoma                                    379   373   372   366      198
LUAD  lung adenocarcinoma                                               451   513   518   441      543
LUSC  lung squamous cell carcinoma                                      359   501   501   356      177
PAAD  pancreatic adenocarcinoma                                         185   179   185   178      147
PCPG  pheochromocytoma and paraganglioma                                184   184   166   166      183
SARC  sarcoma                                                           245   261   261   237      259
SKCM  skin cutaneous melanoma                                           465   474   471   456      364
STAD  stomach adenocarcinoma                                            395   415   441   370      379
TGCT  testicular germ cell tumors                                       156   156   150   150      155
THCA  thyroid carcinoma                                                 515   513   512   510      405
UCEC  uterine corpus endometrioid carcinoma                             432   174   540   172      248
Total:                                                                  7544  8065  8605  7055     6704
*Note: 12 out of 33 types of cancer in TCGA were not included in the present study due to limited numbers of
tumor samples with data required for the analyses. These missing cancers are ACC, CHOL, DLBC, KICH, GBM,
MESO, OV, PRAD, READ, THYM, UCS, UVM.
Table S2. Statistics of the 21 cancer context-specific MeTRNs. Related to Figure 1.
Cancer Type  Triplets  TFs   CpG sites  Target genes  TF-Target  CpG-Target
BLCA         55163     1283  16701      5029          31773      17053
BRCA         67174     1363  20042      5660          42036      20545
CESC         74993     1301  20869      6068          48104      21481
COAD         47030     1375  10174      4288          43741      10300
ESCA         36594     1184  13372      5242          33587      13628
HNSC         66102     1346  19707      5695          32924      20248
KIRC         39416     1372  13759      5095          32791      14177
KIRP         54481     1321  15511      5252          38913      16021
LAML         25233     1263  10293      4384          23829      10583
LGG          57696     1380  19061      6340          36272      19725
LIHC         64274     1260  17635      5245          40498      18015
LUAD         58763     1341  16782      4990          35213      17181
LUSC         68522     1375  12064      4778          62270      12322
PAAD         32450     1415  13056      4434          23320      13300
PCPG         38298     1367  11721      4899          36445      11900
SARC         85445     1286  22115      6453          55765      22816
SKCM         89500     1291  19585      5611          52883      20162
STAD         78465     1354  24114      6233          47005      24755
TGCT         88692     1415  21927      6434          86890      22457
THCA         52042     1369  15116      5988          48378      15689
UCEC         49578     1297  17154      5230          32666      17584