COMPUTATIONAL PRIORITIZATION OF CANCER DRIVER GENES FOR PRECISION ONCOLOGY
by
RAUNAK SHRESTHA
B.Tech. Kathmandu University, 2009
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)
THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
August 2018
© RAUNAK SHRESTHA, 2018
The following individuals certify that they have read, and recommend to the Faculty of Graduate and
Postdoctoral Studies for acceptance, the dissertation entitled:
Computational Prioritization of Cancer Driver Genes for Precision Oncology
submitted by Raunak Shrestha in partial fulfillment of the requirements for
the degree of Doctor of Philosophy
in Bioinformatics
Examining Committee:
Dr. Colin C. Collins, Urologic Sciences
Supervisor
Dr. S. Cenk Sahinalp, Computer Science
Co-supervisor
Dr. Artem Cherkasov, Urologic Sciences
Supervisory Committee Member
Dr. David G. Huntsman, Pathology and Laboratory Medicine
University Examiner
Dr. Leonard Foster, Biochemistry and Molecular Biology
University Examiner
Additional Supervisory Committee Members:
Dr. Yuzhuo Wang, Urologic Sciences
Supervisory Committee Member
Dr. Wan Lam, Pathology
Supervisory Committee Member
Abstract
Advances in high-throughput sequencing technologies have drastically increased the efficiency with which
alterations in the genome, transcriptome, proteome, and epigenome of a cancer cell can be assessed. This has
increased the computational burden of analyzing these “big data”, making the translation of this knowledge
into insightful and impactful patient outcomes extraordinarily challenging.
Among these alterations, only a few “driver” alterations are expected to confer a crucial growth
advantage. These are greatly outnumbered by functionally inconsequential “passenger” alterations. This poses
a significant challenge for the identification of driver alterations, requiring solutions to novel algorithmic
problems. Although insight into driver alterations is critical to guide the selection of appropriate drug
therapies for a patient, no specific tools exist to help clinicians contextualize the enormous volume of
genomic information when making therapeutic decisions.
In this thesis, we describe novel algorithms for the identification and prioritization of cancer driver
genes. First, we describe HIT’nDRIVE, a combinatorial algorithm that measures the impact of genomic
aberrations on global changes in gene expression patterns to prioritize cancer driver genes. We also
demonstrate its application to large multi-omics cancer datasets to guide precision oncology. We further
describe an integrative multi-omics characterization of peritoneal mesothelioma, a rare cancer of the abdomen.
Using HIT’nDRIVE, we found that peritoneal mesothelioma with BAP1 loss forms a distinct
molecular subtype characterized by distinct gene expression patterns of chromatin remodeling, DNA
repair pathways, and immune checkpoint receptor activation. We demonstrate that this subtype is
correlated with an inflammatory tumor microenvironment and is thus a candidate for immune checkpoint
blockade therapies. Finally, we describe cd-CAP, a combinatorial algorithm that identifies subnetworks
with conserved molecular alteration patterns across a large subset of a tumor sample cohort. Notably, we
demonstrate that many of the largest highly conserved subnetworks within a tumor type consist solely of
genes that have been subject to copy number gain, are typically located on the same chromosomal arm, and
are thus likely the result of a single, large-scale copy number amplification.
Lay Summary
Cancer arises as a result of deleterious aberrations in the genetic material and its products. The
components of the genetic material interact with each other, forming an extremely complex web of networks. The
accumulation of abnormalities in the genetic material perturbs critical networks, which
may ultimately give rise to a tumor. Although many alterations accumulate in a tumor over its lifetime,
only a small fraction, known as “driver” alterations, are critical for tumor growth, while the majority,
known as “passenger” alterations, are not essential. Identifying driver alterations in the vast milieu of
passenger alterations is a challenging task, but it is critical for optimal cancer management.
In this thesis, we describe novel computational methods that use advanced mathematics and computer
science techniques to address the problems mentioned above. We demonstrate how our computational
tools establish links between driver alterations and tumour viability, thus revealing novel biological
insights into therapeutic strategies. This will guide the selection of appropriate anti-cancer drugs
and the development of new ones. We believe this work will accelerate translation from discovery to
effective cancer treatment.
Preface
In conjunction with my advisors, Dr. Colin C. Collins and Dr. S. Cenk Sahinalp, I was involved in the
conceptualization and design of the research activities described in this thesis. In particular, I designed,
developed, and implemented the computational algorithms described in this thesis. I performed the majority
of the data analysis for the molecular characterization of malignant peritoneal mesothelioma. I performed
the computational experiments, data analysis, and generation of figures, tables, and text in this thesis.
Where there are exceptions, they are noted below.
Chapter 1 was written by me.
The majority of Chapters 2 and 3 was written by me. The HIT’nDRIVE algorithm development
was done in collaboration with Mr. Ermin Hodzic, Dr. Gholamreza Haffari, and Dr. S. Cenk Sahinalp.
I performed the majority of the data analysis and generated the tables and figures. A portion of the
computational experiments was performed by Mr. Ermin Hodzic. Chapters 2 and 3 have been
published in: R. Shrestha, E. Hodzic, J. Yeung, K. Wang, T. Sauerwald, P. Dao, S. Anderson, H. Beltran,
M. A. Rubin, C. C. Collins, G. Haffari, and S. C. Sahinalp. HIT’nDRIVE: Multi-driver gene
prioritization based on hitting time. Research in Computational Molecular Biology: 18th Annual
International Conference, RECOMB 2014, Pittsburgh, PA, USA, April 2-5, 2014, Proceedings, pages 293–306,
2014. doi:10.1007/978-3-319-05269-4_23. URL http://dx.doi.org/10.1007/978-3-319-05269-4_23 and
R. Shrestha, E. Hodzic, T. Sauerwald, P. Dao, K. Wang, J. Yeung, S. Anderson, F. Vandin, G. Haffari,
C. C. Collins, and S. C. Sahinalp. HIT’nDRIVE: patient-specific multidriver gene prioritization
for precision oncology. Genome Research, 27(9):1573–1588, Sep 2017. ISSN 1549-5469.
doi:10.1101/gr.221218.117. URL https://www.ncbi.nlm.nih.gov/pubmed/28768687. The HIT’nDRIVE software
is available at the following URL: https://github.com/sfu-compbio/hitndrive
Chapter 4 was written by me. I performed the majority of the data analysis and generated the tables and figures.
This work was performed in collaboration with Dr. Noushin Nabavi. Dr. Andrew Churg, Dr. Htoo
Zarni Oo, Dr. Antonio Hurtado-Coll, Dr. Ladan Fazli, and Ms. Estelle Li generated the tissue microarray and
performed pathological slide staining and slide reviews. Dr. Noushin Nabavi, Mr. Hans H. Adomat,
Mr. Robert Shukin, Mr. Brian McConeghy, Ms. Anne Haegert, and Ms. Sonal Brahmbhatt performed
experiments and data generation. Dr. Yen-Yi Lin, Dr. Fan Mo, Dr. Stanislav Volik, Mr. Shawn Anderson,
and Mr. Robert H. Bell performed various computational experiments. This study was approved by the
Institutional Review Board of the University of British Columbia and the Vancouver Coastal Health
(REB Number. H1500902 and V15-00902). All samples and information were collected with written
and signed informed consent from the participating patients. The pre-print version of this chapter is
available at: R. Shrestha, N. Nabavi, Y.-Y. Lin, F. Mo, S. Anderson, S. Volik, H. H. Adomat, D. Lin,
H. Xue, X. Dong, R. Shukin, R. H. Bell, B. McConeghy, A. Haegert, S. Brahmbhatt, E. Li, H. Z. Oo,
A. Hurtado-Coll, L. Fazli, J. Zhou, Y. McConnell, A. McCart, A. Lowy, G. B. Morin, M. Daugaard, S. C.
Sahinalp, F. Hach, S. Le Bihan, M. E. Gleave, Y. Wang, A. Churg, and C. C. Collins. Integrated
Multi-omics Molecular Subtyping Predicts Therapeutic Vulnerability in Malignant Peritoneal Mesothelioma.
bioRxiv, 2018. doi:10.1101/243477. URL https://doi.org/10.1101/243477
Chapter 5 was written by me. This work was done in collaboration with Mr. Ermin Hodzic and Mr.
Kaiyuan Zhu. I performed the data preparation, developed the algorithm, performed the data analysis, and
generated the tables and figures. The pre-print version of this chapter is available at: E. Hodzic, R. Shrestha,
K. Zhu, K. Cheng, C. C. Collins, and S. C. Sahinalp. Combinatorial detection of conserved alteration
patterns for identifying cancer subnetworks. bioRxiv, 2018. doi:10.1101/369850. URL
https://doi.org/10.1101/369850. The cd-CAP software is available at the following URL:
https://github.com/ehodzic/cd-CAP
Table of Contents
Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
1.1 Cancer driver genes
1.2 Computational methods for the prediction of cancer driver genes
1.2.1 Identification of recurrent somatic alterations
1.2.2 Prediction of functional impact of somatic alterations
1.2.3 Pathway and interaction-network based approaches
1.3 Contributions
1.4 Organization of the thesis
2 HIT’nDRIVE: an algorithm for cancer driver genes prioritization using hitting time
2.1 Introduction
2.2 Our Contributions
2.3 HIT’nDRIVE Algorithmic Framework
2.3.1 Reformulation of RWFL as a Weighted Multi-Set Cover (WMSC) Problem
2.4 Results
2.4.1 HIT’nDRIVE parameters
2.4.2 HIT’nDRIVE: expression outlier stringency
2.4.3 HIT’nDRIVE: random alterations and random expression outliers
2.4.4 HIT’nDRIVE: network perturbation
2.4.5 HIT’nDRIVE: underlying network
2.4.6 Modified HIT’nDRIVE: when it is not required to prioritize at least one driver gene per patient
2.4.7 HIT’nDRIVE’s ability to capture CGC genes
2.4.8 Correlation of predicted driver genes with alteration burden
2.4.9 Phenotype classification using dysregulated modules seeded with the predicted driver genes
2.4.10 CGC cancer type-specific gene enrichment
2.4.11 Phenotype classification using CGC gene seeded modules
2.5 Discussion
2.6 Methods
2.6.1 Datasets and Analysis
2.6.2 Interaction networks
2.6.3 Validation dataset
2.6.4 Derivation of expression outlier genes
2.6.5 Derivation of expression outlier gene weights
2.6.6 Statistical significance of the overlap of driver genes with that of CGC database
3 Application of HIT’nDRIVE: patient-specific multi-driver gene prioritization for precision oncology
3.1 Introduction
3.2 Our Contributions
3.3 Results
3.3.1 HIT’nDRIVE predicts frequent as well as infrequent driver genes in multi-omics cancer datasets
3.3.2 Network properties of cancer driver genes
3.3.3 Breast cancer subtype classification using driver modules
3.3.4 Subtype-specific breast cancer driver modules are associated with survival outcome
3.3.5 HIT’nDRIVE seeded driver genes accurately predict drug efficacy
3.4 Discussion
3.5 Methods
3.5.1 Datasets and analysis
3.5.2 Genomics of drug sensitivity in cancer
3.5.3 Pathway enrichment analysis
3.5.4 Association of driver modules with patients’ survival outcome
4 Integrated multi-omics molecular subtyping predicts therapeutic vulnerability in malignant peritoneal mesothelioma
4.1 Introduction
4.2 Our Contributions
4.3 Results
4.3.1 Patient Cohort description
4.3.2 Landscape of somatic mutations in PeM
4.3.3 Copy number landscape in PeM
4.3.4 Gene fusions in PeM
4.3.5 The global transcriptome and proteome profile of PeM
4.3.6 Transcriptional and post-transcriptional mechanisms regulate chromatin remodeling protein-complexes
4.3.7 BAP1del subtype is characterized by distinct expression patterns of genes involved in DNA repair pathway, and immune checkpoint receptor activation
4.4 Discussion
4.5 Methods
4.5.1 Clinical samples and pathology evaluation
4.5.2 Construction of tissue microarrays (TMAs)
4.5.3 Immunohistochemistry and Histopathology
4.5.4 Whole exome sequencing
4.5.5 Somatic variant calling
4.5.6 Copy number aberration (CNA) calls
4.5.7 Transcriptome sequencing (RNA-seq)
4.5.8 Transcriptome (RNA-seq) quantification
4.5.9 Identification of fusion transcripts and validation
4.5.10 Proteomics analysis using mass spectrometry
4.5.11 Peptide identification and protein quantification
4.5.12 Mutational signature analysis
4.5.13 Prioritization of driver genes using HIT’nDRIVE
4.5.14 Consensus clustering
4.5.15 Protein attenuation analysis
4.5.16 Pathway enrichment analysis
4.5.17 Stromal and immune score
4.5.18 Enumeration of tissue-resident immune cell types using mRNA expression profiles
4.5.19 External datasets
5 Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks
5.1 Introduction
5.2 Literature Review
5.3 Our Contributions
5.4 Algorithmic Framework of cd-CAP
5.4.1 Combinatorial Optimization Formulation
5.4.2 Algorithmic Framework for solving MCSC
5.5 Results
5.5.1 Dataset Used
5.5.2 Maximal Colored Subnetworks Across Cancer Types
5.5.3 Maximal Colorful Subnetworks Across Cancer Types
5.5.4 Multiple-Subnetwork Analysis Across Cancer Types
5.5.5 Empirical P-Value Estimates Confirm the Significance of cd-CAP Identified Networks
5.6 Discussion
5.7 Methods
5.7.1 Significance of the Identified Subnetworks
5.7.2 Pathway enrichment analysis
5.7.3 Association of sub-networks with patients’ survival outcome
6 Conclusion
6.1 Future Perspective
Bibliography
List of Tables
Table 5.1 Five subnetworks identified by cd-CAP in multi-subnetwork mode for each cancer
type: respective columns below depict the subnetwork size, depth, and the number of
nodes in the subnetwork with copy number amplification (AMP), expression increase
(EXP-UP) or decrease (EXP-DOWN).
List of Figures
Figure 2.1 Overview of HIT’nDRIVE algorithm
Figure 2.2 HIT’nDRIVE identified driver genes with respect to varying parameter values in 100 selected BRCA samples
Figure 2.3 HIT’nDRIVE identified driver genes with respect to underlying network used in 100 selected BRCA samples
Figure 2.4 Modified HIT’nDRIVE not required to prioritize at least one driver gene per patient
Figure 2.5 Likelihood of HIT’nDRIVE to capture CGC Genes
Figure 2.6 Correlation between the number of driver genes predicted by HIT’nDRIVE with mutation rate and copy-number burden
Figure 2.7 Phenotype classification using driver-seeded modules
Figure 2.8 Phenotype Classification using CGC Genes Seeded Modules
Figure 3.1 Summary of driver genes prioritized by HIT’nDRIVE
Figure 3.2 Network properties of driver genes
Figure 3.3 BRCA subtype classification using driver modules
Figure 3.4 Drug efficacy predicted by HIT’nDRIVE seeded driver genes
Figure 4.1 Landscape of somatic mutations in PeM tumors
Figure 4.2 Landscape of copy number aberrations in PeM tumors
Figure 4.3 Gene fusions in PeM
Figure 4.4 Transcriptome and proteome profile of PeM
Figure 4.5 Immune cell infiltration in PeM tumors
Figure 5.1 Schematic overview of cd-CAP
Figure 5.2 Conserved colored subnetworks
Figure 5.3 Colorful maximal subnetworks
Figure 5.4 Multiple subnetwork analysis
Figure 5.5 Empirical p-value estimates for the maximum size subnetworks identified by cd-CAP
Glossary
AGC Automatic Gain Control
BAM Binary Alignment Map
BMR Background Mutation Rate
BRCA Breast adenocarcinoma
C-INDEX Concordance-index
CGC Cancer Gene Census
CNA Copy Number Aberration
COSMIC Catalogue of Somatic Mutations in Cancer
CRS Cytoreductive surgery
DEG Differentially expressed genes
DGIDB Drug-Gene interaction database
DNA Deoxyribonucleic acid
EQED eQTL Electrical Diagrams
EQTL Expression Quantitative Trait Loci
FL Facility Location
FDR False Discovery Rate
FFPE Formalin-Fixed Paraffin-Embedded
GBM Glioblastoma multiforme
HIPEC Hyperthermic intraperitoneal chemotherapy
HR Hazard Ratio
HT Hitting Time
IGV Integrative Genomics Viewer
IHC Immunohistochemical
ILP Integer Linear Programming
INDEL Insertion and Deletion
KNN k-nearest neighbour
LP Linear Programming
MFPT Mean First Passage Time
NIPEC Normothermic intraperitoneal chemotherapy
OV Ovarian serous cystadenocarcinoma
PCR Polymerase Chain Reaction
PM Pleural Mesothelioma
PRAD Prostate adenocarcinoma
PSM Peptide Spectral Matches
QPCR Quantitative Polymerase Chain Reaction
RECOMB Research in Computational Molecular Biology
RNA Ribonucleic Acid
RT-PCR Reverse Transcription PCR
RWFL Random Walk Facility Location Problem
RWR Random Walk with Restart
SNV Single Nucleotide Variation
SV Structural Variation
TCGA The Cancer Genome Atlas
TMA Tissue microarray
TMZ Temozolomide
TNBC Triple-Negative Breast Cancer
VCF Variant Call Format
WMSC Weighted Multi-Set Cover
Acknowledgments
This research was supported in part by the CIHR Bioinformatics Training Program, Prostate Cancer
Foundation - British Columbia (PCF-BC) Research Awards, and Mitacs Accelerate PhD Fellowship.
My deepest gratitude goes to my PhD supervisors, Dr. Colin Collins and Dr. Cenk Sahinalp, for
their endless support, encouragement, and guidance throughout my graduate studies. It was their
commitment, ideas, and constructive criticism that helped shape my research and enabled me to publish my
papers. Under their guidance I’ve had several opportunities to hone my technical and scientific skills in
computer science and biology. From them I have learned not only how to be a good researcher but also
many lessons in life beyond the confines of a laboratory.
I would like to thank my thesis committee members, Dr. Yuzhuo Wang, Dr. Artem Cherkasov,
and Dr. Wan Lam, for providing me with guidance whenever I needed it. Special thanks to Dr. Josh Stuart for
agreeing to read and evaluate my thesis as an external examiner. I would also like to thank Dr. David G.
Huntsman and Dr. Leonard Foster for agreeing to read and evaluate my thesis as university examiners.
I have been incredibly blessed to be part of the Vancouver Prostate Centre family. My most sincere
gratitude goes to my colleagues Mr. Ermin Hodzic, Mr. Kaiyuan Zhu, and Dr. Noushin Nabavi, with whom
I have had the opportunity to collaborate closely on my different research projects. I would like to thank Dr.
Anna Lapuk and Mr. Kendric Wang, who mentored me during the initial phase of my PhD. I offer my regards
to Dr. Mads Daugaard, Dr. Faraz Hach, Dr. Stanislav Volik, Dr. Stephane Le Bihan, and Dr. Alex Wyatt
for their invaluable guidance.
I would like to thank my collaborators Dr. Gholamreza Haffari, Dr. Andrew Churg, Dr. Thomas
Sauerwald, Dr. Phoung Dao, Dr. Fabio Vandin, and Dr. Kuoyuan Cheng. I would also like to thank
my present and former colleagues Dr. Yen-Yi Lin, Dr. Fan Mo, Dr. Dong Lin, Dr. Nilgun Donmez, Dr.
Ibrahim Numanagic, Mr. Shawn Anderson, Mr. Hans H. Adomat, Mr. Robert Shukin, Mr. Robert H.
Bell, Mr. Brian McConeghy, Ms. Anne Haegert, Ms. Sonal Brahmbhatt, Mr. Jake Yeung, Mr. Salem
Malikic, Mr. Alex Gawronski, Mr. Ehsan Haghshenas, Mr. Mike Ford, Mr. Varune Ramnarine, Mr.
Hossein Sharifi-Noghabi, and Mr. Hossein Asghari. I benefited greatly from these collaborations, and
hope to continue working with them.
Last, but not least, I heartily thank my family for the strong motivation that they gave me to follow
my studies. Their support was invaluable to me.
Chapter 1
Introduction
1.1 Cancer driver genes
Cancer is a major cause of death across the globe and remains a growing challenge to health-care systems.
Cancer is characterized by uncontrolled division (malignant growth) of abnormal cells (tumors) in the
body. All cancers arise due to somatically acquired changes in the Deoxyribonucleic acid (DNA), Ribonucleic
Acid (RNA), or protein sequences of the cancer cells.
Cancer is a complex disease caused by a combination of different genetic changes. These genetic
alterations include, but are not limited to, Single Nucleotide Variation (SNV), Insertion and Deletion (INDEL),
Copy Number Aberration (CNA), Structural Variation (SV), gene fusions, changes in the amino-acid
sequence of a protein, DNA methylation, and changes in gene and protein expression levels.
Combinations of these genetic alterations dysregulate different oncogenic or tumor-suppressive signaling
pathways, thus promoting cancer growth. Furthermore, cross-talk between different signaling pathways is
inevitable but is often poorly understood [72].
Cancer is an evolutionary disease. The genetic changes that occur in initiating cells (clones) undergo
intense evolutionary selection during disease progression and can be widely altered during treatment.
Cancers evolve by a reiterative process of clonal selection, clonal expansion, and genetic diversification
within the adaptive landscapes of tissue ecosystems [67]. The cancer cell evolutionary process may lead
to sub-clonal divergence, resulting in genetic and molecular heterogeneity.
During tumor progression, cancer cells accumulate a multitude of genomic alterations; however, most
are inconsequential “passenger” alterations that are effectively neutral. Nevertheless, a small fraction
provide mission-critical “hallmark” functions and are known as “driver” alterations that modify
transcriptional programs and therefore drive and sustain tumor progression [69, 161, 195]. Driver alterations
are evolutionarily advantageous for tumor development. They are causally implicated in oncogenesis,
and some even trigger cancer progression or resistance to therapy. Driver alterations are
positively selected during the evolution of the cancer. Improving our knowledge of driver alterations, possibly
through an integrative analysis of various omics data, is critical to better understand cancer mechanisms
and select appropriate therapies for specific cancer patients.
1.2 Computational methods for the prediction of cancer driver genes
Computational methods for the prediction of cancer driver genes can be broadly grouped into three
different strategic approaches: (a) identification of recurrent somatic alterations, (b) prediction of the
functional impact of somatic alterations, and (c) pathway and interaction-network based approaches.
1.2.1 Identification of recurrent somatic alterations
In early cancer genomics studies, driver mutations were identified as alterations that
appeared more frequently across the patient cohort than expected by random chance. These driver
mutations were thought to drive the cancer phenotype and provide a selective advantage for the clonal expansion
of their lineage.
Recurrent Somatic Mutation
Several popular computational tools such as MutSigCV [96], MuSiC [46], and others [77, 159, 209]
have been developed based on this strategy. These methods assess the recurrence frequency of SNVs
relative to the Background Mutation Rate (BMR) in a population of tumors [68, 195, 209]. The
BMR is the probability of observing a passenger mutation at a specific location of the genome. The
main differences between the tools mentioned above lie in how they estimate the BMR and how many
mutational contexts they consider. The BMR is not constant across the genome but depends on the genomic
context. The BMR estimate greatly affects the identification of recurrent mutations. If the estimated BMR is
lower than the true value, it leads to false positives, whereas if it is higher than the true value,
some truly recurrent mutations are missed.
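The sensitivity to the BMR estimate can be illustrated with a deliberately simplified model. The sketch below assumes a single uniform per-base, per-tumor BMR and independent mutations, so that the mutation count in a gene follows a binomial distribution; this is an illustrative assumption and not the model used by MutSigCV, MuSiC, or the other cited tools, which account for genomic context. The function names and the example numbers (gene length, cohort size, mutation count) are hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    tail = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        term = math.exp(log_term)
        tail += term
        # Past the mode the terms only shrink; stop once they are negligible.
        if i > n * p and term < tail * 1e-15:
            break
    return min(tail, 1.0)


def recurrence_p_value(observed, gene_length, n_tumors, bmr):
    """Chance of seeing at least `observed` passenger mutations in one gene.

    Simplifying assumption: every base in every tumor mutates independently
    with the same probability `bmr`.
    """
    return binomial_upper_tail(observed, gene_length * n_tumors, bmr)


if __name__ == "__main__":
    # Hypothetical gene: 1,500 bases, 500 tumors, 12 observed mutations.
    for label, bmr in [("underestimated", 2e-6), ("assumed true", 1e-5),
                       ("overestimated", 3e-5)]:
        p = recurrence_p_value(12, 1500, 500, bmr)
        print(f"BMR {label:>14} ({bmr:.0e}): p = {p:.3g}")
```

With these hypothetical numbers, the same 12 mutations look highly significant under the underestimated BMR but unremarkable under the overestimated one, which mirrors the false-positive and missed-mutation cases described above.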
Recurrent copy-number alterations (CNAs)
The identification of recurrent CNAs in tumors presents a different set of challenges. Unlike SNVs, CNAs
affect more than one gene. Somatic CNAs show a large variation in their position and length across
different tumors. For example, an oncogene can be amplified in one tumor because of a whole-chromosome
duplication, whereas in other tumors the amplification may be focal, with an amplified locus that
contains the oncogene. These issues make the identification of somatic driver CNAs more
challenging. Thus the computational methods developed to study such problems take a non-parametric
approach. Early approaches to identify recurrent somatic driver CNAs relied on the identification of shared
regions of CNAs across the tumor cohort. The statistical significance of such overlaps was assessed by
fixing the length of the alterations but independently permuting their positions in the tumor population. More
recent approaches such as GISTIC2 [115], CMDS [212], JISTIC [147], DiNAMIC [196], and ADMIRE
[188] use more sophisticated models to assess the statistical significance of overlapping CNAs of different
lengths.
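The early permutation strategy described above (fix each alteration's length, permute its position) can be sketched as follows. This is a simplified illustration and not the procedure of any of the cited tools: it assumes one altered segment per tumor on a single chromosome divided into bins, and the function names, the test statistic (largest number of tumors sharing one bin), and the toy cohort are hypothetical.

```python
import random

def max_overlap(segments, genome_length):
    """Largest number of tumors whose altered segment covers one position."""
    coverage = [0] * genome_length
    for start, length in segments:
        for pos in range(start, start + length):
            coverage[pos] += 1
    return max(coverage)

def permutation_p_value(segments, genome_length, n_perm=2000, seed=0):
    """Fix each segment's length, permute its position, compare max overlap."""
    rng = random.Random(seed)
    observed = max_overlap(segments, genome_length)
    hits = 0
    for _ in range(n_perm):
        shuffled = [
            (rng.randrange(0, genome_length - length + 1), length)
            for _, length in segments
        ]
        if max_overlap(shuffled, genome_length) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical cohort: one segment per tumor as (start_bin, length_in_bins)
# on a chromosome of 1,000 bins; all eight tumors share bins 500 to 504.
cohort = [(500 - 3 * i, 5 + 3 * i) for i in range(8)]
p = permutation_p_value(cohort, genome_length=1000)
print(max_overlap(cohort, 1000), p)
```

Because segments of very different lengths all cover the same few bins, random placement almost never reproduces the observed overlap, and the shared region is flagged as recurrent.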
Frequency-based approaches are best suited to studying driver genes that are frequently altered across
the tumor population. However, less frequently altered genes dominate and vastly outnumber frequently
altered genes [200]. Recent whole-genome studies have revealed that important genes may be recurrently
altered in only a small fraction of the tumor cohort under study, and can be subtype-specific [128, 170].
Furthermore, in the context of tumor evolution, personalized rare driver genes are likely to arise during
advanced stages and may be isolated to a small fraction of tumor cells [49, 67]. Such rare or personalized
driver alterations may be functionally important and are likely to be missed by the frequency-based
approach.
1.2.2 Prediction of functional impact of somatic alterations
Another approach for distinguishing driver alterations from passenger alterations is to predict the func-
tional impact of a mutation using additional biological information about the sequence and/or structure
of the protein encoded by the mutated gene. These methods are applied to the non-silent SNVs that result
in changes in the amino-acid sequence of the corresponding protein. Several methods have been
developed to predict the effect of SNVs. ANNOVAR [197] provides annotation of transcript variants. FunSeq
[86] includes additional annotation of non-coding elements and regulatory features. MutationAssessor
[139] combines protein domain information with an evolutionary conservation model to identify the functional
impact of somatic mutations. Furthermore, CHASM [29], TransFIC [65], and OncodriveFM [64] use
machine-learning algorithms trained on known cancer mutations to highlight potential driver mutations.
ActiveDriver [137] predicts effects that are related to protein aggregation, protein stability, and alterations
of residues targeted by post-translational modification. Other popular methods to assess the effect of
SNVs on protein function include Condel [63], SIFT [157], and PolyPhen [2].
1.2.3 Pathway and interaction-network based approaches
Genes and their protein products act on different hierarchies of biochemical organization. Moreover,
genes/proteins do not act in isolation; rather, they act together with other genes in signaling, regulatory,
or metabolic pathways, collectively known as the interactome. Examination of the collection of identified
somatic alterations in the interactome can lead to a better understanding of cancer progression. However,
the complex nature of the interactome is a confounding factor for the identification of driver genes and
their corresponding signaling interaction network.
Many computational methods have been developed to assess signaling networks or pathways
perturbed by somatic mutations in cancer. Perhaps the first computational method to consider large scale
genomic variants as driver events is CONEXIC [4]. It correlates genes with highly recurrent CNAs
with variation in gene expression profiles within a Bayesian network. CONEXIC uses a score-guided
search to identify the combinations of driver CNAs that best explain the patterns of gene-expression
modules in the tumor phenotype. Similarly, with no prior knowledge of pathways or protein
interactions, MOCA correlates gene mutation information with expression profile changes in other genes [112].
NetBox [32] uses the shortest-path approach to connect the somatically altered genes in an interaction
network and then identifies statistically significant connected modules containing potential driver genes.
The method by Suo et al. [165] prioritizes highly mutated genes that interact with a large number of
differentially expressed genes in a gene network. MEMo [37] identifies sets of proximally located genes
from interaction networks that are also recurrently altered and exhibit patterns of mutual exclusivity
across the patient population. MEMo first defines modules of highly connected nodes in the network
and then assesses whether these network modules show mutually exclusive mutations. The RME algorithm [117]
identifies modules with exclusive patterns of mutations using an information-theoretic measure to test
for the significance of the observed exclusivity. RME starts from scores that measure the exclusivity of
pairs of genes, and includes only genes mutated with relatively high frequency, limiting its effectiveness
in identifying rare driver mutations. Another approach, (Multi) Dendrix, aims to simultaneously identify
multiple driver pathways, assuming mutual exclusivity of mutated genes among patients, using either a
Markov chain Monte Carlo algorithm [190] or Integer Linear Programming (ILP) [101]. XSEQ [48] uses a
probabilistic model to compute the influence of mutated genes on expression profile changes in other genes
by considering direct gene interactions. Two other methods, PARADIGM [192] and PARADIGM-SHIFT
[125], use Bayesian networks to integrate genomic and transcriptomic data to infer pathways altered in a
patient.
Network propagation based methods
In recent years, network propagation based methods have been used extensively for the identification of
disease-associated genes. The main assumption behind network propagation methods is that genes
that belong to the same phenotype interact with each other, resulting in an amplified biological signal [41].
This group of methods aims to identify the genes that are in close proximity to the known disease genes
in the signaling interaction network. The prior knowledge or experimental measurements obtained from
the genomic, transcriptomic, proteomic, or epigenomic profile of one or more individuals are superimposed on
the network. The signal from the “source” node is then propagated to a distant “target” node through the
edges in the global interaction network. Instead of finding a single path connecting the source to the
target, network propagation methods compute the fraction of the “flow” (originating from the source
node) passing through each of the intermediate nodes/edges to the target node. The fraction of the flow
imitates the probability of using the path in the information propagation process. The network propagation
approach gives us the ability to incorporate multiple data types (such as mutations, genomic aberrations,
gene expression, confidence level of interactions, and functional associations of genes) into probabilistic
network models [35]. Due to its power to predict distant interactions, network propagation
is used in many different disciplines including computer science, engineering, physics, and biology. In
biology, network propagation has been used in the context of gene function prediction, gene-module
discovery, disease gene discovery, disease subtyping, and drug target prediction. Below, I elaborate
further on network propagation based methods built to identify disease genes or cancer driver genes.
The current flow approach is one of the ways to model network propagation. The current flow approach
models the flow of current in an electronic circuit, where each edge has an associated resistance. It is
based on the well-known analogy between random walks (discussed below) and electronic networks,
where the amount of current entering a node or an edge in the network is proportional to the expected
number of random walk visits to that node or edge. eQTL Electrical Diagrams (EQED) [166] integrates
Expression Quantitative Trait Loci (EQTL) analysis with a molecular interaction network using the circuit
network model. To the best of our knowledge, NetQTL [87] is the first method to link CNAs to expression
profile changes within an interaction network and connect specific “causal” aberrant genes with potential
targets in the interaction network. Its authors formulated a Weighted Multi-Set Cover (WMSC) problem and
provided a greedy solution to identify the set of causal genes.
Another network propagation approach is to use a random walk (also known as fluid diffusion, diffusion
kernel, or graph kernel). A random walk, as its name indicates, propagates randomly, starting from
a known disease gene (i.e. a seed gene) and moving to its neighbouring genes with equal probability or with
a given prior probability. This iterative process is halted after a certain number of steps. In order
to capture the local neighbourhood of the disease gene, a variant of this process known as Random Walk
with Restart (RWR) is used as an alternative to the halting process. In RWR, a restart parameter sets
the probability with which the random walker returns to the seed nodes at each step of propagation. In this way,
we can identify genes/proteins interacting with the disease gene, as they are the nodes most often visited
during the random-walk simulation. This approach helps to prioritize genes/proteins and interactions on
the basis of their potential involvement in a particular disease. The network propagation algorithm was first
described by Kondor et al. [92]. Network propagation algorithms have been used to analyze friendship
networks, where edges represent similarity or affinity. Network propagation is also the basis of Google’s
original PageRank algorithm [23].
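The Random Walk with Restart idea described above can be sketched with a short power iteration. This is a minimal illustration, not the implementation of any cited method: the function name, the restart probability of 0.3, and the toy network are hypothetical, and the network is assumed to be undirected with no isolated nodes.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adjacency: dict mapping each node to a list of its neighbours (undirected).
    restart:   probability of jumping back to a seed at each step (assumed value).
    """
    nodes = list(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed ...
        nxt = {n: restart * p0[n] for n in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = (1.0 - restart) * p[n] / len(adjacency[n])
            for nb in adjacency[n]:
                nxt[nb] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Hypothetical toy network: a path A-B-C-D plus a branch B-E.
network = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}
scores = random_walk_with_restart(network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nodes close to the seed accumulate the highest visiting probability, so sorting by score gives a proximity-based prioritization of candidate genes.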
Tu et al. [183] used a random walk approach on a molecular interaction network to associate causal
genes and pathways explaining a given association and applied the method to data obtained from
yeast knockout experiments. The method by Kohler et al. [91] and PRINCE [191] use variants of the
random walk algorithm to prioritize disease-associated genes/proteins. Kohler et al. [91] demonstrated that
random walk analysis of interaction networks outperforms local network-based methods, such as shortest
path distances and direct interactions. Yeger-Lotem et al. developed ResponseNet [204], which
was later expanded by Lan et al. [94]. ResponseNet uses a network algorithm to relate genetic perturbations
to the transcriptomic response in a yeast model, thereby identifying sub-networks of regulators mediating the
interactions. ResponseNet formulates a minimum-cost flow optimization problem that aims to
maximize the flow between the source and target while minimizing the cost of the connecting paths. Thus,
by setting the cost of an edge to the negative log of its probability, a high-probability connecting
sub-network is obtained. The authors provided a Linear Programming (LP) formulation to solve the optimization
problem. HotNet [189] was the first method to use a network propagation (fluid diffusion) approach
[136] to compute a pairwise influence measure between the genes in the (gene interaction) network and
identify sub-networks enriched with mutations. HotNet then derives a two-stage multiple hypothesis test
to reduce the False Discovery Rate (FDR) in sub-network discovery. Another method, TieDIE [130],
extends the heat diffusion strategy of HotNet by leveraging two different types of genomic inputs:
mutated genes and transcription factors. TieDIE identifies a collection of pathways and sub-networks that
associate a fixed set of driver genes with expression profile changes.
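The negative-log edge cost mentioned for ResponseNet has a simple consequence that a small example makes visible: summing costs along a path corresponds to multiplying edge probabilities, so the cheapest path is the most probable one. The sketch below uses Dijkstra's algorithm on a single source-target pair rather than ResponseNet's minimum-cost flow LP, which it does not attempt to reproduce; the function name and the interaction probabilities are hypothetical.

```python
import heapq
import math

def most_probable_path(edges, source, target):
    """Dijkstra on costs -log(p): the cheapest path has the highest probability.

    edges: dict mapping (u, v) to an interaction probability in (0, 1];
           each edge is treated as undirected.
    """
    graph = {}
    for (u, v), prob in edges.items():
        cost = -math.log(prob)
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])

# Hypothetical confidence-weighted interactions.
interactions = {
    ("S", "A"): 0.9, ("A", "T"): 0.9,   # two reliable steps: 0.9 * 0.9 = 0.81
    ("S", "T"): 0.5,                    # one unreliable step: 0.50
}
path, probability = most_probable_path(interactions, "S", "T")
print(path, probability)
```

Here the two-step route through A is preferred over the direct edge because its combined probability is higher, even though it uses more edges.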
Another method, DriverNet [14], aims to correlate genomic alterations with expression profile changes
of target genes, but only among direct interaction partners. The novel feature of DriverNet is that it aims
to find the “minimum” number of potential drivers that can “cover” targets. DriverNet provided a greedy
approximation algorithm to solve the optimization problem.
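The covering idea behind such greedy approaches can be illustrated with a generic greedy set cover. This is a textbook sketch and not DriverNet's actual implementation: the function name, the driver labels D1 to D4, and the event labels e1 to e5 are hypothetical.

```python
def greedy_driver_cover(covers, targets):
    """Greedy set cover: repeatedly pick the candidate that explains the most
    still-unexplained targets.

    covers:  dict mapping a candidate driver to the set of expression events
             it is directly connected to.
    targets: iterable of expression events that should be explained.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(covers, key=lambda g: len(covers[g] & uncovered))
        gained = covers[best] & uncovered
        if not gained:  # the remaining targets cannot be covered by anyone
            break
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

# Hypothetical candidates and the expression events each one touches.
covers = {
    "D1": {"e1", "e2", "e3"},
    "D2": {"e3", "e4"},
    "D3": {"e4", "e5"},
    "D4": {"e2"},
}
chosen, uncovered = greedy_driver_cover(covers, {"e1", "e2", "e3", "e4", "e5"})
print(chosen, uncovered)
```

The greedy rule does not guarantee the true minimum in general, but it is fast and gives the standard logarithmic approximation for set cover.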
Hitting Time (HT), or First Passage Time, is an alternative approach for estimating node influence
in the (gene interaction) network using network propagation. HT on a network is simply the expected
number of steps (hops) that a random walk starting at a source node takes to reach a target node for the
first time. Since HT relies on the global topology of the network, there are many possible paths that
connect the source node to the target node given the sparseness of biological networks. However, we
cannot be certain about the probability of reaching a target node given the number of steps or which path
is the most probable. For this reason, measuring the average hitting time (or Mean First Passage Time
(MFPT)) is more reasonable for pairwise influence calculations [152]. To calculate the average HT, random
walk simulation can be utilized, where the transition probabilities between nodes may be equal or
pre-defined.
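The simulation-based estimate of average hitting time mentioned above can be sketched as follows. This is a minimal Monte Carlo illustration with equal transition probabilities; the function name, the number of walks, and the star-shaped toy network are hypothetical.

```python
import random

def simulated_hitting_time(adjacency, source, target, n_walks=20000, seed=0):
    """Average number of hops a uniform random walk needs to first reach target."""
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(n_walks):
        node, steps = source, 0
        while node != target:
            node = rng.choice(adjacency[node])  # equal transition probabilities
            steps += 1
        total_steps += steps
    return total_steps / n_walks

# Hypothetical star network: hub B connected to leaves A, C, and D.
star = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
leaf_to_hub = simulated_hitting_time(star, "A", "B")
hub_to_leaf = simulated_hitting_time(star, "B", "A")
print(leaf_to_hub, hub_to_leaf)
```

The example also shows that hitting time is not symmetric: a leaf reaches the hub in exactly one hop, whereas the walk from the hub to a particular leaf takes five hops on average in this network.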
Liben-Nowell et al. [106] were the first to use HT for the link-prediction problem on social
networks. Average HT has been previously used for analyzing state transition graphs in probabilistic
Boolean networks to identify gene perturbations that quickly lead to a desired state of the system [153].
Yao et al. [203] estimated the closeness of a candidate gene to a disease of interest by computing the
HT of a random walk that starts at the corresponding disease phenotype and ends at the candidate.
Condamin et al. [39] developed a method for computing exact hitting times in a complex network, depending
on fractal dimension (i.e. density of nodes) and random walk dimension (i.e. source-target distance in the
network). Torchala et al. [181] extended this method using Hill’s algorithm, which makes use of transition
probabilities between the nodes. They also demonstrated that Hill’s algorithm is an efficient method to
calculate average HT in a network. This was later implemented in C++ as RaTrav [182].
1.3 Contributions
In this thesis, we focus on computational problems involving the identification of cancer driver genes and
their application to guide precision oncology. Our goal here is to design efficient network propagation based
computational algorithms for cancer driver gene prioritization that integrate multi-omics cancer datasets.
More specifically, we present the following contributions:
• We introduce HIT’nDRIVE [154, 155], a combinatorial algorithm that measures the potential im-
pact of genomic aberrations on changes in the global expression of other genes/proteins which
are in close proximity in a gene/protein-interaction network. HIT’nDRIVE then prioritizes those
aberrations with the highest impact as cancer driver genes. HIT’nDRIVE formulates the driver
prioritization problem as a “random-walk facility location” (RWFL) problem, which differs from
the standard facility location problem by its use of “hitting time”, the expected number of hops
to reach a “target” gene from a “source” gene, as a distance measure in an interaction network.
HIT’nDRIVE uses “inverse” hitting time as a measure of influence of a source gene over a tar-
get gene to identify the subset of sequencewise altered/source genes whose overall influence over
expression altered/target genes is maximum possible.
• Using multi-omics data from different cancer types, we identified both known as well as rare (and
potentially novel) patient-specific driver genes. We also demonstrate that by using HIT’nDRIVE-
identified driver genes and associated “network modules” (sub-networks seeded by driver genes
whose aggregate expression profiles correlate well with the cancer phenotype) as features, it is
possible to perform accurate phenotype classification. In fact, we found a number of breast cancer
subtype-specific driver modules that are associated with patients’ survival outcome. Finally, we
demonstrate that HIT’nDRIVE-identified driver genes accurately predict drug efficacy in pan-
cancer cell lines.
• We present a first-in-field comprehensive integrative multi-omics analysis of a patient cohort of
treatment-naive peritoneal mesothelioma (PeM) [156]. In a novel contribution, using HIT’nDRIVE,
we identified PeM with BAP1 loss to form a distinct molecular subtype characterized by distinct
gene expression patterns of chromatin remodeling, DNA repair pathways, and immune checkpoint
receptor activation. We also demonstrate that this subtype is correlated with inflammatory tumor
microenvironment and thus a candidate for immune checkpoint blockade therapies. Our findings
reveal BAP1 to be a trackable prognostic and predictive biomarker for PeM immunotherapy that
refines PeM disease classification. This is significant because almost half of PeM cases are now
candidates for these therapies. BAP1 stratification may improve drug response rates in ongoing
phase-I and II clinical trials exploring the use of immune checkpoint blockade therapies in PeM
in which BAP1 status is not considered. This integrated molecular characterization provides a
comprehensive foundation for improved management of a subset of PeM patients.
• Another novel and significant contribution is that we resolved the large discordance between
mRNA and protein expression patterns in the PeM cohort. Most of this discordance is attributed to
chromatin remodeling genes and proteins linked to multimeric protein complexes, the majority of
which are direct protein-interaction partners of BAP1. The discordance between the mRNA and
the protein expression patterns is most likely due to the ubiquitination and degradation of proteins
in these BAP1-regulated complexes to maintain functional stoichiometry.
• Lastly, we present a novel computational method, cd-CAP (combinatorial detection of Conserved
Alteration Patterns), that primarily uses an ILP formulation to identify subnetworks of an interac-
tion network, each with an alteration pattern conserved across (a large subset of) a tumor sample
cohort. cd-CAP simultaneously identifies more than one subnetwork, and each gene within each
subnetwork has labels specific to the alteration types it harbors. Notably, we demonstrate that
many of the largest highly conserved subnetworks within a tumor type solely consist of genes that
have been subject to copy number gain, typically located on the same chromosomal arm and thus
likely a result of a single, large scale copy number amplification. We have also demonstrated that
the subnetworks identified using cd-CAP are associated with patients’ survival outcome and hence
are clinically important.
In addition to our primary contributions to the driver gene identification problems mentioned above,
our other contributions to the field of Computational Biology and Cancer Genomics can be found in
[61, 109, 149, 198, 201, 202].
1.4 Organization of the thesis
The rest of the thesis is organized as follows:
• In Chapter 2, we introduce HIT’nDRIVE, a combinatorial algorithm to prioritize cancer driver
genes. Then we present our experimental results exploring the behaviour of HIT’nDRIVE.
• In Chapter 3, we present extensive analysis of multi-omics data from multiple cancer types using
HIT’nDRIVE. Here we identify cancer driver genes in multi-omics cancer dataset as mentioned
above and explore their network properties. Then we demonstrate application of HIT’nDRIVE
on cancer phenotype and subtype classification, and drug efficacy prediction to guide precision
oncology.
• Chapter 4 describes integrative multi-omics characterization of a patient cohort of a rare cancer,
peritoneal mesothelioma. Here we demonstrate application of HIT’nDRIVE, which helped us
define a novel molecular subtype of peritoneal mesothelioma. We predicted this subtype would
likely respond to immunotherapy.
• In Chapter 5, we introduce cd-CAP, a combinatorial algorithm to identify sub-networks with
conserved molecular alteration patterns across a large subset of a tumor sample cohort. Then we present
our experimental results analyzing multi-omics data from multiple cancer types using cd-CAP.
• Finally, in Chapter 6, we offer a summary and conclusion of our contributions to cancer driver
gene identification, as well as discussion of possible directions for future work.
Chapter 2
HIT’nDRIVE: an algorithm for cancer driver genes prioritization using hitting time
2.1 Introduction
Genomic and transcriptomic alterations are the major contributors to tumorigenesis and progression
of cancer. Over the past decade, high-throughput sequencing efforts have provided an unprecedented
opportunity to identify these alterations in cancer that can lead to changes in gene regulation, protein
structure, and function [161]. Genomic and transcriptomic data provide unique and complementary
information about a particular tumor, but the translation of “big” molecular data into insightful and
impactful patient outcomes is extraordinarily challenging [195]. As explained in Chapter 1, during tumor
progression, cancer cells accumulate a multitude of genomic alterations with most being inconsequential
“passenger” alterations that are effectively neutral. However, a small fraction provide mission-critical
“hallmark” functions and are known as “driver” alterations that modify transcriptional programs and
therefore drive and sustain tumor progression [69, 161, 195]. The knowledge of driver alterations is
foundational to guide selection of appropriate therapies. For this we need to better integrate different
omics data-types and distinguish critical driver events from others.
Among different strategies explained in Chapter 1, the ones based on mutual exclusivity still fo-
cus on frequent events. The others, based on “information flow” in gene/protein interaction networks,
do not aim to discover cancer drivers, but rather are designed to identify dysregulated sub-networks or
modules. In addition, the notion of influence they employ is based on stationary distribution of “informa-
tion” originating at a particular gene/protein. As a result, none of the available methods aim to identify
rare, patient-specific driver events, based on a time dependent notion of influence. Finally, none of the
available techniques aim to simultaneously consider different types of genomic alterations as potential
drivers.
2.2 Our Contributions
To address the above challenges, in this chapter, we introduce a novel combinatorial method, HIT’nDRIVE
[154, 155], which was first presented at the Research in Computational Molecular Biology (RECOMB)
conference. HIT’nDRIVE is a combinatorial algorithm that measures the potential impact of genomic
aberrations on changes in the global expression of other genes/proteins which are in close proximity
in a gene/protein-interaction network. HIT’nDRIVE then prioritizes those aberrations with the highest
impact as cancer driver genes. HIT’nDRIVE formulates the driver prioritization problem as a “random-
walk facility location” (RWFL) problem, which differs from the standard facility location problem by its
use of “hitting time”, the expected number of hops to reach a “target” gene from a “source” gene, as a
distance measure in an interaction network. HIT’nDRIVE uses “inverse” hitting time as a measure of
influence of a source gene over a target gene to identify the subset of sequencewise altered/source genes
whose overall influence over expression altered/target genes is maximum possible.
Since the RWFL problem is NP-hard, we estimate the multi-hitting time based on the independent hitting
times of the drivers to an expression outlier, which provides an upper bound on the multi-hitting time.
Our experiments show that this estimate works well for the human protein interaction network. More
importantly, our estimate enables us to reduce the RWFL problem to a Weighted Multi-Set Cover (WMSC)
problem, for which we give an ILP formulation.
2.3 HIT’nDRIVE Algorithmic Framework
HIT’nDRIVE links alterations at the genomic level to changes at the transcriptome level using a gene/protein
interaction network. For that, it aims to find the smallest set of altered genes that can explain most of
the observed transcriptional changes in the cohort. In other words, HIT’nDRIVE identifies the minimum
number of potential drivers which can cause a user-defined proportion of the downstream expression
effects observed. We formulate this as a Random Walk Facility Location (RWFL) problem, a
combinatorial optimization problem that we introduce here. RWFL generalizes the classical Facility
Location (FL) problem by changing the notion of distance it uses. Given a network, the FL problem defines
the distance between a potential driver gene and an outlier gene as the length of the shortest path between
them. The RWFL problem, in contrast, uses “hitting time” [39, 106], the expected length of a random
walk between the two nodes, as their distance. Under the use of hitting time, the FL problem completely
changes nature: in the classical FL formulation the goal is to associate each outlier gene in the network
with exactly one (the closest) driver gene. In the RWFL formulation, each outlier gene is associated with
multiple drivers (whose collective distance to the outlier will no longer be the shortest pairwise distance),
forming a many-to-many relation. Intuitively, hitting time measures how accessible a particular outlier
gene is from potential drivers. Thus the RWFL problem asks to find the smallest set of sequence-altered
genes from which one can reach (a good proportion of) outliers within a user-defined “multi-hitting
time”, the expected length of the shortest random walk originating from any of the sequence-altered
genes and ending at an outlier.
In order to capture the uncertainty of interactions of genes with their neighbours, HIT’nDRIVE considers a
random walk process which propagates the effect of a sequence alteration in one gene to the remainder
of the genes through the network. The influence is defined to be the inverse of the hitting time,
which is the expected length (number of hops) of a random walk which starts at a given potential driver
gene and “hits” a given target gene for the first time in an interaction network. More specifically, for any
two nodes $u, v \in V$ of an undirected, connected graph $G = (V, E)$, let the random variable $\tau_{u,v}$ denote the
number of hops in a random walk starting from $u$ and visiting $v$ for the first time. Then the hitting time
$H_{u,v}$ is defined as $H_{u,v} = E[\tau_{u,v}]$ [104].
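For small networks, the hitting times to a fixed target can be computed exactly from the definition by first-step analysis: the hitting time from a node is one hop plus the average hitting time from its neighbours, and it is zero at the target itself. The sketch below solves the resulting linear system with plain Gauss-Jordan elimination. It is a standard textbook construction offered as an illustration; the function name and toy network are hypothetical, uniform transition probabilities and a connected undirected graph are assumed, and no claim is made that it matches the matrix inversion method used in this thesis in detail.

```python
def hitting_times_to(adjacency, target):
    """Solve h[u] = 1 + mean(h[w] for w in neighbours of u), with h[target] = 0."""
    nodes = [n for n in adjacency if n != target]
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    # Augmented matrix of (I - Q) h = 1, where Q is the transition matrix
    # restricted to the non-target nodes.
    rows = [[0.0] * (size + 1) for _ in range(size)]
    for u in nodes:
        i = index[u]
        rows[i][i] += 1.0
        rows[i][size] = 1.0
        for w in adjacency[u]:
            if w != target:
                rows[i][index[w]] -= 1.0 / len(adjacency[u])
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [value / scale for value in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0.0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    result = {target: 0.0}
    for u in nodes:
        result[u] = rows[index[u]][size]
    return result

# Hypothetical star network: hub B connected to leaves A, C, and D.
star = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
to_a = hitting_times_to(star, "A")
to_b = hitting_times_to(star, "B")
print(to_a, to_b)
```

In this toy network the hub reaches leaf A in five expected hops and the other leaves in six, while every leaf reaches the hub in exactly one hop, so the inverse hitting time gives the hub-to-leaf direction a lower influence than the leaf-to-hub direction.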
In order to capture synthetic lethality like scenarios, HIT’nDRIVE considers multiple aberrant genes
as potential drivers. For that, we define the influence value (of a set of potential driver genes on a target) as
the inverse of the multi-hitting time. More specifically, let $U \subseteq V$ be a subset of nodes of $G$ and $v \in (V - U)$
be a single node. We thus define the multi(source)-hitting time $H_{U,v}$ as $H_{U,v} = E[\min_{u \in U} \tau_{u,v}]$.
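One elementary consequence of this definition is worth writing out. It is given here as a worked step and is not the specific estimate of the multi-hitting time used later in this chapter: because the minimum over the set is never larger than any single member, the multi-hitting time is bounded above by the smallest pairwise hitting time.

```latex
\[
\min_{u \in U} \tau_{u,v} \le \tau_{u',v} \quad \text{for every } u' \in U
\;\;\Longrightarrow\;\;
H_{U,v} = E\!\left[\min_{u \in U} \tau_{u,v}\right]
\le \min_{u \in U} E[\tau_{u,v}] = \min_{u \in U} H_{u,v}.
\]
```

Equivalently, the joint influence $1 / H_{U,v}$ of a set of genes on a target is at least as large as the influence of its single most influential member.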
Now the RWFL problem for a single patient can be described as follows. Let $\mathcal{X}$ be a set of potential
driver genes and $\mathcal{Y}$ be a set of expression altered (outlier) genes. Then, for a user-defined $k$, HIT’nDRIVE
aims to return $k$ potential driver genes as the solution to the following optimization problem:
\[ \arg\min_{X \subseteq \mathcal{X},\, |X| = k} \; \max_{y \in \mathcal{Y}} H_{X,y} \]
where $H_{X,y}$ denotes the multi-hitting time from the gene set $X$ to the gene $y$.
Like the standard facility location problem, RWFL is NP-hard. In fact, even the problem of computing
the multi-hitting time between a set of nodes in a network and a particular target node is difficult. We
overcome this difficulty by introducing a good estimate of the multi-hitting time that helps us to reduce
the RWFL problem to the Weighted Multi-Set Cover (WMSC) problem, which we solve through an ILP formulation.
(Although the use of set-cover for representing the most parsimonious solution in a bioinformatics
context is not new [75], to the best of our knowledge this is the first use of the multi-set cover formulation for
maximum parsimony.) In this formulation, we use a slightly different objective: given a user-defined
upper bound on the maximum multi-hitting time, we now aim to minimize the number of potential drivers
that can “cover” (a user-defined proportion of) the outlier genes. For more than one patient, we minimize
the number of drivers that can “cover” (a user-defined proportion of) patient-specific outliers such that
each such outlier is covered by potential drivers that are aberrant in that patient.
2.3.1 Reformulation of RWFL as a Weighted Multi-Set Cover (WMSC) Problem
For simplicity, we first describe how HIT’nDRIVE works on single patient data. Given an interaction
network with $\mathcal{X}$ denoting the set of sequence-altered genes (through SNVs or SVs) and $\mathcal{Y}$ denoting
the set of expression-altered genes, HIT’nDRIVE computes the smallest subset of $\mathcal{X}$ whose joint
“influence” over (a user-defined fraction of) expression-altered genes is sufficiently high (i.e. above a
user-defined threshold). The influence of a set of (sequence-altered) genes $X$ over an expression-altered gene
$g$ is defined as $\frac{1}{MHT(X, g)}$, where $MHT(X, g)$ denotes the multi-hitting time, the expected length of the
shortest of the random walks originating at each one of the genes in $X$ and ending at $g$. Therefore, HIT’nDRIVE
aims to solve the RWFL problem in a network where $\mathcal{X}$ are the “potential facilities” and $\mathcal{Y}$ are the
“requests”.
Since RWFL is a computationally hard problem and cannot be solved in a reasonable amount of
time in its original formulation, we reduce the RWFL problem to the WMSC problem, for which we
give an ILP formulation. Intuitively, in this new formulation, HIT’nDRIVE associates the genomic
alterations with transcriptomic changes in the form of a bipartite graph $G_{bip}(\mathcal{X}, \mathcal{Y}, \mathcal{E})$ where $\mathcal{X}$ is the
set of aberrant genes, $\mathcal{Y}$ is the set of patient-specific expression-altered genes, and $\mathcal{E}$ is the set of edges.
If gene $x_i$ is mutated in a patient $p$, we set edges between $x_i$ and all of the expression-altered genes in
the same patient $(y_j, p)$, where the edges are weighted by the inverse pairwise hitting times $w_{ij} := H^{-1}_{x_i, y_j}$
(Figure 2.1A). The WMSC problem on this representation of data asks to find the smallest subset of $\mathcal{X}$
(as potential drivers) whose total influence (sum of pairwise influence values) over a user-defined fraction
of expression-altered genes (for each patient) is sufficiently high.
The reduction from RWFL problem to the WMSC problem is achieved by estimating the multi-
hitting time as a function of independent hitting times of the drivers to an outlier, which provides an
upper bound on the multi-hitting time. The exact individual hitting times are calculated by a matrix
inversion method [173]. The resulting WMSC problem can then be formulated as the ILP below, which
is efficiently solvable by CPLEX (within minutes) for all data sets we considered.
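\[
\begin{aligned}
\min_{x_1, \ldots, x_{|\mathcal{X}|}} \quad & \sum_i x_i \\
\text{s.t.} \quad & \forall i, j : \; x_i = e_{ij} \\
& \forall j : \; \sum_i e_{ij} w_{ij} \ge y_j \, \gamma \, \lambda_j \sum_i w_{ij} \\
& \sum_j y_j \ge \alpha |\mathcal{Y}| \\
& \forall p : \; \underset{\beta \text{ fraction of highest } \lambda_j}{\arg}(y_j) = 1 \\
& x_i, e_{ij}, y_j \in \{0, 1\}
\end{aligned}
\]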
The above ILP formulation for the WMSC problem introduces binary variables $x_i$, $y_j$, $e_{ij}$,
respectively, for each potential driver, expression-alteration event, and edge in the bipartite graph. The objective
of the ILP is to minimize the number of drivers (i.e. the sum of $x_i$ values) subject to four constraints. The
first constraint ensures that a selected driver contributes to the coverage of each of the expression
alteration events it is connected to (in each patient, if multiple patients are available). The second constraint
ensures that selected (patient-specific) driver genes contribute enough to cover at least a ($\gamma$) fraction of
the sum of all incoming edge weights to each expression alteration event. This constraint corresponds to
setting an upper bound on our estimate on the inverse of multi-hitting time of the selected (patient-specific)
drivers on an expression alteration event. The third constraint ensures that the selected driver genes
collectively cover at least an $\alpha$ fraction of the set of expression alteration events. And the fourth
constraint ensures that for each patient, the top $\beta$ fraction of expression-altered genes with highest weights
($\lambda_j$) are always covered.
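To show how the coverage constraints interact, the following sketch solves a tiny instance of a simplified version of this covering problem by brute-force enumeration. It is not the ILP solved with CPLEX in this work: it assumes a single patient, sets every $\lambda_j$ to 1, omits the $\beta$ constraint, and tries subsets in order of increasing size, which is only feasible for a handful of genes. The function name and all weights are hypothetical.

```python
from itertools import combinations
import math

def smallest_driver_set(weights, gamma, alpha):
    """Smallest set of drivers covering at least an alpha fraction of outliers.

    weights[i][j]: influence of candidate driver i on outlier j (absent = no edge).
    An outlier j counts as covered when the selected drivers supply at least a
    gamma fraction of the total influence arriving at j.
    """
    drivers = sorted(weights)
    outliers = sorted({j for i in drivers for j in weights[i]})
    total = {j: sum(weights[i].get(j, 0.0) for i in drivers) for j in outliers}
    needed = math.ceil(alpha * len(outliers))
    for size in range(len(drivers) + 1):
        for subset in combinations(drivers, size):
            covered = sum(
                1 for j in outliers
                if sum(weights[i].get(j, 0.0) for i in subset) >= gamma * total[j]
            )
            if covered >= needed:
                return set(subset)
    return None

# Hypothetical inverse hitting times between three altered genes and four outliers.
weights = {
    "x1": {"y1": 0.5, "y2": 0.4},
    "x2": {"y2": 0.1, "y3": 0.5},
    "x3": {"y1": 0.1, "y3": 0.1, "y4": 0.2},
}
partial = smallest_driver_set(weights, gamma=0.7, alpha=0.75)
full = smallest_driver_set(weights, gamma=0.7, alpha=1.0)
print(partial, full)
```

Relaxing the required fraction of covered outliers from all four to three of four lets the weakly connected gene x3 drop out of the solution, which mirrors the role of $\alpha$ in trading coverage against the number of reported drivers.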
As indicated above, our ILP formulation for WMSC problem can be generalized to multiple patients
with the objective of minimizing the total number of driver genes across all patients, subject to the
constraint that a user-defined proportion of outlier genes in each of the patients are covered by the subset
of drivers present in that patient.
In order to quantitatively assess the genes identified by HIT’nDRIVE, we extended our previously de-
veloped algorithm, OptDis [44], for de novo identification of modules of small size inside the interaction
network which contain (i.e. are seeded by) at least one predicted driver. The modules are chosen so that
their discriminative power (for phenotype classification) is the greatest among connected sub-networks
of similar size that contain the individual predicted drivers. In general, OptDis performs supervised di-
mensionality reduction on the set of connected sub-networks. It projects the high dimensional space of
all connected sub-networks to a user-specified lower dimensional space of sub-networks such that, in the
new space, samples belonging to the same class are closer together and samples from different classes are
farther apart (i.e. minimize in-class distance and maximize out-class distance) with respect
to a normalized distance measure (typically L1). Then we use module features (average expression of
genes in the module) for phenotype classification (Figure 2.1B-C). Using such module features, we hope
that the classifier in use does not overfit on rare drivers and is able to generalize the signal coming from
rare drivers to new patients. We report the classification accuracy based on the identified driver-seeded
modules as means of quantitative validation of our results (in the absence of ground truth). We also
look at the genes that build the chosen modules (of high classification accuracy) in attempt to identify
cancer-related pathways.
2.4 Results
We have implemented HIT’nDRIVE in C++ and solved the ILP using IBM CPLEX version 12.5.1. We
first tested the behaviour and robustness of HIT’nDRIVE under different settings of the parameters used
in the algorithm. These in silico experiments were performed using multi-omics data from four major
cancer types: Glioblastoma multiforme (GBM) [175], Ovarian serous cystadenocarcinoma (OV) [176],
Breast adenocarcinoma (BRCA) [177], and Prostate adenocarcinoma (PRAD) [178], obtained from The Cancer
Genome Atlas (TCGA) data portal. Here we describe the results exploring the behaviour of the HIT’nDRIVE
algorithm when used for the analysis of multi-omics cancer datasets. The biologically motivated results
obtained using HIT’nDRIVE are extensively discussed in Chapter 3.
2.4.1 HIT’nDRIVE parameters
HIT’nDRIVE uses three user-specified input parameters:
1. α: fraction of outliers to be covered overall (across all patients)
2. β : fraction of outliers to be covered in each patient
3. γ: fractional lower bound on the sum of the incoming edge weights from driver genes selected by
HIT’nDRIVE
HIT’nDRIVE is robust with respect to the changes in α and β but is somewhat sensitive to γ (Fig-
ure 2.2A-B), as expected. However, as γ grows, the driver genes identified by HIT’nDRIVE do not
change but simply grow in number by the addition of new driver genes, which indicates robustness of
our method with respect to γ as well.
2.4.2 HIT’nDRIVE: expression outlier stringency
The higher the stringency we apply to the expression value change in a potential outlier, the fewer
outliers we will identify, which in turn will result in fewer driver genes. However, the new
set of driver genes obtained are, in general, a subset of the first set of driver genes, again indicating
robustness (Figure 2.2C-D).
2.4.3 HIT’nDRIVE: random alterations and random expression outliers.
We compared the HIT’nDRIVE predictions of driver genes among observed mutations with those ob-
tained through randomized mutations (Figure 2.2E) and random outliers (Figure 2.2F). There is a stark
contrast between the two sets of driver gene predictions with respect to their overlap with the Cancer
Gene Census (CGC) [59] data set - conserved through different values of the γ parameter (the overlap is
generally preserved across various settings of the remaining two parameters, namely α and β ). Driver
genes predicted from the non-randomized alteration (or non-randomized outlier) data (i) included
a higher number of CGC genes (i.e. a larger number of true driver genes) than the driver
genes predicted from randomized alteration (or randomized outlier) data, and (ii) the number of
CGC driver genes predicted from non-randomized data increased quickly with increasing γ,
whereas it stayed roughly the same when randomized data were used. Note that while performing
randomization, the original gene labels (sequence-wise altered genes or expression-outlier genes)
were randomly replaced by new ones while preserving their recurrence frequency distributions.
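One way to realize such a label-swapping randomization is sketched below, under the assumption that each observed gene label is mapped to a distinct label drawn at random from a gene universe. Because the mapping is one-to-one, the number of patients carrying each (relabelled) gene is unchanged. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def randomize_gene_labels(patient_genes, universe, seed=0):
    """Swap gene labels via a random one-to-one mapping into `universe`.

    patient_genes: dict patient -> set of altered (or outlier) genes
    universe:      iterable of candidate labels, at least as many as observed genes
    """
    rng = random.Random(seed)
    observed = sorted({g for genes in patient_genes.values() for g in genes})
    # Sampling without replacement keeps the mapping injective.
    replacement = rng.sample(sorted(universe), len(observed))
    mapping = dict(zip(observed, replacement))
    return {p: {mapping[g] for g in genes} for p, genes in patient_genes.items()}

def recurrence_profile(patient_genes):
    """Sorted list of per-gene patient counts (the recurrence distribution)."""
    counts = Counter(g for genes in patient_genes.values() for g in genes)
    return sorted(counts.values())

toy = {"p1": {"TP53", "PTEN"}, "p2": {"TP53"}, "p3": {"TP53", "ERG"}}
universe = ["G%d" % k for k in range(50)]
shuffled = randomize_gene_labels(toy, universe, seed=42)
print(recurrence_profile(toy), recurrence_profile(shuffled))
```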
2.4.4 HIT’nDRIVE: network perturbation
We used the STRING v10 network for our analysis. The edges of the STRING v10 network were perturbed
to different extents (between 1-10%) while preserving the degree of the nodes in the network. HIT’nDRIVE
analysis was performed using the different perturbed networks. The proportion of common driver genes between
the unperturbed network and each of the perturbed networks was calculated (Figure 2.3F). We observed
that the list of driver genes did not change to a great extent (i.e. the overlap of driver genes with
the unperturbed network was very high) even when up to 10% of the edges of the network were perturbed.
This demonstrates that HIT’nDRIVE is robust to network perturbations.
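A common way to perturb edges while preserving node degrees is the double-edge swap: two edges (a, b) and (c, d) are replaced by (a, d) and (c, b). The sketch below assumes this is an acceptable stand-in for the perturbation used here, and assumes that "perturbing a fraction of edges" means rewiring roughly that fraction of the edge list; both are illustrative assumptions, as are the function names.

```python
import random
from collections import Counter

def perturb_edges(edges, fraction, seed=0, max_tries=10000):
    """Rewire about `fraction` of the edges by degree-preserving double-edge swaps."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    n_swaps = int(fraction * len(edge_list) / 2)  # each swap rewires two edges
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new1 == new2 or new1 in present or new2 in present:
            continue
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list

def degree_counts(edges):
    """Node degree table for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

ring = [(k, (k + 1) % 20) for k in range(20)]
perturbed = perturb_edges(ring, fraction=0.1, seed=1)
print(degree_counts(perturbed) == degree_counts(ring))
```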
2.4.5 HIT’nDRIVE: underlying network
We evaluated the robustness of HIT’nDRIVE on three networks, namely STRING, HPRD and REACTOME.
Only 34% of the vertices are shared among all three networks, and an even smaller proportion of the
edges is shared. Not surprisingly, the more nodes the network has, the more driver genes HIT’nDRIVE
predicts. This is consistently observed across various parameter settings. What is noteworthy is that
the percentage overlap between the driver genes
predicted on the three networks is quite robust, i.e., the percentage of driver genes shared between all
three networks is preserved across various parameter settings. For example, this overlap is above 60% between
REACTOME and either of the other two networks, across various values of γ. In fact, the driver genes
predicted on STRING are almost a superset of those predicted on REACTOME (Figure 2.3A-E).
2.4.6 Modified HIT’nDRIVE: when it is not required to prioritize at least one driver gene per patient.
In HIT’nDRIVE, at least one gene is picked per patient (i.e. when β > 0). This constraint is based on
the implicit assumption that at least one causal mutation should be driving cancer. There could
be exceptions to this: for example, the driver event could be something other than a genomic alteration,
such as methylation or aberrant expression of a regulatory RNA or a metabolite. These could
all be incorporated in our framework given matching data, which unfortunately is not available through
TCGA. There are also important performance issues related to the value of β: (1) Setting β > 0
significantly improves the robustness of our method with respect to the α parameter. In Figure 2.4, it
can be observed that the α parameter has minimal effect on the output of our method, provided β is
non-zero. If β = 0 (i.e. patients do not necessarily have one driver gene), our method is less robust, as can
be seen in Figure 2.4B. In Figure 2.4C, especially for small values of α, the number of patients that
do not have a driver gene increases as the value of γ decreases. In the worst case, ∼40% of patients
do not report a driver gene; this happens when α = 0.5 and γ = 0.02. To guarantee robustness, the
γ value should be set above 0.2 and the α value should be set above 0.7, which reduces the fraction
of patients with no driver genes to 5%. (2) Setting β = 0 significantly increases the running time of our
method, from a couple of minutes to several days on very large datasets.
2.4.7 HIT’nDRIVE’s ability to capture CGC genes
To check whether HIT’nDRIVE is able to capture the true driver genes, we performed the following analysis. For
the sake of this analysis, let us first assume that the cancer-type specific genes listed in the CGC database are
the true driver genes, i.e. the ground truth. We predicted potential driver genes in patients from four major
cancer types using HIT’nDRIVE (for details see Chapter 3). For every patient analyzed, we compared the
input (i.e. all sequence-wise altered genes) and the output (i.e. the subset of the input sequence-wise altered
genes that are predicted as potential driver genes) data for HIT’nDRIVE. We compared the number of
CGC true driver genes present in the input data with the number of CGC true driver genes captured by
HIT’nDRIVE.
Figure 2.5A-D summarizes the results of this analysis. As can be seen, the likelihood of a
sequence-wise altered CGC gene being prioritized by HIT’nDRIVE is much higher than that of a non-CGC
gene. Next, for each patient, we calculated the likelihood of HIT’nDRIVE capturing CGC genes
(see Section 2.6.6 for details). We found that the majority of the samples analyzed have a very significant
p-value (i.e. < 0.01) (Figure 2.5E). This analysis demonstrates that HIT’nDRIVE is able to capture cancer
driver genes, to a large extent, in the patient samples analyzed.
2.4.8 Correlation of predicted driver genes with alteration burden.
To obtain the mutation rate, we calculated the somatic mutation frequency per Mb (considering mutations
in protein-coding genes only). We obtained copy-number burden values (i.e. percentage of somatic copy-
number genome changed) using BioDiscovery Nexus Copy Number software. Figure 2.6A summarizes
the correlation between mutation rate and copy-number burden. As reported in many recent studies,
samples in OV, PRAD and BRCA had high copy-number burden. In the case of GBM, the majority of samples
had more or less equal mutation and copy-number burden.
Figure 2.6B shows the correlation of the number of HIT’nDRIVE predicted driver genes with mutation
rate. Except for a few highly mutated samples in BRCA, the number of driver genes predicted
by HIT’nDRIVE was not correlated with the somatic mutation rate of the respective sample. Finally,
Figure 2.6C shows the correlation of the number of HIT’nDRIVE predicted driver genes with copy-number
burden. Here too we observed that the number of HIT’nDRIVE predicted driver genes was largely independent
of the somatic copy-number burden in the genome. Therefore, except for the hypermutated cases, the
number of HIT’nDRIVE predicted driver genes is independent of both mutation rate and copy-number
burden.
2.4.9 Phenotype classification using dysregulated modules seeded with the predicted driver genes
Evaluating computational methods for predicting cancer driver genes is challenging in the absence of the
ground truth (i.e. follow-up biological experiments). Therefore, we mainly focused on testing whether
our predictions provide insight into the cancer phenotype and improve classification accuracy on an
independent cancer dataset. To test association of the driver genes identified by HIT’nDRIVE with the
cancer phenotype, as explained in the earlier section, we used the driver gene seeded gene-modules, a
set of functionally related genes (e.g. in a signaling pathway), from the protein interaction network, as
features for classifying the cancer phenotype. Using OptDis (here referred to as HIT’nDRIVE-OptDis),
we identified small connected sub-networks that include (i.e. are seeded by) predicted driver genes in
a greedy fashion. More specifically, we prioritized sub-networks (of at most seven genes) iteratively so
that in each iteration we identified the sub-network that maximally discriminates sample phenotypes
in a gene-expression matrix, among the sub-networks that share very few genes (at most 20%) with the
sub-networks already prioritized.
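The iterative prioritization with an overlap cap can be sketched as a simple greedy filter. This is only an illustration of the selection rule, not the OptDis search itself: it assumes candidate sub-networks and their discriminative scores are already available, and it measures overlap as the fraction of a candidate's genes already used by previously selected sub-networks (one possible reading of the 20% rule). The function name and toy scores are hypothetical.

```python
def select_modules(scored_modules, max_overlap=0.2):
    """Greedily keep the best-scoring modules whose overlap with earlier picks is small.

    scored_modules: list of (score, genes) pairs; a higher score means more discriminative.
    """
    chosen = []
    used = set()  # genes in modules already prioritized
    for score, genes in sorted(scored_modules, key=lambda m: -m[0]):
        genes = set(genes)
        if len(genes & used) / len(genes) <= max_overlap:
            chosen.append((score, genes))
            used |= genes
    return chosen

candidates = [
    (0.9, {"a", "b", "c", "d", "e"}),
    (0.8, {"a", "b", "f", "g", "h"}),  # shares 2 of 5 genes with the first pick
    (0.7, {"a", "x", "y", "z", "w"}),  # shares 1 of 5 genes with the first pick
]
picked = select_modules(candidates)
print([score for score, _ in picked])
```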
Furthermore, we have also developed an unsupervised method for module identification (here re-
ferred to as HIT’nDRIVE-unsupervised), i.e. one that does not depend on any phenotype information.
This unsupervised method seeds each module with one HIT’nDRIVE identified driver gene, and includes
the outlier genes that it has influence over and co-occurs with significantly across patients. For this, we
performed a hypergeometric test to identify significant driver-outlier interaction (i.e. mutual presence) pairs
across the patient cohort (p-value < 10^-3).
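As an illustration of the co-occurrence test, the following sketch computes the upper-tail hypergeometric probability of seeing at least the observed number of patients that carry both the driver alteration and the outlier event. The exact parametrization used in the analysis is not stated here, so the set-up below (patients as the population, driver carriers as "successes", outlier carriers as the draw) is an assumption, as is the function name.

```python
from math import comb

def cooccurrence_pvalue(n_patients, n_driver, n_outlier, n_both):
    """P(X >= n_both) for X ~ Hypergeometric(n_patients, n_driver, n_outlier).

    n_patients: cohort size
    n_driver:   patients in which the driver gene is altered
    n_outlier:  patients in which the gene is an expression outlier
    n_both:     patients with both events
    """
    upper = min(n_driver, n_outlier)
    total = comb(n_patients, n_outlier)
    tail = sum(
        comb(n_driver, k) * comb(n_patients - n_driver, n_outlier - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total

# Toy cohort: 100 patients, 10 with the driver, 12 with the outlier, 8 with both.
p = cooccurrence_pvalue(100, 10, 12, 8)
print(p, p < 1e-3)
```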
Here we compare HIT’nDRIVE-OptDis and HIT’nDRIVE-unsupervised to another network-based
driver gene prioritization method, DriverNet [14]. DriverNet itself does not aim to identify modules that
we can compare against HIT’nDRIVE-OptDis or HIT’nDRIVE-unsupervised modules. Rather,
DriverNet identifies driver genes in an iterative fashion, where in each iteration, DriverNet picks the
driver gene which “covers” the maximum number of uncovered outliers. We use this driver and the
outlier genes it covers as the “next” DriverNet module.
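The way modules are derived from this iterative covering step can be sketched as a plain greedy set cover. This is a simplified illustration of the step described above, not the DriverNet implementation (which also assesses statistical significance); the function name and the toy driver-to-outlier map are assumptions.

```python
def greedy_cover_modules(covers):
    """Build (driver, newly covered outliers) modules by greedy set cover.

    covers: dict driver -> set of outliers it is directly connected to
    """
    uncovered = set().union(*covers.values())
    modules = []
    while uncovered:
        # Pick the driver covering the most still-uncovered outliers
        # (ties broken alphabetically because candidates are sorted).
        best = max(sorted(covers), key=lambda d: len(covers[d] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break
        modules.append((best, gained))
        uncovered -= gained
    return modules

toy_covers = {"A": {"o1", "o2", "o3"}, "B": {"o3", "o4"}, "C": {"o4"}}
print(greedy_cover_modules(toy_covers))
```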
We used the set of prioritized sub-networks, i.e. the driver modules, first, to perform binary sample
classification: tumor vs normal. For this, we used gene-expression data for each of the four cancer
types (GBM, OV, PRAD and BRCA) from TCGA as discovery datasets to calculate the mean gene
expression value for each sub-network/driver module, for each patient. On these sub-networks, we used
the k-nearest neighbour (KNN) classifier (with k = 1), to perform classification on both the expression
values from TCGA, and additional validation gene-expression datasets (Figure 2.7A-C). The additional
validation datasets were used in order to assess the capability of the modules identified on TCGA cohort,
in classifying other cohorts.
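The classification step can be sketched as follows: each sample is summarized by the mean expression of the genes in each module, and a query sample receives the label of its nearest training sample (k = 1). The distance used by the classifier is not stated here, so Euclidean distance is an assumption, as are the function names and toy values.

```python
from math import dist

def module_features(expression, modules):
    """Mean expression per module for one sample.

    expression: dict gene -> expression value
    modules:    list of gene lists
    """
    return [sum(expression[g] for g in genes) / len(genes) for genes in modules]

def predict_1nn(train_features, train_labels, query):
    """Label of the training sample closest to `query` (Euclidean distance assumed)."""
    nearest = min(range(len(train_features)),
                  key=lambda k: dist(train_features[k], query))
    return train_labels[nearest]

modules = [["g1", "g2"], ["g3"]]
train = [
    ({"g1": 5.0, "g2": 7.0, "g3": 1.0}, "tumor"),
    ({"g1": 1.0, "g2": 1.0, "g3": 4.0}, "normal"),
]
features = [module_features(expr, modules) for expr, _ in train]
labels = [label for _, label in train]

query = module_features({"g1": 4.0, "g2": 6.0, "g3": 2.0}, modules)
print(predict_1nn(features, labels, query))
```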
For every dataset analyzed, the maximum classification accuracy achieved by HIT’nDRIVE mod-
ules (either HIT’nDRIVE-unsupervised or HIT’nDRIVE-OptDis), for any number of modules consid-
ered, was higher than that achieved by DriverNet modules (Figure 2.7A). Moreover, in most datasets,
HIT’nDRIVE methods achieve maximum or near-maximum accuracy using a smaller fraction of modules.
All three methods achieved perfect or near-perfect classification accuracy in the TCGA-GBM, TCGA-OV
and TCGA-BRCA datasets, but not in the TCGA-PRAD dataset (where the maximum classification
accuracy achieved was 90% by HIT’nDRIVE-unsupervised, 95% by HIT’nDRIVE-OptDis and 86% by
DriverNet). Overall, the driver modules (identified in one cohort) were able to distinguish the tumor
phenotype from normal very well in validation datasets (on other cohorts), supporting the relevance of
the identified driver genes to the cancer phenotype.
2.4.10 CGC cancer type-specific gene enrichment.
Next, we looked into the lists of driver genes prioritized by HIT’nDRIVE and DriverNet and their
overlap with the known CGC genes (Figure 2.7B). DriverNet selects a much larger number of driver
genes than HIT’nDRIVE to cover most outlier genes (across all four cancer types) because
its model considers only direct interactions in the network. In particular, in OV and BRCA, the
number of HIT’nDRIVE identified driver genes is an order of magnitude smaller than that of DriverNet.
Although in the GBM and PRAD datasets the number of driver genes identified by DriverNet is somewhat
lower and comparable to that identified by HIT’nDRIVE (primarily because most outliers were filtered
out due to sharing no interaction edge with candidate altered genes), HIT’nDRIVE identified driver
genes cover a significantly larger number of outliers. More importantly, even though HIT’nDRIVE
identifies a smaller number of driver genes, a larger fraction of these driver genes can be found in the CGC
database in comparison to the DriverNet identified driver genes. In fact, an even larger fraction of CGC
genes specific to the relevant cancer type can be found among HIT’nDRIVE identified driver genes.
Specifically, HIT’nDRIVE predicted four glioblastoma specific CGC genes (IDH1, PDGFRA, PIK3CA
and PIK3R1) in TCGA-GBM dataset. Among them, IDH1, PDGFRA and PIK3CA were not identified by
DriverNet. Similarly, four ovarian cancer specific CGC genes (BRCA1, BRCA2, CCNE1 and MAPK1)
were predicted in TCGA-OV dataset. CCNE1 was not identified by DriverNet. Five prostate cancer
specific CGC genes (BRAF, ERG, FOXA1, PTEN and SPOP) were predicted in TCGA-PRAD dataset.
BRAF and SPOP were not identified by DriverNet. And seven breast cancer specific CGC genes (BRCA2,
CCND1, CDH1, GATA3, MAP3K1, PIK3CA and TP53) were predicted in TCGA-BRCA dataset. Among
them, CDH1 and MAP3K1 were not identified by DriverNet.
2.4.11 Phenotype classification using CGC gene seeded modules
To evaluate the difference between HIT’nDRIVE predicted driver genes and a list of known driver genes
(from CGC), we performed the following experiments. First, using HIT’nDRIVE-OptDis, we compared
the HIT’nDRIVE driver seeded module with CGC gene seeded module to classify tumor vs normal
samples in the TCGA-PRAD patient cohort. Note that among the four TCGA cancer cohorts we study in
this chapter, only the PRAD cohort includes a non-trivial number of patients with no known driver genes
(based on an unpublished study by the PCAWG project) and thus provides a good testbed for novel driver
gene identification by HIT’nDRIVE. As can be seen in Figure 2.8A, HIT’nDRIVE identified driver seeded
modules provide higher classification accuracy, potentially due to novel driver genes identified by HIT’nDRIVE.
The top HIT’nDRIVE modules associated with PRAD are seeded by (in the order of discriminative
ability) ERG, ACAN, FOXA1, ERG, PTEN and CDKN1B (Figure 2.8A). All but ACAN are CGC genes
associated with PRAD. HIT’nDRIVE successfully identified all these driver genes without the use of any
information related to known PRAD driver genes from CGC. In addition, HIT’nDRIVE identified ACAN,
a non-CGC gene as a potential driver gene of PRAD. In comparison, the modules identified for CGC
PRAD driver genes were seeded by (again in the order of discriminative ability) ERG, FOXA1, NCOR2,
BRAF, ERG and AR - missing PTEN due to potentially large overlap with other modules. Overall, the
modules seeded by HIT’nDRIVE identified driver genes provide a higher accuracy in discriminating
PRAD than CGC PRAD driver genes.
Next, we compared HIT’nDRIVE driver genes to CGC genes in breast cancer subtypes in the TCGA-BRCA
patient cohort. Note that breast cancer is possibly the best studied cancer type with respect to
driver genes. Thus it is not surprising that the Basal, Her2 and Luminal-B subtypes show negligible
differentiation between HIT’nDRIVE predictions and CGC based predictions (Figure 2.8B). This is due
to the large overlap between HIT’nDRIVE discovered modules and CGC modules (e.g. in Basal, the top 4
HIT’nDRIVE modules almost perfectly match the top 4 CGC modules). However, HIT’nDRIVE
shows some advantage in Luminal-A, where it outperformed the CGC genes from the 43rd module onward.
This may be due to HIT’nDRIVE predicted driver genes (seeds) such as DMD, ROCK1, AGAP1 and
SHANK2, which are not part of CGC but play an important role in cancer.
2.5 Discussion
Here, we have presented a network-based combinatorial method, HIT’nDRIVE, which models the collective
effects of sequence altered genes on expression altered genes. HIT’nDRIVE aims to solve the
“random-walk facility location” (RWFL) problem on a gene/protein interaction network, which differs
from the standard facility location problem by its use of “hitting time”, the expected minimum number
of hops in a random walk originating from any sequence altered gene (i.e. a potential driver) to reach
an expression altered gene, as a distance measure. We introduced the notion of “multi-hitting time” and
presented efficient and accurate methods to estimate it based on single-source hitting time in large-scale
networks. HIT’nDRIVE reduces RWFL (with multi-hitting time as the distance) to a weighted multi-set
cover problem, which it formulates and solves as an ILP.
As a measure of influence, hitting time - the expected length of a random walk between two nodes, or
its general version, the multi-hitting time, is quite different from the diffusion-based measures or Rooted
PageRank, which are based on asymptotic distributions. We argue that hitting time is a better measure for
our purposes as it is: (i) parameter free (diffusion model introduces at least one additional parameter - the
proportion of incoming flow “consumed” at a node in each time step), (ii) it is time dependent (while the
diffusion model and PageRank measure the stationary behavior) and (iii) it is more robust with respect
to small perturbations in the network [74].
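To make the notion of single-source hitting time concrete, the sketch below solves the standard linear system h(target) = 0 and h(u) = 1 + Σ_v P(u, v) h(v) for a small weighted graph, where P(u, v) is the edge weight normalized by the total weight leaving u. It is a generic linear solve in pure Python, not necessarily the matrix inversion method cited in this chapter, and it does not cover the multi-hitting time. It assumes the target is reachable from every node; the function name and toy graph are hypothetical.

```python
def hitting_times(adjacency, target):
    """Expected number of steps for a random walk to first reach `target`.

    adjacency: dict node -> dict neighbour -> non-negative edge weight
    Returns a dict node -> hitting time for every node other than the target.
    """
    nodes = [u for u in adjacency if u != target]
    index = {u: k for k, u in enumerate(nodes)}
    n = len(nodes)

    # Augmented matrix of (I - Q) h = 1, where Q holds the transition
    # probabilities between non-target nodes.
    rows = []
    for u in nodes:
        total = sum(adjacency[u].values())
        row = [0.0] * (n + 1)
        row[index[u]] = 1.0
        for v, w in adjacency[u].items():
            if v != target:
                row[index[v]] -= w / total
        row[n] = 1.0
        rows.append(row)

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]

    return {u: rows[index[u]][n] / rows[index[u]][index[u]] for u in nodes}

# Path graph a - b - c with unit weights; walks are absorbed at c.
path = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
print(hitting_times(path, "c"))
```

On this path graph the walk from `a` needs 4 steps on average and the walk from `b` needs 3, which matches the hand calculation h(b) = 1 + 0.5 h(a) and h(a) = 1 + h(b).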
In this chapter, we demonstrated the robustness of HIT’nDRIVE in identifying cancer driver genes in
multi-omics cancer datasets using a number of different experiments: varying the user-defined
parameters of HIT’nDRIVE, randomizing the input data, randomizing the interaction network, and using
different interaction networks. We also demonstrated that HIT’nDRIVE is able to capture cancer driver
genes, to a large extent, in the tumors analyzed. Furthermore, we demonstrated that it is also possible
to perform accurate phenotype prediction for tumor samples by only using HIT’nDRIVE implied
driver genes and their “network modules of influence” (small sub-networks involving each driver gene
where the aggregate expression profile correlates well with the cancer phenotype) as features, providing
additional evidence that these genes may be driving the cancer phenotype. The network modules we
identified may provide new insights into the biological mechanisms underlying tumor progression.
2.6 Methods
2.6.1 Datasets and Analysis
We used publicly available datasets of four major cancer types, glioblastoma multiforme (GBM) [175],
ovarian serous cystadenocarcinoma (OV) [176], breast adenocarcinoma (BRCA) [177], and prostate
adenocarcinoma (PRAD) [178], from The Cancer Genome Atlas (TCGA) project. All data were obtained
from the TCGA data portal in May 2014 and were mapped to the GRCh37 genome build. Although
TCGA has recently made available all data re-aligned to the newer GRCh38 genome build, to ensure
compatibility, all TCGA data we have used in this study have been mapped to GRCh37.
Somatic mutation
Somatic mutation calls (level 2 data) from all available platforms/centres were merged. Only missense,
nonsense and splice-site mutations were marked as somatic-mutation alteration events.
Copy number aberrations (CNAs)
For GBM and OV, Agilent Human Genome CGH Microarray 244A (level 1) data files were used,
and for PRAD and BRCA, Affymetrix Genome-Wide Human SNP Array 6.0 (level 3) data files were
used to generate the copy number profiles.
These Agilent FE format sample files were loaded into BioDiscovery Nexus Copy Number software
v7.0, where quality was assessed and data was visualized and analyzed. All samples were mapped to the
most recent genome build (hg19, NCBI build 37) via Agilent probe identifiers and annotation (downloaded
from Agilent’s website) based on the 1M SurePrint G3 Human CGH Microarray 1x1M design
platform. BioDiscovery’s FASST2 segmentation algorithm, a Hidden Markov Model based approach,
was used to make copy number calls. The FASST2 algorithm, unlike other common HMM methods for
copy number estimation, does not aim to estimate the copy number state at each probe but uses many
states to cover more possibilities, such as mosaic events. These state values are then used to make calls
based on a log-ratio threshold. The significance threshold for segmentation was set at 5 × 10^-6, also
requiring a minimum of 3 probes per segment and a maximum probe spacing of 1000 between adjacent
probes before breaking a segment. The log-ratio thresholds for single copy gain and single copy loss
were set at 0.2 and -0.23, respectively. The log-ratio thresholds for two or more copy gain and homozygous
loss were set at 1.14 and -1.1, respectively. Upon loading of raw data files, signal intensities were
normalized via division by the mean. All samples were corrected for GC wave content using a systematic
correction algorithm. Only the high confidence copy number aberrations, i.e. high copy number gains or
homozygous deletions, were marked as copy-number aberrant events. Finally, genes that harbour either
a somatic-mutation aberrant event or a copy-number aberrant event were taken to be the final list of
aberrant genes at the genomic level.
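The thresholding step can be illustrated with a small function that maps a segment log-ratio to a call using the four thresholds quoted above, and flags only the high-confidence events. Whether the thresholds are inclusive is not stated here, so the use of `>=` and `<=` is an assumption, as are the function and label names; the segmentation itself (FASST2) is not reproduced.

```python
def call_copy_number(log_ratio):
    """Map a segment log-ratio to a copy number call using the quoted thresholds."""
    if log_ratio >= 1.14:
        return "high_gain"        # two or more copy gain
    if log_ratio <= -1.1:
        return "homozygous_loss"
    if log_ratio >= 0.2:
        return "gain"             # single copy gain
    if log_ratio <= -0.23:
        return "loss"             # single copy loss
    return "neutral"

def is_aberrant_event(log_ratio):
    """Only high copy gains and homozygous deletions count as aberrant events."""
    return call_copy_number(log_ratio) in {"high_gain", "homozygous_loss"}

for value in (1.5, 0.4, 0.0, -0.5, -1.3):
    print(value, call_copy_number(value), is_aberrant_event(value))
```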
Gene expression
We used microarray based gene-expression data (Affymetrix HT Human Genome U133 Array Plate Set,
level 1) for the GBM and OV data sets, whereas for the BRCA and PRAD data sets, RNA-seq derived
gene-expression data (level 3) were used. Gene expression profiles of normal and tumor phenotypes were
used as sample groups.
Gene fusions
Transcript fusion prediction calls for GBM, OV, BRCA and PRAD were obtained from the TCGA Fusion
Gene Data Portal (http://www.tumorfusions.org) [207]. The fusion partner genes were tagged with a
gene-fusion alteration.
2.6.2 Interaction networks
We used the STRING version 10 [168] protein-interaction network, which contains high confidence functional
protein-protein interactions (PPI). Self-loops and interactions with missing HGNC symbols were
discarded and interaction scores were divided by 1000 to obtain a percentage-like reliability score. Only
high confidence interactions with a combined score of 0.9 or greater were selected. As a result we obtained
a network of 10971 nodes with 214298 interactions.
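The filtering steps described above can be sketched on an in-memory edge list. The input layout (gene symbol, gene symbol, raw combined score on a 0-1000 scale, with missing symbols given as `None` or an empty string) is an assumption for illustration, as is the function name; reading the actual STRING files is not shown.

```python
def filter_interactions(rows, min_score=0.9):
    """Keep high-confidence, non-self interactions with both gene symbols present.

    rows: iterable of (gene_a, gene_b, raw_combined_score) with scores in 0..1000
    Returns a set of (gene_a, gene_b, score) with genes ordered and score rescaled.
    """
    kept = set()
    for a, b, raw in rows:
        if not a or not b:   # missing gene symbol
            continue
        if a == b:           # self-loop
            continue
        score = raw / 1000   # rescale to a reliability score between 0 and 1
        if score >= min_score:
            kept.add((min(a, b), max(a, b), score))
    return kept

toy_rows = [
    ("TP53", "MDM2", 999),
    ("MDM2", "TP53", 999),   # same undirected edge, listed twice
    ("TP53", "TP53", 950),   # self-loop
    ("AR", "", 990),         # missing symbol
    ("AR", "FOXA1", 850),    # below the confidence cut-off
]
print(filter_interactions(toy_rows))
```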
In the case of prostate cancer, we integrated the STRING v10 protein-protein interaction network with a
protein-DNA interaction network derived from ChIP-seq experiments for transcription factors highly
relevant to prostate cancer (REST, FOXA1, AR, EZH2 [150] and ERG [141]), resulting in a new combined
network of 13517 nodes and 220190 interactions.
To run HIT’nDRIVE on different underlying networks, we used two additional interaction networks:
the Human Protein Reference Database - Protein-Protein Interaction Database (HPRD-PPI) network
(version 9.0) [134] and the REACTOME pathway database (version 2015) [55].
2.6.3 Validation dataset
For the validation of driver-modules we used the following gene-expression datasets: GBM: Murat-
2008 [122], Sun-2006 [164]; OV: Yoshihara-2009 [205], Bowen-2009 [20]; PRAD: Taylor-2010 [169],
Grasso-2012 [66], SMMU-PC [138]; BRCA: METABRIC [42], Richardson-2006 [140].
2.6.4 Derivation of expression outlier genes
We used generalized extreme studentized deviate (GESD) test [144] to obtain the outlier genes. Unlike
Grubbs test and the Tietjen-Moore test, GESD test only requires that an upper bound for the suspected
number of outliers be specified. Given the upper bound, r, the GESD test essentially performs r separate
tests: a test for one outlier, a test for two outliers, and so on up to r outliers.
2.6.5 Derivation of expression outlier gene weights
Outlier-gene weights were calculated as follows. Let $i$ denote genes, $j$ denote patients and $x_{ij}$ denote the
gene-expression value of gene $i$ in patient $j$. We first calculated the absolute value of the z-score ($z_{ij}$):

$$ z_{ij} = \frac{|x_{ij} - \mu_i|}{\sigma_i} $$

where $\mu_i$ and $\sigma_i$ respectively denote the mean and standard deviation of the expression value of gene $i$. Next,
we performed Student’s t-test on the gene-expression values of the normal and tumor phenotypes and defined

$$ \psi_i = -\log(pvalue_{ttest}) $$

Finally, we calculated the outlier weight $\omega_{ij}$ as

$$ \omega_{ij} = \frac{\psi_i z_{ij}}{\sum_i \psi_i z_{ij}} $$
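The weight computation can be sketched in a few lines. The sketch assumes the tumor-versus-normal t-test p-values have already been computed elsewhere and are passed in, uses the sample standard deviation, and uses a base-10 logarithm; since the weights are normalized within each patient, the choice of logarithm base does not change the result. The function name and toy values are hypothetical.

```python
from math import log10
from statistics import mean, stdev

def outlier_weights(expression, ttest_pvalues):
    """Per-patient normalized outlier weights omega_ij.

    expression:    dict gene -> list of expression values, one per patient
    ttest_pvalues: dict gene -> tumor-vs-normal t-test p-value (assumed precomputed)
    """
    genes = sorted(expression)
    n_patients = len(next(iter(expression.values())))

    # psi_i * z_ij for every gene and patient.
    score = {}
    for g in genes:
        mu, sigma = mean(expression[g]), stdev(expression[g])
        psi = -log10(ttest_pvalues[g])
        score[g] = [psi * abs(x - mu) / sigma for x in expression[g]]

    # Normalize over genes within each patient.
    weights = {g: [0.0] * n_patients for g in genes}
    for j in range(n_patients):
        column_total = sum(score[g][j] for g in genes)
        if column_total > 0:
            for g in genes:
                weights[g][j] = score[g][j] / column_total
    return weights

toy_expression = {"g1": [1.0, 2.0, 9.0], "g2": [5.0, 5.5, 4.5]}
toy_pvalues = {"g1": 0.001, "g2": 0.1}
print(outlier_weights(toy_expression, toy_pvalues))
```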
2.6.6 Statistical significance of the overlap of driver genes with that of CGC database.
Suppose, for a cohort of cancer patients, we predict $n_{total}$ driver genes using HIT’nDRIVE, out
of which $n_{cgc}$ driver genes are present in the CGC database (of known cancer driver genes).
Let $x$ be the total number of sequence altered genes (i.e. all potential driver genes) and let $y$ of these $x$
sequence altered genes be in CGC. This means that the probability that a randomly selected gene out of
these sequence altered genes happens to be a CGC gene is $y/x$.
The probability (p-value) that at least $n_{cgc}$ out of $n_{total}$ driver genes are identified in CGC is:

$$ pvalue = \sum_{i=n_{cgc}}^{n_{total}} \binom{n_{total}}{i} \left(\frac{y}{x}\right)^{i} \left(1 - \frac{y}{x}\right)^{n_{total}-i} $$
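This binomial tail probability is straightforward to evaluate directly. The sketch below is a direct transcription of the sum using only the standard library; the function name and the example numbers are illustrative and not taken from the analysis.

```python
from math import comb

def cgc_overlap_pvalue(n_total, n_cgc, x, y):
    """P(at least n_cgc of n_total predicted drivers are CGC genes by chance).

    x: number of sequence altered genes (all potential drivers)
    y: number of those genes that are in CGC
    """
    p = y / x  # chance that a randomly selected altered gene is a CGC gene
    return sum(
        comb(n_total, i) * p**i * (1 - p) ** (n_total - i)
        for i in range(n_cgc, n_total + 1)
    )

# Toy numbers: 20 predicted drivers, 6 in CGC, 1000 altered genes of which 30 are in CGC.
print(cgc_overlap_pvalue(20, 6, 1000, 30))
```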
Next we consider driver genes in each patient. We also calculated the p-value for HIT’nDRIVE to
pick at least $p$ CGC drivers out of $p'$ and pick at most $q$ non-CGC drivers out of $q'$ as follows:

$$ pvalue = \sum_{x=p'}^{p'+q'} \binom{p+q}{x} \left(\frac{p}{p+q}\right)^{x} \left(\frac{q}{p+q}\right)^{p'+q'-x} $$
Figure 2.1: Overview of HIT’nDRIVE algorithmic framework. (A) HIT’nDRIVE integrates genome and transcriptome data obtained from patients’ tumor samples. The red and blue colors represent genomic alterations and transcriptomic changes in tumor samples, respectively. The influence values derived from the protein interaction network indicate how likely a driver gene influences its downstream target genes in the network. (B) The predicted driver genes are used as seeds to discover modules of genes that discriminate between the sample phenotypes using OptDis. (C) Based on this the driver modules are ranked and thus prioritized.
Figure 2.2: HIT’nDRIVE identified driver genes with respect to varying parameter values in 100 selected BRCA samples. (A-B) The number of driver genes identified by HIT’nDRIVE with respect to the varying values of (A) γ, and (B) α. (C) The number of driver genes identified by HIT’nDRIVE with respect to three outlier detection threshold values, across varying values of γ. (D) Proportion of HIT’nDRIVE detected driver genes obtained for an outlier threshold of 0.01 which are also detected when the outlier threshold is 0.05 and 0.1. (E-F) Driver genes predicted by HIT’nDRIVE in non-randomized data compared with the driver genes predicted using randomized data (i.e. by gene label swapping for 100 iterations): (E) altered genes and (F) outlier genes.
Figure 2.3: HIT’nDRIVE identified driver genes with respect to the underlying network used in 100 selected BRCA samples. (A) Venn diagram showing the overlap of nodes in the three different networks used - STRING v10 (only high-confidence interactions), HPRD v9.0, and REACTOME v2015. (B) Comparison between the number of nodes in the networks. (C) Comparison between the number of edges in the networks. (D) Comparison between the number of driver genes detected using different networks. (E) Proportion of common driver genes between the networks (STRING-REACTOME and HPRD-REACTOME) as compared to driver genes detected using the REACTOME network. (F) HIT’nDRIVE identified driver genes with respect to network perturbation. The edges of the STRING v10 network were perturbed to different extents (between 1-10%) preserving the degree of the nodes in the network. The proportion of common driver genes between the unperturbed network and each of the perturbed networks was calculated.
Figure 2.4: Modified HIT’nDRIVE not required to prioritize at least one driver gene per patient. (A) Modified ILP formulation where we removed the constraint that ensured at least one driver gene is prioritized per patient. (B) HIT’nDRIVE simulation with different values of the gamma (γ) parameter with the modified ILP formulation as given in A. Each line represents a different value of the alpha (α) parameter, which controls the fraction of total outliers to be covered. (C) We calculated the fraction of patients with no driver genes prioritized, for the same set of driver genes prioritized in B.
Figure 2.5: Likelihood of HIT’nDRIVE to capture CGC genes. (A-D) Sequence-wise altered CGC genes prioritized by HIT’nDRIVE vs. that of non-CGC genes, for each patient sample, across four cancer types. Only CGC genes specific to a cancer type are considered here. Green: Cancer specific sequence-wise altered CGC genes prioritized by HIT’nDRIVE; Red: Cancer specific sequence-wise altered CGC genes NOT prioritized by HIT’nDRIVE; Orange: Sequence-wise altered non-CGC genes prioritized by HIT’nDRIVE; Purple: Sequence-wise altered non-CGC genes NOT prioritized by HIT’nDRIVE. The right panel depicts absolute numbers and the left panel depicts relative proportions. As can be seen, the likelihood of a sequence-wise altered CGC gene to be prioritized by HIT’nDRIVE is much higher than that of a non-CGC gene. (E) P-value distribution of the likelihood of HIT’nDRIVE to pick CGC genes.
Figure 2.6: Correlation of the number of driver genes predicted by HIT’nDRIVE with mutation rate and copy-number burden. (A) Correlation between mutation rate (frequency of somatic mutations per Mb) and copy-number burden (percentage of the genome changed, calculated using somatic copy-number changes). Correlation of the number of driver genes predicted by HIT’nDRIVE with (B) mutation rate and (C) copy-number burden.
Figure 2.7: Phenotype classification using driver-seeded modules. (A) Phenotype (tumor vs normal) classification accuracy in gene-expression datasets of different cancer types using three different methods: HIT’nDRIVE-unsupervised (left panel), HIT’nDRIVE-OptDis (middle panel) and DriverNet (right panel). (B) Comparison of HIT’nDRIVE with DriverNet.
Figure 2.8: Phenotype classification using CGC gene-seeded modules. Phenotype classification accuracy of HIT’nDRIVE driver-seeded modules vs Cancer Gene Census (CGC) gene-seeded modules. (A) TCGA-PRAD gene-expression dataset with tumor and normal samples. (B) Subtype classification accuracy of HIT’nDRIVE-identified driver-seeded modules vs CGC BRCA driver-seeded modules on the TCGA-BRCA cohort with respect to four subtypes of breast cancer (Basal, HER2, Luminal-A and Luminal-B).
Chapter 3
Application of HIT’nDRIVE: patient-specific multi-driver gene prioritization for precision oncology
3.1 Introduction
To demonstrate the utility of HIT’nDRIVE, we analyzed over 2200 genomes and transcriptomes
(gene expression) of tumors from four major cancer types (glioblastoma, ovarian, breast and prostate
cancer) from the TCGA project. We present the driver genes obtained by HIT’nDRIVE on this dataset and
explore their functional properties. Many of the driver genes identified by HIT’nDRIVE turn out to be
known drivers from the CGC database [59]. This suggests that HIT’nDRIVE can recover in silico what
lengthy and costly experimental approaches have established for common tumor types, and it supports
the biological relevance of HIT’nDRIVE’s algorithmic framework. This
observation increases our confidence in the calls made by HIT’nDRIVE in rarer tumor types for which
driver genes are mostly unknown. In fact, the initial results of the PanCancer Atlas project [12]
reveal that more than 20% of tumors do not have a single (genomically altered) driver gene from CGC.
3.2 Our Contributions
In this chapter, we used HIT’nDRIVE to identify both known and rare (and potentially novel)
patient-specific driver genes in large multi-omics datasets from different cancer types. We also demonstrate
that by using HIT’nDRIVE-identified driver genes and associated “network modules” (sub-networks
seeded by driver genes whose aggregate expression profiles correlate well with the cancer phenotype)
as features, it is possible to perform accurate phenotype classification, which provides additional evidence that
these genes are likely drivers of the cancer phenotype. We found a number of breast cancer subtype-specific
driver modules that are associated with patients’ survival outcome. Finally, we demonstrate that
HIT’nDRIVE-identified driver genes accurately predict drug efficacy in pan-cancer cell lines.
3.3 Results
3.3.1 HIT’nDRIVE predicts frequent as well as infrequent driver genes in multi-omics cancer datasets
We applied HIT’nDRIVE to prioritize driver genes in four major cancer types - Glioblastoma multiforme
(GBM) [175], Ovarian serous cystadenocarcinoma (OV) [176], Breast adenocarcinoma (BRCA) [177], and
Prostate adenocarcinoma (PRAD) [178] obtained from the TCGA data portal. Only samples with matched
genomic alterations (SNVs and/or CNAs and/or gene fusions) and transcriptomic changes (outlier genes
from gene-expression profile) were used in our study. We used the fusion prediction calls as reported in
the TCGA Fusion gene Data Portal [207].
In GBM, we obtained 48 unique candidate driver genes altered at varying frequencies across 258
GBM patients. EGFR (36%), TP53 (29.5%), PTEN (28%) and CHEK2 (26%) were the most frequently
altered driver genes in GBM, followed by CDKN2A (16%), RB1 (13%) and SEC61G (12%). Previous
efforts in GBM genome characterization identified amplification of EGFR and PDGFRA, mutations in
CHEK2, TP53, PTEN, RB1 and NF1, and deletions of CDKN2A to be associated with GBM [128, 175, 193].
HIT’nDRIVE prioritized all of the above alterations. Alterations in EGFR are characteristic of the classical
subtype, NF1 of the mesenchymal subtype, and PDGFRA and IDH1 of the pro-neural subtype of GBM [193].
Fifteen of the 48 driver genes predicted by HIT’nDRIVE (p-value = 8 × 10^-4) were present in the CGC
database [59], which contains genes for which mutations have been causally implicated in cancer (Figure 3.1A).
GSTT1 (deleted in 21 patients), a key player in drug metabolism, was found in neither the CGC nor the
Catalogue of Somatic Mutations in Cancer (COSMIC) [58] database. Twelve GBM driver genes were found
to be actionable targets. Actionable genes were extracted from the TARGET database [187], which contains
genes directly linked to a clinical action. In addition to the above list, 6 other driver genes were druggable
(Figure 3.1B). We extracted the list of druggable genes from the Drug-Gene interaction database (DGIDB)
[70]. Interestingly, around 85% of the patients in the GBM cohort harbour at least one actionable driver gene
and a further 5% of patients have druggable targets (Figure 3.1C). HIT’nDRIVE also identified 12 infrequent
driver genes, which we define as genes altered in at most 2% of the cases. Among the infrequent
genes, SACS is known to be associated with neurological functions, NLRP3 is involved in apoptosis, and
TIAM2 is involved in invasion and metastasis.
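The cohort-level summaries used throughout this section (alteration frequency, the "at most 2%" definition of infrequent drivers, and the fraction of patients with at least one actionable driver) can be sketched as follows. This is a minimal illustration on made-up data, not the code used in the thesis; the function name `summarize_drivers` and the toy cohort are assumptions introduced here.

```python
from collections import Counter

def summarize_drivers(patient_drivers, actionable, infrequent_cutoff=0.02):
    """Summarize per-patient driver calls for one cohort (hypothetical helper).

    patient_drivers: dict mapping patient id -> set of prioritized driver genes
    actionable:      set of genes treated as actionable (e.g. from a curated list)
    """
    n_patients = len(patient_drivers)
    counts = Counter(g for genes in patient_drivers.values() for g in genes)
    # Fraction of patients in which each gene was called a driver
    freq = {g: c / n_patients for g, c in counts.items()}
    # Infrequent drivers: altered in at most 2% of cases (definition from the text)
    infrequent = sorted(g for g, f in freq.items() if f <= infrequent_cutoff)
    # Fraction of patients harbouring at least one actionable driver
    covered = sum(1 for genes in patient_drivers.values() if genes & actionable)
    return freq, infrequent, covered / n_patients

# Toy cohort of 100 hypothetical patients (illustrative only, not TCGA data)
toy = {f"P{i}": set() for i in range(100)}
for i in range(36):
    toy[f"P{i}"].add("EGFR")
toy["P99"].add("TIAM2")

freq, infrequent, coverage = summarize_drivers(toy, actionable={"EGFR"})
```

The actionable set here is passed in explicitly; in the analysis above it would correspond to the genes extracted from the TARGET database.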
The 526 OV patients harboured a total of 85 unique driver alterations. TP53 mutations were prevalent
in more than half (58%) of the patients in the cohort. Consistent with previous findings, we
found OV patients to be driven by genomic copy-number changes rather than recurrent point mutations
[38, 129]. Recurrent somatic CNAs were observed in GSTT1 (32.3%), WWOX (28.1%), FAM49B
(15.0%), UGT2B17 (14.6%), CCNE1 (13.1%), SLC39A4 (13.1%) and MYC (12.5%). Mutations in
TP53 and BRCA1/2 and loss of RB1, NF1 and CCNE1 were previously associated with OV [129, 176].
HIT’nDRIVE revealed 18 CGC driver genes (p-value = 2 × 10^-5) (Figure 3.1A), among which 13 genes
were actionable targets, and a further 12 genes were at least druggable (Figure 3.1B). More than 75% of OV
patients harboured at least one actionable target and an additional 6% of patients have a druggable target
(Figure 3.1C). GSTT1 (altered in 170 OV patients) is involved in estrogen and drug metabolism. It
was found in neither the CGC nor the COSMIC database. We identified 13 infrequent genes, among which
MAPK1 is known to play an important role in oncogenic pathways in cancer.
HIT’nDRIVE identified 40 driver genes across 333 PRAD patients. Copy-number loss of SPECC1L
(23.7%), STEAP1B (13%) and WWOX (10%) and amplification of NSD1 (16.2%) and SIRPB1 (16.2%) were
the most recurrent events in PRAD patients. We also found recurrent somatic mutations in MUC4
(11%), SPOP (10.5%) and TP53 (10%). The most common alterations in PRAD genomes are fusions
of androgen-regulated promoters with ERG and other members of the ETS family of transcription factors,
mainly TMPRSS2-ERG fusions [180]. Because we relied on the gene fusion predictions obtained from the
TCGA Fusion gene Data Portal [207], which analyzed only 178 (out of 333) patients, we observed ERG
gene fusions in only 5.7% of cases. The more recent TCGA publication [178] reported ERG fusions in
almost half of the patients in the cohort. Moreover, the two studies used different tools for gene fusion
detection. As a result, we observed a much smaller number of ERG fusions than reported
previously. SPOP, TP53, FOXA1 and PTEN are the most frequently mutated genes that have
been previously associated with prostate cancer [13]. PRAD patients harboured 12 driver genes present
in the CGC database (p-value = 9 × 10^-4) (Figure 3.1A), of which 8 driver genes were actionable (Figure
3.1B). Approximately a quarter of PRAD patients could benefit from actionable targeted therapy (Figure
3.1C). Moreover, an additional 14% of patients harboured druggable genes, which warrants deeper
investigation of drug repurposing opportunities. NBPF1 (mutated in 17 patients), a known tumor suppressor
gene that has neural function and is also involved in cell-cycle arrest, was found in neither the CGC nor
the COSMIC database. We identified 11 infrequent genes in PRAD. Among these, IDH1-mutant patients
were recently identified as a distinct molecular subtype of PRAD [178], NKX3-1 is required for normal
prostate tissue development, and CDKN1B was previously associated with PRAD.
In BRCA, HIT’nDRIVE identified 107 driver genes across 1090 patients. Somatic mutations of PIK3CA
(30.5%) and TP53 (30.2%) were the most recurrent events in BRCA. These were followed by somatic
mutations of CHD1 (11.2%), GATA3 (10.5%), MUC16 (6.9%) and MAP3K1 (6.9%) and CNA amplification of
NSD1 (8.7%) and MED1 (6.9%). BRCA patients harboured 16 genes present in the CGC database (p-value
= 9.3 × 10^-3) (Figure 3.1A), among which 10 genes were actionable targets (Figure 3.1B). More than 60%
of BRCA patients could benefit from actionable targeted therapy. Furthermore, an additional 11% of
BRCA patients harboured at least one of the 19 potentially druggable genes (Figure 3.1C). ACACA
(altered in 36 patients, mostly from the HER2 subtype), involved in fatty-acid metabolism, was found
in neither the CGC nor the COSMIC database. We identified 46 infrequent driver genes, among which BRCA2 and
GNAS have been previously linked to BRCA.
3.3.2 Network properties of cancer driver genes
Centrality of Driver Genes in the Interactome.
Cancer driver genes are known to occupy critical positions in the interactome. To check whether the driver
genes predicted by HIT’nDRIVE also occupy similar positions in the interaction network, we used the node degree
as a “local measure”, and node betweenness (the number of shortest paths between node pairs that pass
through the node) as a “global measure” of centrality. The driver genes predicted by HIT’nDRIVE include
a number of well-known high-degree hubs (TP53, EGFR, RB1, MYC, PIK3CA, ERG, CHD1) that
are “central” in the interactome with high degree and high betweenness (Figure 3.2A). Although there
was only a very weak correlation between the number of edges (i.e. degree centrality) of a node and the number
of samples/patients in which it is identified as a driver, each hub gene was typically altered in
a large fraction of patients. Because of their centrality, perturbations of hub genes are likely to dysregulate
several other genes and the associated signaling pathways. Interestingly, HIT’nDRIVE also identified
low-degree genes (IDH1, MTAP, NF1, NRG1, NSD1) that reside in the periphery of the interaction network.
In particular, in prostate cancer, there seems to be an inverse correlation between the degree and
how often the gene is picked as a driver. Most of these low-degree genes are altered in a small fraction of
patients, indicating that HIT’nDRIVE, unlike many other methods, does not primarily return hubs that
are altered in a large number of patients but is capable of identifying rare driver genes without trivial
topological biases.
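The two centrality measures can be illustrated on a small undirected graph. The sketch below is an assumption-laden illustration rather than the analysis code of the thesis: it uses Brandes' algorithm for unweighted graphs, which computes the standard betweenness (when several shortest paths connect a pair, each contributes fractionally), and the function name and toy graph are hypothetical.

```python
from collections import deque

def degree_and_betweenness(adj):
    """adj: dict node -> set of neighbours (undirected, unweighted).
    Every neighbour must also appear as a key of adj."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance from s
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints
    return degree, {v: b / 2 for v, b in bc.items()}

# Toy star graph: one hub connected to four peripheral nodes (not real STRING edges)
star = {"hub": {"a", "b", "c", "d"},
        "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
deg, btw = degree_and_betweenness(star)
```

In the star graph the hub lies on the single shortest path between each of the six pairs of peripheral nodes, while the peripheral nodes lie on none, which mirrors the hub versus periphery contrast described above.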
Influential nodes prioritized as cancer driver genes.
Next we examined the influence exerted by the driver genes. For this, we
computed the total outgoing influence from each altered gene (which has been chosen as a driver), defined
as the sum of the influence values from the source gene to all outlier genes it is connected to (targets),
each weighted by the corresponding outlier weight. First we investigated driver genes with high influence
values within each cancer type. We observed that on average the total influence of driver genes was
higher than that of other altered genes in all cancer types (Figure 3.2D). EGFR, PTEN, CHEK2, TP53
and CDKN2A were the most influential driver genes in GBM which together exerted 38.5% of the total
influence on the GBM patient cohort. In OV, TP53, GSTT1 and MYC together exerted 20% of the total
influence. Similarly, in PRAD cohort, SPOP, MUC4 and TP53 were the most influential genes exerting
23.7% of the total influence. PIK3CA, TP53 and CHD1 were the most influential genes exerting 23% of
the total influence on the BRCA patient cohort. Moreover, a gene’s influence was positively correlated
with its alteration frequency (Figure 3.2E).
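The total outgoing influence described in this paragraph, and a gene's share of the cohort-wide total, can be written as a short sketch. The data layout (a dictionary keyed by source and target pairs) and all names and values are assumptions made for illustration; the influence values themselves would come from the network diffusion model described in Chapter 2.

```python
def total_outgoing_influence(influence, outlier_weight):
    """influence:      dict (source gene, target gene) -> influence value
    outlier_weight: dict outlier gene -> outlier weight
    Returns dict source gene -> sum of influence * weight over outlier targets."""
    total = {}
    for (src, tgt), value in influence.items():
        if tgt in outlier_weight:  # only expression outliers count as targets
            total[src] = total.get(src, 0.0) + value * outlier_weight[tgt]
    return total

def influence_share(total):
    """Fraction of the summed influence attributable to each source gene."""
    grand_total = sum(total.values())
    return {g: v / grand_total for g, v in total.items()}

# Hypothetical values for two altered genes A and B and outliers x and y
influence = {("A", "x"): 0.5, ("A", "y"): 0.25, ("B", "x"): 0.25, ("B", "z"): 0.5}
weights = {"x": 2.0, "y": 4.0}  # z is not an outlier, so it contributes nothing

totals = total_outgoing_influence(influence, weights)
shares = influence_share(totals)
```

Statements such as "together exerted 38.5% of the total influence" correspond to summing such shares over the named genes.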
We then investigated the influence of the predicted driver genes within individual patients. Many recurrently
altered driver genes had higher influence than other driver genes. Examples include EGFR in GBM;
TP53 in OV; ERG in PRAD; and TP53, PIK3CA and PTEN in BRCA.
Interestingly, among the highly influential genes there were also less-recurrent but functionally
important and actionable driver genes. For example, ABCB1, altered by somatic mutation, was an influential
driver gene in seven GBM patients (Figure 3.2F). ABCB1 is a membrane-bound protein present in the endothelial
cells of the blood-brain barrier. It harnesses the energy of ATP hydrolysis to drive the unidirectional
transport of exogenous and xenobiotic substances (drug compounds) from the cytoplasm to the
extracellular space. It is known to transport many anticancer compounds including Temozolomide (TMZ),
which is used as a first-line treatment for GBM patients. Mutations and over-expression of ABCB1 in
GBM have been associated with resistance to TMZ [107]. It was intriguing that some of these GBM
patients had undergone treatment prior to tissue collection and were initially mislabelled as untreated
patients. Treatment-induced selection pressure on the drug transporter might be a plausible reason for
the high influence exerted by ABCB1.
Similarly, HIT’nDRIVE predicted BRAF as a driver gene in eight PRAD patients (6 somatic mutations
and 2 gene fusions) (Figure 3.2G). These patients harboured BRAF as a highly influential driver
gene. None of these patients harboured the BRAF V600E mutation that is prevalent in cutaneous melanomas,
thyroid cancer and many other cancer types. However, BRAF L597R can be targeted using MEK inhibitors
[21, 43]. BRAF plays important roles in growth factor signalling pathways, which affect cell division
and differentiation. These results serve as proof of concept that HIT’nDRIVE can prioritize functionally
relevant cancer driver genes.
3.3.3 Breast cancer subtype classification using driver modules.
Our next goal was to classify four major subtypes of breast cancer: Basal, HER2, Luminal-A and
Luminal-B. For that purpose, we performed binary classification for each subtype, e.g. Basal vs
non-Basal (including the normal samples). This was achieved through the use of HIT’nDRIVE-identified
driver genes from TCGA-BRCA as seed genes, with which we identified subtype-specific driver modules
from TCGA-BRCA gene-expression data (as described for tumor classification). We obtained
37, 16, 43 and 39 subtype-specific driver modules for the Basal, HER2, Luminal-A and Luminal-B
subtypes, respectively. As described above, using these subtype-specific driver modules as features, we
performed independent classification of BRCA subtypes in the TCGA-BRCA, METABRIC-Cambridge and
METABRIC-Vancouver datasets [42].
The majority of Basal-like tumors constitute Triple-Negative Breast Cancer (TNBC), which are highly
aggressive tumors characterized by lack of expression of estrogen receptor 1 (ESR1), progesterone
receptor (PGR) and erb-b2 receptor tyrosine kinase 2 (ERBB2). Molecular mechanisms driving TNBC are
the least understood and hence no targeted therapies for TNBC yet exist [17]. Interestingly, HIT’nDRIVE
seeded driver modules were able to classify Basal-like tumors with much higher accuracy (98%) than
the other BRCA subtypes: HER2 (94%), Luminal-A (85%) and Luminal-B (83%) (Figure 3.3A).
As expected, ESR1 and PGR were highly expressed in Luminal-A/B but not in the Basal and HER2
subtypes. Modules containing ESR1 were consistently down-regulated in the Basal subtype and up-regulated in
the Luminal-A/B subtypes, whereas module LUMB-03 was up-regulated in the Luminal-B subtype. The ESR1
network neighbourhood included eleven known transcriptional targets of ESR1 (TFF1, PGR, SLC9A3R1,
GNAS, RARA, WWP1, WNT5A, TCF7L2, FKBP4, SPRY2, and RAD54B). These results were consistent
with previous findings [51]. ERBB2 was expressed in only 9 (of 16) HER2 modules and was the most
prominent hub in the large interactome of HER2 modules. All modules containing ERBB2 were
up-regulated in the HER2 subtype and module expression patterns were consistent across the different BRCA datasets.
PGR was present in 2 modules (BASAL-26 and HER2-12), both of which were down-regulated in the Basal
subtype but up-regulated in Luminal-A/B. These results strongly suggest that HIT’nDRIVE can capture
subtype-specific driver genes, and that the driver-seeded modules we identified can indeed differentiate
BRCA subtypes.
3.3.4 Subtype-specific breast cancer driver modules are associated with survival outcome.
To test for association of subtype-specific driver modules with patient survival outcome, we developed
a risk-score defined as a linear combination of the normalized gene-expression values of the component
genes in the module, weighted by their estimated univariate Cox proportional-hazard regression
coefficients (see Methods). Based on the risk-score values, patients were stratified into low-risk (risk-score
< 33 percentile) and high-risk (risk-score > 66 percentile) groups. Both the Cox regression coefficients of
each gene and the risk-score cutoff values for each module were estimated from the TCGA-BRCA cohort
(training dataset); these values were then applied to the METABRIC cohorts (test dataset). To assess whether the
risk-score assignment to high/low categories was valid, a log-rank test was performed for each module
in both training and test datasets.
We first compared driver-seeded modules against driver-gene-free modules that, according to OptDis,
have the best discriminative score for the TCGA-BRCA dataset. For each module we calculated
three distinct indices: log-rank test p-value, Hazard Ratio (HR) and Concordance-index (C-INDEX). We
found driver-seeded modules to outperform driver-free modules on all three indices, demonstrating that
the driver-seeded modules were better correlated with survival (Figure 3.3B). Motivated by this, we
identified the top modules for each of the BRCA subtypes that do well on all three indices and
checked whether they return meaningful results with respect to survival. We found 9 driver modules
significantly associated with patients’ survival outcome (p-value < 0.01, hazard-ratio > 1.5 and
concordance-index > 0.5) in the TCGA-BRCA cohort. These 9 modules were also significantly associated
with patient survival outcome (p-value < 0.01) in two additional cohorts (METABRIC cohorts). It is
interesting to note that two of these modules (BASAL-02 and HER2-01) were seeded by the oncogene
nuclear receptor coactivator 3 (NCOA3). The NCOA3 driver module was the second-topmost
module (Figure 3.3C) to separate Basal from other subtypes and the top-most module to separate the HER2
subtype. The NCOA3 driver module was down-regulated in the Basal subtype and associated with patients’
overall survival (Figure 3.3D-E). A fraction of breast (and ovarian) cancer patients are known to harbour
NCOA3 mutation, amplification or deletion [71]. NCOA3 alone cannot distinguish the Basal subtype.
NCOA3 requires other component genes in the module (AR, XBP1, TFF1 and SPDEF) to collectively
distinguish the Basal subtype which, to our knowledge, is a novel finding. However, the interactions
within the module are well known. NCOA3 is a coactivator of the steroid hormone receptors AR and ESR1,
and a transcriptional target of XBP1 [71]. NCOA3 is known to stimulate many intracellular signaling
pathways that are critical for cancer proliferation and metastasis. The activity of NCOA3 is known to
be associated with reduced responsiveness to tamoxifen in patients [126]. SPDEF is associated with
regulation of AR activity [100].
3.3.5 HIT’nDRIVE seeded driver genes accurately predict drug efficacy
Next, we obtained somatic mutation, copy number aberration and gene expression data of pan-cancer cell
lines from the Genomics of Drug Sensitivity in Cancer (GDSC) project [80]. We used HIT’nDRIVE to identify
driver genes of individual cancer cell lines. Following up on the premise of [80] that potential driver
genes (i.e. cancer genes, which include the CGC genes) alone could predict drug efficacy fairly well, we used the
predicted driver genes as seeds in the network (STRING v10) to identify sub-networks that
discriminate between the drug-response phenotypes (i.e. sensitive vs resistant cell lines). As available
in GDSC, 265 different drug treatments were tested on each cell line provided. We present results for
25 cancer types; the remaining 5 cancer types, for which only a very limited number of cell lines are
available, did not allow a statistically meaningful analysis and were therefore not used.
Perhaps our most interesting result is that, for many drugs, the top HIT’nDRIVE predicted driver
module for cell lines of a specific cancer type (more specifically, OptDis modules seeded by HIT’nDRIVE
identified driver genes, prioritized with respect to drug efficacy) includes not only the drug target but
also the associated (downstream) signaling pathway. Equally importantly, we measured the accuracy of
drug-response phenotype classification using HIT’nDRIVE-OptDis for each drug treatment in different cancer
types (Figure 3.4A). In most cancer types, HIT’nDRIVE-OptDis correctly predicted the response to more
than 25% of the drugs in 95% of the cell lines or more. Specifically, Stomach adenocarcinoma (STAD)
and Chronic Myelogenous Leukemia (LCML) are the cancer types with the highest fraction of drugs
predicted with an accuracy of 95% or more, whereas Liver hepatocellular carcinoma (LIHC) and GBM are
the cancer types with the lowest fraction of drugs predicted with the same accuracy. Below we provide
some of our observations on three well-known/promising cancer drugs for which we obtained high
accuracy on cell lines of specific cancer types.
Gefitinib is a clinically approved (for patients with non-small cell lung cancer) protein kinase
inhibitor which selectively inhibits EGFR. Interestingly, in BRCA, EGFR copy-number amplification or
overexpression primarily activates the RAS-RAF-MAPK pathway and the PI3K-AKT-mTOR pathway, triggering
cell proliferation, invasion and survival. Using HIT’nDRIVE, EGFR was found to be a
driver gene of BRCA cell lines. Furthermore, the EGFR-seeded driver module was the second highest scoring
module to distinguish the drug-response phenotype, increasing the classification accuracy to 98%
(Figure 3.4B,C).
As another example, Nutlin-3a is a promising pre-clinical stage compound that inhibits the interaction
between MDM2 and TP53, inducing apoptosis. MDM2 was predicted as a driver gene in OV cell lines
by HIT’nDRIVE. The MDM2-seeded module was the top predictor (maximum accuracy 94%) of the
drug-response phenotype when treated with Nutlin-3a (Figure 3.4B,E). Our method predicted many other
interacting partners (both as seed or component genes in the module) of MDM2 and TP53 which are
known to play a critical role in the TP53 pathway.
Finally, TMZ is a clinically approved first-line therapy for GBM. ABC transporters (including ABCB1)
export TMZ from the cytoplasm of a cell to the extracellular space. TMZ methylates selected
nucleotides of DNA, triggering the DNA repair pathway. MGMT specifically removes the methyl groups
from the methylated nucleotides, thereby averting DNA strand breaks. MGMT was predicted as a component
gene in the third top-scoring module. Failure to repair DNA strand breaks triggers the DNA damage
response pathway, further activating TP53 and apoptosis. Interestingly, TP53 was predicted as the seed
of the top scoring module by HIT’nDRIVE-OptDis. Furthermore, another gene in the DNA damage
response pathway, CDKN2A, seeds another top ranking module, which improves the overall classification
accuracy to 97% (Figure 3.4B,D). Note that both CDKN2A and TP53 are among the most frequently altered
genes in GBM.
3.4 Discussion
In this chapter we have demonstrated the following. (1) HIT’nDRIVE increases our ability to identify potential
genomic driver alterations. (2) HIT’nDRIVE prioritizes clinically actionable driver genes, many of which
happen to be private drivers. This suggests that HIT’nDRIVE can recover in silico what lengthy and costly
experimental approaches have established for common tumor types, which
supports the biological relevance of HIT’nDRIVE’s algorithmic framework. The fact that a
high proportion of HIT’nDRIVE prioritized drivers in well-studied cancer types overlap with known driver
genes increases our confidence in the calls made by HIT’nDRIVE in rarer tumor types for which driver
genes are mostly unknown. (3) HIT’nDRIVE prioritizes driver genes present in both the centre and
periphery of an interaction network. (4) Our analysis revealed that driver genes have higher collective
influence on the transcriptome than other altered genes. Some of these driver genes are central and
naturally have high influence; however, there are also many non-central driver genes with high influence
over other genes in the network. (5) HIT’nDRIVE is especially suitable for identifying such non-central
driver genes or infrequent/private drivers. (6) HIT’nDRIVE can capture subtype-specific driver genes
and such driver-seeded modules can indeed differentiate between different subtypes of a cancer. (7)
We have demonstrated that subtype-specific driver modules are also associated with patients’ survival
outcome, providing additional evidence that these driver genes have clinical significance. (8) We also
demonstrated that HIT’nDRIVE seeded driver genes (more specifically, OptDis modules seeded by
HIT’nDRIVE identified driver genes, prioritized with respect to drug efficacy) include not only the drug target
but also the associated (downstream) signaling pathway. This opens the possibility of identifying
and clinically targeting multiple genes (not necessarily sequence-wise altered, but nevertheless in the
module identified by HIT’nDRIVE) that dysregulate critical oncogenic or metabolic pathways.
We also note that targeted therapeutics are being extensively used in clinical trials but the drug
response rate is very poor (only ∼5% of patients in clinical trials have a good response to targeted
therapeutics) [111, 135]. This is most likely because even if a cancer patient harbours an alteration for which
targeted therapeutics are available, we do not know whether that alteration is responsible for driving the tumor
[16]. HIT’nDRIVE could potentially play a key role by prioritizing potential driver alterations from a vast
pool of passenger alterations. In our study, we have used drug efficacy data from pan-cancer cell lines
in order to demonstrate that the potential genomic drivers (more precisely, driver gene seeded modules)
of the cell lines can be used as features to predict drug efficacy. We believe that applying HIT’nDRIVE
in a similar manner in clinical trials to predict drug efficacy would likely improve the
drug response rate.
HIT’nDRIVE predicted ABCB1 as the most influential driver gene in seven TCGA-GBM cases
that were treated with TMZ prior to tissue collection. Using the GDSC dataset, we demonstrated that
HIT’nDRIVE-OptDis can predict mechanisms of drug sensitivity for TMZ and other drugs (Figure 3.4G-H).
Since ABCB1 was not mutated in any of the GBM cell lines in the analysis, it was not identified
as a driver gene of GBM cell lines. However, the top seed driver gene, TP53, is an interaction partner
of ABCB1 (in the STRING v10 network). Other seed driver genes and their component genes in the module
that are direct interaction partners of ABCB1 are UBC, CAV1, WDTC1 and DNAH8. ABC transporters
(including ABCB1) export TMZ from the cytoplasm of a cell to the extracellular space. On
the other hand, DNA damage caused by TMZ activates TP53, thereby dysregulating apoptotic pathways.
Thus, the presented analysis suggests that the downstream expression changes are, most likely, the
manifestation of the selection pressure on ABCB1 induced by TMZ treatment.
Protein-protein interaction (PPI) networks representing physical interactions now include thousands
of proteins and over a million (undirected) interactions between them. Regulatory networks on the other
hand represent gene/protein regulation occurring at multiple levels of biological systems through directed
links. Since available regulatory networks are very limited in size and scope, our study focuses on PPI
networks. However, HIT’nDRIVE can easily be applied to regulatory networks as they grow in size and
scope. In addition, the use of multi-hitting time as a distance measure between two or more driver genes
and a target gene enables HIT’nDRIVE to capture synthetic rescue like scenarios; this is ideally suited
for undirected PPI networks, but in principle can be extended to regulatory networks in the future.
HIT’nDRIVE is a driver gene prioritization tool that is flexible enough to incorporate different types
of -omics data. The principles underlying both RWFL and HIT’nDRIVE can be used to identify causal
genes in other complex diseases that pose problems analogous to those in cancer. Finally, we believe that
applications of the RWFL problem may extend beyond driver gene identification to influence
analysis in social networks, disease networks and others.
3.5 Methods
3.5.1 Datasets and analysis
We used publicly available datasets of four major cancer types, GBM [175], OV [176], BRCA [177],
and PRAD [178], from the TCGA project. Details can be found in Section 2.6.
3.5.2 Genomics of drug sensitivity in cancer
Somatic mutation, copy-number alteration, gene-expression and drug screening data of cancer cell
lines were downloaded from the Genomics of Drug Sensitivity in Cancer (GDSC) [80] website
http://www.cancerrxgene.org/downloads. Data were downloaded in August 2016.
3.5.3 Pathway enrichment analysis
The selected set of genes was tested for enrichment against gene sets of pathways present in the Molecular
Signature Database (MSigDB) v5.0 [162]. A Fisher’s exact test based gene set enrichment analysis was
used for this purpose. A cut-off threshold of false discovery rate (FDR) ≤ 0.01 was used to obtain the
significantly enriched pathways. An R implementation of GESD test is available at
https://github.com/raunakms/GSEA-Fisher. The same procedure was used to assign biological function to the gene-modules.
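A Fisher's exact test for over-representation reduces to a hypergeometric tail probability, which can be sketched with the standard library alone. The sketch below is not the R implementation referenced above; the function names are hypothetical, and the use of the Benjamini-Hochberg procedure for the FDR is an assumption, since the text does not state which FDR correction was applied.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: genes in the background universe    K: universe genes in the pathway
    n: genes in the selected set           k: selected genes in the pathway
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy example: 3 selected genes, all of which fall in a 3-gene pathway,
# drawn from a 10-gene universe (illustrative numbers only)
p = fisher_enrichment_p(k=3, n=3, K=3, N=10)
adj = benjamini_hochberg([0.01, 0.04, 0.03])
```

Pathways whose adjusted value is at most 0.01 would then be reported as significantly enriched.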
3.5.4 Association of driver modules with patients’ survival outcome
To test for association of driver modules with patients’ survival outcome, we developed a risk-score based
on multi-gene (component genes of the module) expression. The risk-score (S) is defined as the
sum of the normalized gene-expression values of the component genes in the module, weighted by their
estimated univariate Cox proportional-hazard regression coefficients [15], as given in the equation below.
\[ S = \sum_{i=1}^{k} \beta_i \, x_{ij} \]
Here i and j represent a gene and a patient respectively, β_i is the Cox regression coefficient for gene
i, x_{ij} is the normalized gene-expression of gene i in patient j, and k is the number of component genes
in a gene-module. The normalized gene-expression values were fitted against overall survival time, with
living status as the censoring variable, using univariate Cox proportional-hazard regression (exact method).
Based on the risk-score values, patients were stratified into two groups: a low-risk group (patients with
S below the 33rd percentile of S) and a high-risk group (patients with S above the 66th percentile of S).
Patients that fall in between (i.e. patients with S between the 33rd and 66th percentiles of S, inclusive)
were excluded from further analysis, as these patients form an intermediate-risk group and are expected to
introduce noise when performing the log-rank test.
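The risk-score computation and the percentile-based stratification can be illustrated with a short sketch. It assumes the Cox coefficients have already been estimated elsewhere; the function names, the linear-interpolation percentile, and the toy numbers are hypothetical and chosen only for illustration.

```python
def risk_scores(beta, expression):
    """Weighted sum per patient: S_j = sum_i beta[i] * expression[i][j].

    expression[i][j] is the normalized expression of gene i in patient j.
    """
    n_patients = len(expression[0])
    return [
        sum(b * gene_values[j] for b, gene_values in zip(beta, expression))
        for j in range(n_patients)
    ]


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def stratify(scores, low_cutoff, high_cutoff):
    """Label each patient 'low', 'high', or None (intermediate, excluded)."""
    groups = []
    for s in scores:
        if s < low_cutoff:
            groups.append("low")
        elif s > high_cutoff:
            groups.append("high")
        else:
            groups.append(None)
    return groups


# Hypothetical module with two genes measured in four patients.
beta = [0.5, -1.0]
expression = [
    [1.0, 2.0, 3.0, 4.0],  # gene 1
    [0.0, 1.0, 0.0, 1.0],  # gene 2
]
scores = risk_scores(beta, expression)
low_cutoff = percentile(scores, 33)
high_cutoff = percentile(scores, 66)
groups = stratify(scores, low_cutoff, high_cutoff)
```

Keeping the cutoffs as explicit arguments of `stratify` makes it possible to estimate them on one cohort and reuse them unchanged on another.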
Both the Cox regression coefficients of each gene and the risk-score cutoff values for each module were
estimated from the TCGA-BRCA cohort (training dataset); these values were then applied to the METABRIC
cohorts (test dataset). To assess whether the risk-score assignment to high/low categories was valid, a
log-rank test was performed for each module in both the training and test datasets.
Finally, to identify the list of driver-modules that were robust enough to predict patients’
survival, we calculated the log-rank test p-value, the hazard-ratio (HR) (Wald test), and the
concordance-index (c-index) (Wald test).
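For readers unfamiliar with the concordance-index, the following sketch computes Harrell's c-index from survival times, event indicators, and risk-scores. The choice of Harrell's estimator, the handling of ties, and the toy data are assumptions; the text above does not specify which estimator or software was used.

```python
def concordance_index(times, events, scores):
    """Harrell's c-index for right-censored data.

    A pair (i, j) is comparable when patient i had the event and a shorter
    survival time than patient j. The pair is concordant when the patient
    with the shorter survival has the higher risk-score. Tied scores count
    as half; pairs with tied times are ignored in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical data: four patients, the third one censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_mixed = concordance_index(times, events, [4.0, 1.0, 2.0, 3.0])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect agreement between risk-score and survival order.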
Figure 3.1: Summary of driver genes prioritized by HIT’nDRIVE. (A) Distribution of predicted driver genes in cancer gene databases. The CGC database contains genes for which mutations have been causally implicated in cancer. Genes curated in the CGC database represent likely drivers of cancer. COSMIC is a comprehensive database of somatic mutations that have been reported in different cancers. However, not every gene present in the COSMIC database represents a driver of cancer. (B) Distribution of driver genes in druggable gene databases. Actionable genes in cancer therapy were derived from the TARGET database. The list of druggable genes was extracted from the DGI database. (A-B) The numbers in the panel represent the number of genes in the respective categories. (C) Distribution of patient druggability. Patient druggability was assessed using information in the TARGET and DGI databases. The numbers in the panel represent the number of patients in the respective categories.
Figure 3.2: Network properties of driver genes. (A) The centrality of the predicted drivers in the STRING v10 network. The size of the circles is proportional to the alteration frequency of the driver gene. The color scale represents the total influence of the driver gene on the expression outliers. (B) Correlation between influence and centrality. Each dot represents a target node receiving a certain amount of influence from all source nodes in the network. A lowess regression line is shown in blue. (C) Correlation between incoming and outgoing influence of a node. Each dot represents a node in the network and the color scale represents its betweenness centrality. A linear regression line is shown in blue. (D) Boxplot of the total influence of driver genes predicted by HIT’nDRIVE on the expression outliers compared to that of other altered genes (genes not predicted as drivers). (E) Correlation between gene influence and its alteration frequency in the respective patient cohort. (F) Relative influence of driver genes in each patient in the GBM cohort with a mutation in ABCB1. (G) Relative influence of driver genes in each patient in the PRAD cohort with a mutation in BRAF. (All gene influence values have been multiplied by 10^5 before log transformation.)
Figure 3.3: BRCA subtype classification using driver modules. (A) Performance accuracy of classifying different subtypes of breast cancer using the activity-score of subtype specific driver modules as features in three distinct datasets. (B) Box plot comparing subtype specific driver-seeded modules and driver-free modules with respect to three distinct measures: log-rank test p-value, hazard-ratio (HR) and concordance-index (c-index). (C) A BRCA subtype specific driver module (BASAL-02) seeded by NCOA3 that distinguished the Basal subtype from the rest of the BRCA subtypes. (D) Activity-score of the BASAL-02 module across different BRCA subtypes. (E) Kaplan-Meier plot showing the significant association of the BASAL-02 module with patients’ clinical outcome in the three datasets considered.
Figure 3.4: Drug efficacy predicted by HIT’nDRIVE seeded driver genes. (A) Accuracy of drug-response phenotype classification for all 265 drugs used in the GDSC study across 25 cancer types (the remaining 5 cancer types, for which only a very limited number of cell lines have been made available, are statistically insignificant and thus have not been used). The classification accuracy for each drug on each cancer type is measured based on the collective use of at most 10 best discriminating modules, i.e. the accuracy is maximized across the range of 1 to 10 (best discriminating) modules. Note that many of the drugs were not tested on all cancer types; in fact, for the vast majority of cancer types only a handful of drugs were tested. (B) Classification accuracy of modules that distinguish the drug-response phenotypes after treatment with Gefitinib in BRCA cell-lines (top-panel), Temozolomide in GBM cell-lines (middle-panel), and Nutlin-3a in OV cell-lines (bottom-panel). Important genes identified in the modules and involved in the dysregulated signalling pathways have been highlighted. (C-E) The figures represent the dysregulated signalling pathways in the respective drug perturbation.
Chapter 4
Integrated multi-omics molecular subtyping predicts therapeutic vulnerability in malignant peritoneal mesothelioma
4.1 Introduction
Malignant mesothelioma is a rare but aggressive cancer that arises from the internal membrane linings of
the pleura and the peritoneum. While the majority of mesotheliomas are pleural in origin, peritoneal
mesothelioma (PeM) accounts for approximately 10-20% of all mesothelioma cases. PeM emerges from
mesothelial cells lining the peritoneal/abdominal cavities. The incidence rate of PeM is estimated
to be less than 0.5 per 100,000, with 400-800 cases reported annually in the United States of America
alone [172]. Occupational asbestos exposure is a significant risk factor in the development of pleural
mesothelioma (PM). However, epidemiological studies suggest that, unlike in PM, asbestos exposure plays
a far smaller role in the etiology of PeM tumors [172].
Mesothelioma is typically diagnosed in the advanced stages of the disease. A combination of
cytoreductive surgery (CRS) and hyperthermic intraperitoneal chemotherapy (HIPEC), sometimes followed by
normothermic intraperitoneal chemotherapy (NIPEC), has recently emerged as a first-line treatment for
PeM [163]. However, even with this regimen, complete cytoreduction is hard to achieve and death ensues
for most patients. Actionable molecular targets for PeM, which are critical for precision oncology, remain to be
defined. Immune checkpoint blockade therapy in PM has recently attracted much attention, given that 20-40%
of PM cases are reported to have an inflammatory phenotype [174]. Although clinical trials typically lump
PeM and PM together for immune checkpoint blockade [25–27, 56, 110], no study has yet provided a rationale
for why PeM should be considered for immunotherapy.
Studies investigating genetic abnormalities of PeM [5, 34, 83, 85, 99, 151, 158, 184] have revealed
recurrent copy-number losses of CDKN2A on 9p21, NF2 on 22q and BAP1 on 3p21. In addition, these
studies also reported recurrent mutations in BAP1, SETD2, and DDX3X. However, the downstream
consequences of these genomic alterations in PeM have not been investigated in great detail. Genomic
information alone is unlikely to uncover candidate therapeutic targets unless it is analyzed in the context
of transcriptomes and proteomes.
In this study, we performed an integrated analysis of the genome, transcriptome, and proteome of 19
PeM tumors predominantly of epithelioid subtype.
4.2 Our Contributions
We present a first-in-field comprehensive integrative multi-omics analysis of a patient cohort of treatment-
naive PeM [156]. In a novel contribution, using HIT’nDRIVE, we identified PeM with BAP1 loss to form
a distinct molecular subtype characterized by distinct gene expression patterns of chromatin remodeling,
DNA repair pathways, and immune checkpoint receptor activation. We also demonstrate that this subtype
is correlated with an inflammatory tumor microenvironment and is thus a candidate for immune checkpoint
blockade therapies. Our findings reveal BAP1 to be a trackable prognostic and predictive biomarker
for PeM immunotherapy that refines PeM disease classification. This is significant because almost half
of PeM cases are now candidates for these therapies. BAP1 stratification may improve drug response
rates in ongoing phase-I and II clinical trials exploring the use of immune checkpoint blockade therapies
in PeM in which BAP1 status is not considered. This integrated molecular characterization provides a
comprehensive foundation for improved management of a subset of PeM patients.
Another novel and significant contribution is that we resolved the large discordance between
mRNA and protein expression patterns in the PeM cohort. Most of this discordance is attributed to chromatin
remodeling genes and proteins linked to multimeric protein complexes, the majority of which are direct
protein-interaction partners of BAP1. The discordance between the mRNA and the protein expression
patterns is most likely due to the ubiquitination and degradation of proteins in these BAP1-regulated
complexes to maintain functional stoichiometry.
4.3 Results
4.3.1 Patient Cohort description
We assembled a cohort of 19 tumors from 18 patients (referred to here as VPC-PeM) undergoing
CRS at Vancouver General Hospital (Vancouver, Canada), Mount Sinai Hospital (Toronto, Canada), and
Moores Cancer Centre (San Diego, California, USA). We obtained 19 fresh-frozen primary treatment-naive
PeM tumors and adjacent benign tissues or whole blood from the 18 cancer patients. For one patient,
MESO-18, two tumors from distinct sites were available. Immunohistochemical staining of tissues using
different biomarkers was evaluated by two independent pathologists. Both pathologists categorized all
19 tumors as epithelioid PeM with tumor cellularity higher than 75%. To the best of our
knowledge, this is the largest cohort of PeM subjected to an integrative multi-omics analysis.
4.3.2 Landscape of somatic mutations in PeM
To investigate the heterogeneity of somatic gene mutations in VPC-PeM, we performed high-coverage
exome sequencing (Ion Proton Hi-Q) of 19 tumors and 16 matched normal samples. We achieved a mean
coverage of 180x for cancerous samples and 96x for non-cancerous samples, with at least 43-77% of
targeted bases having a coverage of 100x. We identified 346 unique non-silent mutations (313 of which
were not previously reported in COSMIC [58]) affecting 202 unique genes. We observed an average of
0.015 protein-coding non-silent mutations per Mb per tumor sample. Patient MESO-18 had the highest
mutation burden (0.04 mutations per Mb) and MESO-11 had the lowest (0.001 mutations per Mb).
The non-silent mutation burden in PeM is low compared to other adult cancers, including
many abdominal cancers (Figure 4.1A), with the exception of prostate adenocarcinomas (PRAD), kidney
chromophobe carcinomas (KICH), and testicular germ cell tumors (TGCT). Notably, the mutation
burden in PeM was fairly similar to that in PM as well as pancreatic adenocarcinomas (PAAD). We also
assessed the mutational processes that contribute to alterations in tumors. Analysis of base-level transitions
and transversions at mutated sites showed that C>T transitions were predominant in PeM (Figure 4.1B).
Using the software deconstructSigs [143], we found that mutational signatures 1, 5, 12, and 6 were
operative in PeM. Interestingly, signature 1 is often correlated with age at diagnosis, and signature 6 is
associated with DNA mismatch repair and mostly found in microsatellite-unstable tumors [7].
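The base-level analysis of transitions and transversions can be illustrated with a small sketch that folds each single-nucleotide variant onto its pyrimidine reference base and tallies the six standard substitution classes. This covers only the substitution spectrum, not the signature fitting performed by deconstructSigs (an R package); the function names and the toy variants are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def substitution_class(ref, alt):
    """Return one of C>A, C>G, C>T, T>A, T>C, T>G for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    if ref in PURINES:
        # Report the change on the strand where the reference is a pyrimidine.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def is_transition(ref, alt):
    """Transitions stay within purines or within pyrimidines."""
    return (ref.upper() in PURINES) == (alt.upper() in PURINES)


def mutation_spectrum(snvs):
    """Count substitution classes over an iterable of (ref, alt) pairs."""
    return Counter(substitution_class(ref, alt) for ref, alt in snvs)


# Hypothetical variants: G>A is reported as C>T after strand folding.
spectrum = mutation_spectrum([("C", "T"), ("G", "A"), ("A", "C")])
```

Folding onto the pyrimidine strand is what reduces the twelve possible base changes to six classes, so that C>T and G>A are counted together.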
We first identified driver genes of PeM using our recently developed algorithm HIT’nDRIVE [155].
Briefly, HIT’nDRIVE measures the potential impact of genomic aberrations on changes in the global
expression of other genes/proteins which are in close proximity in a gene/protein-interaction network. It
then prioritizes those aberrations with the highest impact as cancer driver genes. HIT’nDRIVE
prioritized 25 unique driver genes in 15 PeM tumors for which matched genome and transcriptome data were
available (Figure 4.1C). Six genes (BAP1, BZW2, ABCA7, TP53, ARID2, and FMN2) were prioritized as
drivers, each harboring single nucleotide changes.
The mutation landscape of PeM was found to be highly heterogeneous. The nuclear deubiquitinase
BAP1 was the most frequently mutated gene (5 out of 19 tumors) in PeM tumors. BAP1 is a tumor-
suppressor gene known to be involved in chromatin remodeling, DNA double-strand break repair, and
regulation of transcription of many other genes [19]. Previous studies have also reported BAP1 as the
most frequently mutated gene in both PeM [5, 85] and PM [19, 24]. The BAP1 missense mutation in
MESO-18A/E resulted in a single amino-acid (AA) change in the ubiquitin carboxyl hydrolase domain
keeping the rest of the amino acid chain intact. In MESO-06 and MESO-09, a BAP1 frameshift deletion
resulted in a premature stop codon and chain termination. In MESO-09 approximately 91% of the BAP1
amino acid chain was intact, but in MESO-06 only 2% of the BAP1 amino acid chain was intact. We also
observed a BAP1 germline mutation in only one case (MESO-09). In three (15%) tumors, we identified
a recurrent R396I mutation in ZNF678, a zinc finger protein containing zinc-coordinating DNA binding
domains involved in transcriptional regulation. We compared the mutated genes in our VPC-PeM cohort
with publicly available datasets [1, 5, 24] of both PeM and PM. BAP1 was the only mutated gene
common to all three PeM cohorts. Twenty-five genes, including the tumor suppressors LATS1 and TP53
and the chromatin modifier SETD2, were common to at least two PeM cohorts. Many mutated genes
in VPC-PeM were also previously reported in PM. BAP1 and SETD2 were the two mutated genes
common to VPC-PeM and all four PM cohorts.
4.3.3 Copy number landscape in PeM
To investigate the somatic CNA profiles of PeM, we derived copy-number calls from exome sequencing
data using the software Nexus Copy Number Discovery Edition Version 8.0. The aggregate CNA profile
of PeM tumors is shown in Figure 4.2A-B. We observed a total of 1,281 CNA events across all samples.
On average, 10% of the protein-coding genome was altered per PeM tumor. MESO-14 had the
highest CNA burden (42%) whereas MESO-11 had the lowest (0.01%). Interestingly, mutation burden and
CNA burden in PeM were strongly correlated (R = 0.74).
We also compared the CNA burden in protein-coding regions of the VPC-PeM cohort with different
adult cancers from TCGA project. Similar to the mutation burden, VPC-PeM tumors were observed
at the lower end of the pan-cancer CNA burden spectrum. Only UCEC, PRAD, and PAAD tumors
had lower median CNA burden as compared to PeM tumors (Figure 4.2C). CNA status and mRNA
expression for around half of the genes were positively correlated (R ≥ 0.1) and 16% of the genes had
strong correlation (R ≥ 0.5). To identify cancer genes, we compared aberrations in protein-coding genes
with data from the CGC. Intriguingly, CNA status and mRNA expression for the majority of CGC genes
were positively correlated.
To identify recurrent focal CNAs in PeM tumors, we used the GISTIC [115] algorithm which yielded
5 regions of focal deletions (q < 0.05) including in 3p21 and 22q13 which are characteristic of malig-
nant mesotheliomas (Figure 4.2D). Furthermore, GISTIC prioritized 8 regions of focal amplification (q
< 0.05) which included genes such as IGH, VEGFD, BRD9, FOXL1, EGFR, and PDGFA (Figure 4.2D).
Copy-number status of these genes was also significantly correlated with their respective mRNA expres-
sion. Chromosome 1 was the most aberrant region in PeM and chromosomes 13 and 18 were relatively
unchanged except for MESO-14 (Figure 4.2B).
Using HIT’nDRIVE, we identified three genes on chromosome 3p21 (BAP1, PBRM1, and SETD2) as key
driver genes of PeM (Figure 4.1C). Chromosome 3p21 was deleted in almost half of the tumors (8
of 19) in the cohort. Here, we refer to tumors with 3p21 (or BAP1) loss as BAP1del and to the remaining
tumors, with 3p21 (or BAP1) copy number intact, as BAP1intact. Interestingly, BAP1 mRNA transcripts
in BAP1del tumors were expressed at lower levels than those in BAP1intact tumors (Wilcoxon
signed-rank test p-value = 3 × 10^-4) (Figure 4.2E). We validated this using immunohistochemical (IHC)
staining, which demonstrated a lack of BAP1 nuclear staining in the tumors with BAP1 homozygous deletion
(Figure 4.2F). Tumors with BAP1 heterozygous loss still displayed BAP1 nuclear staining. We observed
three BAP1-mutated cases among BAP1intact tumors. BAP1 mRNA transcripts in these three tumors
were expressed at high levels. As mentioned in the previous section, the mutation analysis also predicted
that despite mutation in BAP1 in these three tumors, the entire BAP1 amino-acid chain is still intact
and may be functionally active. Furthermore, we found DNA copy loss of 3p21 locus to include four
concomitantly deleted cancer genes - BAP1, SETD2, SMARCC1, and PBRM1, consistent with [208].
Copy-number status of these four genes was significantly correlated with their corresponding mRNA
expression, suggesting that the allelic loss of these genes is associated with decreased transcript levels.
These four genes are chromatin modifiers, and PBRM1 and SMARCC1 are part of SWI/SNF complex
that regulates transcription of a number of genes.
CNA status of genes associated with key cancer pathways was observed to be different between the
PeM subtypes (i.e. BAP1del and BAP1intact). We observed many genes involved in chromatin remodeling,
SWI/SNF complex and DNA repair pathway to be deleted in BAP1del tumors as compared to BAP1intact
tumors (Figure 4.1C). In contrast, we found copy-number gain of many genes in DNA repair
pathways (BRCA2, ATM, MGMT, and RAD51) and the cell cycle (MYC, CDK5, CCNB1, and CCND1) in the
BAP1intact tumors. Furthermore, PeM tumors (both BAP1del and BAP1intact) harbored CNA events in
carcinogenic pathways such as the MAPK, PI3K, MTOR, Wnt, and Hippo pathways. Interestingly, ESR1 copy
number deletion was enriched in BAP1del tumors, while co-amplification of EGFR and BRAF was present
in three BAP1intact tumors. Notably, in BAP1del tumors we identified copy-number loss of the tumor
suppressors LATS1/2 and copy-number gain of NF2 in one case, both of which have been previously
associated with mesotheliomas [24]. Both LATS1/2 and NF2 are key regulators of the Hippo pathway [105].
Unsupervised consensus clustering of tumor samples based on copy-number segmentation mean
values of the 3349 most variable genes identified four tumor sub-groups (Figure 4.2G). We observed
that BAP1del and BAP1intact tumors were grouped into distinct clusters. This indicates that BAP1del
tumors have copy-number profiles distinct from those of BAP1intact tumors. We identified 692 genes
(p-value < 0.01, Kruskal-Wallis test) with significantly different copy-number segments between the
clusters. These genes were mapped to eight distinct chromosome loci. Loci 19p, 6q, 1q, and 13q were
mostly gained in clusters 1 and 3, whereas Xq, 22q, and 7p losses were mostly in clusters 1 and 3.
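The per-gene Kruskal-Wallis comparison across clusters can be sketched as follows. This is a standard-library illustration, not the software used in the analysis; the tie correction, the chi-square approximation for the p-value, and the toy segment-mean values are assumptions made for the example.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-z)          # df = 2
    else:
        a, q = 0.5, erfc(sqrt(z))    # df = 1
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * exp(-z) / gamma(a + 1.0)
        a += 1.0
    return q


def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups; assumes not all values are identical."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    total, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        total += rank_sum ** 2 / len(g)
        pos += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    h /= 1.0 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical segment-mean values of one gene in three clusters.
h_stat, p_value = kruskal_wallis([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
```

In a genome-wide setting this function would be applied once per gene, with one group of segment-mean values per cluster, and genes passing the p-value threshold retained.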
4.3.4 Gene fusions in PeM
To identify the presence of gene fusions, we analyzed RNA-seq data from 15 PeM tumors using the deFuse
algorithm [114]. Overall, 82 unique gene fusion events were identified using our filtering criteria (see
Methods), of which we successfully validated 18 gene fusions using Sanger sequencing. We observed more
gene fusion events in BAP1del tumors than in BAP1intact tumors (Figure 4.3A-B).
Notably, BAP1, SETD2, PBRM1, and KANSL1 were prioritized as driver genes by HIT’nDRIVE on the
basis of gene fusions. Fusions in these genes were mostly found in the BAP1del subtype. MTG1-SCART1
was the most recurrent gene fusion, observed in 7 cases. MTG1 regulates the mitochondrial ribosome, which
synthesizes proteins essential for oxidative phosphorylation. SCART1 is a pseudogene predicted to act
as a co-receptor of certain T-cells. This was followed by GKAP1-KIF27 and KANSL1-ARL17B (Figure
4.3C), each of which was identified in 6 different cases. Three unique fusions were present in PBRM1, 2
in KANSL1, and 1 each in BAP1 and SETD2, all of which are involved in the chromatin remodeling process
(Figure 4.1C and 4.3D-F).
4.3.5 The global transcriptome and proteome profile of PeM
To segregate transcriptional subtypes of PeM, we performed total RNA-seq (Illumina HiSeq 4000) and
expression quantification on the 15 PeM tumor samples for which RNA was available (RNA from the remaining
four tumor samples did not pass the quality control checks). We first performed principal-component
analysis and unsupervised consensus clustering of all PeM tumors to determine transcriptomic patterns,
using genes selected on the basis of their variance among tumor specimens. Consensus clustering revealed
two distinct transcriptome sub-groups (Figure 4.4A). We found that BAP1intact and BAP1del tumors have
some distinct transcriptomic patterns;
however, a few samples showed an overlapping pattern.
We performed mass spectrometry (Fusion Orbitrap LC/MS/MS) with isobaric tagging for expressed
peptide identification and corresponding protein quantification on 16 PeM tumors and 7 matched normal
tissues (matched normal samples for the remaining tumors were not available), using Proteome Discoverer
as the processing pipeline. We identified 8242 unique proteins in the 23 samples analyzed. Surprisingly,
BAP1 protein was not detected in our MS experiment, likely due to inherent technical limitations with
these samples and/or their processing; quality control analyses of in-solution HeLa digests also show
very low BAP1, with only a single peptide observed in occasional runs. First, we analyzed the global
correlation between matched mRNA and protein expression. Although 58% (4715 of 8109) of proteins
showed positive mRNA-protein correlation (Pearson correlation; R ≥ 0.1), only 22.7% (1839) of the
proteins were strongly correlated with their corresponding mRNA (R ≥ 0.5). Expression of 2.4% (194)
of proteins was strongly negatively correlated with their corresponding mRNA (R ≤ -0.5). To analyze the
proteomic pattern across PeM tumors, we performed principal-component analysis and unsupervised
consensus clustering following the same procedure as described above for the transcriptome. Unlike the
transcriptome profiles, the proteome profiles of the BAP1 PeM tumor sub-types did not group into distinct
clusters (Figure 4.4B).
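The gene-wise mRNA-protein correlation summary reported above can be sketched as follows. The thresholds (R ≥ 0.1, R ≥ 0.5, R ≤ -0.5) are taken from the text; the function names, the dictionary-based data layout, and the toy expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # constant profile, correlation is undefined
    return sxy / sqrt(sxx * syy)


def summarize_correlation(mrna, protein):
    """Count genes by mRNA-protein correlation category across matched samples."""
    summary = {"n": 0, "positive": 0, "strong_positive": 0, "strong_negative": 0}
    for gene in sorted(set(mrna) & set(protein)):
        r = pearson(mrna[gene], protein[gene])
        if r is None:
            continue
        summary["n"] += 1
        summary["positive"] += r >= 0.1
        summary["strong_positive"] += r >= 0.5
        summary["strong_negative"] += r <= -0.5
    return summary


# Hypothetical expression of three genes in four matched samples.
mrna = {"G1": [1, 2, 3, 4], "G2": [1, 2, 3, 4], "G3": [1, 2, 3, 4]}
protein = {"G1": [2, 4, 6, 8], "G2": [4, 3, 2, 1], "G3": [1, 4, 2, 3]}
summary = summarize_correlation(mrna, protein)
```

Note that the "positive" category includes the strongly correlated genes, matching the way the percentages are reported in the text.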
To identify differentially expressed genes (DEG) between BAP1intact and BAP1del tumors, we performed the
Wilcoxon signed-rank test on mRNA and protein expression data independently. We identified 1520
and 466 DEG (with p-value < 0.05) using mRNA and protein expression data respectively. However,
only 53 genes were common to the two sets of DEG. As expected, BAP1, PBRM1,
SMARCA4, and SMARCD3 were among the top-500 DEG. Many other important cancer-related genes were
differentially expressed, such as CDK20, HIST1H4F, ERCC1, APOBEC3A, CDK11A, CSPG4, TGFB1,
IL6, LAG3, and ATM.
4.3.6 Transcriptional and post-transcriptional mechanisms regulate chromatin remodeling protein-complexes
Next, we aimed to study the extent to which changes in a gene’s copy-number profile affect its
corresponding protein expression. For this, we calculated the Pearson correlation between CNA and mRNA
expression and between CNA and protein expression. While the copy-number profiles of genes, on average,
agree well with their corresponding mRNA expression, a number of detected proteins had poor correlation
with their respective gene’s copy-number profile. Approximately 25% (1871 of 7462) of proteins were
observed to have poor correlation with their gene’s copy number; we define these here as “attenuated
proteins” (Methods, Figure 4.4C). Among the attenuated proteins, we identified important chromatin remodeling
proteins: PBRM1, SETD2, and SMARCC1. The attenuated proteins also included cancer genes such
as NF2, EGFR, APC, PIK3CA, and MAP3K4. We observed that the attenuated proteins were significantly
enriched with direct protein-protein interaction partners of UBC (hypergeometric test p-value:
10^-5), BAP1 (10^-3), and PBRM1 (10^-2) in the STRING v10 interaction network. Notably, geneset
enrichment analysis revealed that attenuated proteins are more likely to form part of a multimeric complex
or bind to macromolecules (Figure 4.4D). These results corroborate previous findings from studies
analyzing breast, ovarian and colorectal cancer datasets [62]. These attenuated proteins were found to be
involved in mRNA processing, the DNA repair pathway, cell cycle regulation, the immune system, and
carbohydrate and lipid metabolism. Strikingly, we found that DEG between the PeM subtypes are
significantly associated with protein attenuation (Chi-squared test p-value: 10^-4 using mRNA expression
DEG, 10^-6 using protein expression DEG). These findings suggest that the effects of CNA are attenuated
at the protein level via post-transcriptional modification.
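The association between differential expression and protein attenuation amounts to a test of independence on a 2×2 contingency table. A minimal sketch is given below, assuming Pearson's chi-squared statistic without continuity correction; the function name and the counts are hypothetical and are not the counts from this study.

```python
from math import erfc, sqrt


def chi2_test_2x2(a, b, c, d):
    """Pearson chi-squared test of independence for the table [[a, b], [c, d]].

    Rows: DEG / not DEG. Columns: attenuated / not attenuated.
    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction. All row and column totals must be non-zero.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: 10 attenuated DEG, 20 other DEG,
# 30 attenuated non-DEG, 40 other non-DEG.
statistic, p_value = chi2_test_2x2(10, 20, 30, 40)
```

A small p-value would indicate that attenuated proteins are over- or under-represented among the differentially expressed genes relative to what independence predicts.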
To identify large protein complexes that contain the attenuated proteins and are variable (i.e. at
least one protein subunit of the complex is differentially expressed) between PeM subtypes, we leveraged a
manually curated set of core protein complexes from the CORUM database [145]. These included many
protein complexes involved in DNA conformation modification, DNA repair, transcriptional regulation,
and post-translational modification including ubiquitination. Using our data, we observed that the majority
of the protein complexes were highly co-regulated at the protein level rather than at the mRNA level.
Notably, we identified the SWI/SNF (BAF and PBAF) and HDAC complexes, which were highly co-regulated
(Figure 4.4E-G). We found copy-number deletion in many subunits of the SWI/SNF complex, mostly in
the BAP1del subtype (Figure 4.1C). About one quarter of proteins in the BAF complex and half of
proteins in PBAF were attenuated. PBRM1 was both attenuated at the protein level and differentially
expressed between PeM subtypes. SMARCB1 and SMARCA4 were also differentially expressed
between PeM subtypes in this complex (Figure 4.4H). We further identified a number of HDAC complex
components as highly co-regulated. The complex includes the histone deacetylases HDAC1/2, which
regulate expression of a number of genes through chromatin remodeling. About one-third of protein
subunits in the complex were attenuated at the protein level. More importantly, HDAC1, CHD4 and
ZMYM2 were differentially expressed between PeM subtypes in the protein complex, and different
members of the HDAC protein family were highly expressed in the BAP1del subtype (Figure 4.4I). This
indicates the potential use of HDAC inhibitors to suppress tumor growth in the BAP1del subtype. We
note that both SWI/SNF and HDAC complexes interact with BAP1. Expression patterns of many subunits
of these complexes were either highly correlated or highly anti-correlated with BAP1 expression (Figure
4.4E-G). Although mRNA transcripts are transcribed in proportion to the changes in the copy-number profile
of the gene, the corresponding proteins are often stabilized when in complex, and free proteins in excess
are usually ubiquitinated and targeted for proteasomal degradation to maintain stoichiometry [62].
4.3.7 BAP1del subtype is characterized by distinct expression patterns of genes involved in DNA repair pathway, and immune checkpoint receptor activation
To identify the pathways dysregulated by the DEG between the PeM subtypes, we performed
hypergeometric test based geneset enrichment analysis (Methods) using the REACTOME pathway database.
Intriguingly, we observed high concordance between the pathways dysregulated by the two sets (mRNA
and protein expression data) of top-500 DEG (Figure 4.5A-B). Unsupervised clustering of pathways
revealed two distinct clusters for BAP1del and BAP1intact tumors. This indicates that the enriched
pathways are also differentially expressed between the patient groups. BAP1del patients demonstrated
elevated levels of RNA and protein metabolism as compared to BAP1intact patients. Many genes
involved in chromatin remodeling and DNA damage repair were differentially expressed between the groups.
Our data suggest that BAP1del tumors have repressed DNA damage response pathways. Most importantly,
protein expression data revealed that PARP1 is highly expressed in BAP1del tumors as compared
to BAP1intact tumors, indicating that PARP1 inhibition is a potential therapeutic option for BAP1del
tumors. Genes involved in cell-cycle and apoptotic pathways were observed to be highly expressed in
BAP1del patients. Furthermore, glucose and fatty-acid metabolism pathways were repressed in BAP1del as
compared to BAP1intact tumors. More interestingly, we observed a striking difference in immune-system
associated pathways between the PeM subtypes. Whereas BAP1del patients demonstrated strong activity of
cytokine signaling and the innate immune system, the MHC-I/II antigen presentation system and the
adaptive immune system were active in BAP1intact patients.
Prompted by this finding, we next analyzed whether PeM tumors were infiltrated with leukocytes. To
assess the extent of leukocyte infiltration, we computed an expression-based (RNA-seq and protein) score
using the immune-cell and stromal markers proposed in [206]. We discovered that the immune marker
gene score was strongly correlated with the stromal marker gene score (Methods and Figure 4.5C-D). Using
the CIBERSORT [124] software, we computationally estimated leukocyte representation in the bulk tumor
transcriptome. We observed massive infiltration of T cells in the majority of the PeM tumors (Figure
4.5E). A subset of PeM tumors had massive infiltration of B cells in addition to T cells. Interestingly,
when we grouped the PeM tumors by their BAP1 aberration status, there was a marked difference in the
proportion of infiltrated plasma cells, natural killer (NK) cells, mast cells, T cells and B cells between
the groups. Whereas the proportions of plasma cells, NK cells and B cells were lower in the BAP1del
tumors, there was more infiltration of mast cells and T cells in BAP1del tumors as compared to
BAP1intact tumors. We performed tissue microarray (TMA) IHC staining with CD3 and CD8 antibodies
on PeM tumors. We observed that BAP1del PeM tumors stained positively for both CD3 and CD8,
confirming infiltration of T cells in BAP1del PeM tumors (Figure 4.5F). Combined, this strongly indicates
that leukocytes from the tumor microenvironment infiltrate the PeM tumors.
Finally, we surveyed the PeM tumors for expression of genes involved in immune checkpoint
pathways. A number of immune checkpoint receptors were highly expressed in BAP1del tumors relative to
BAP1intact tumors. These included CD274 (PD-L1), CD80, CTLA4, LAG3, and ICOS (Figure 4.5G), for
which inhibitors are either clinically approved or at varying stages of clinical trials. Gene
expression of these immune checkpoint receptors was highly correlated with the immune score (Figure 4.5H).
Moreover, a number of MHC genes, immuno-inhibitor genes, and immuno-stimulator genes were
differentially expressed between BAP1del and BAP1intact tumors. Furthermore, we analyzed whether the
immune checkpoint receptors were differentially expressed between PM tumors from TCGA with and
without 3p21 loss. Unlike in PeM, we did not observe a significant difference in immune checkpoint
receptor expression between the PM tumor groups (i.e. BAP1del and BAP1intact). These findings suggest
that BAP1del PeM tumors could potentially be targeted with immune-checkpoint inhibitors, while PM
tumors may be less likely to respond.
4.4 DiscussionIn this study, we present a comprehensive integrative multi-omics analysis of malignant peritoneal mesothe-
liomas. Even though this is a rare disease we managed to amass a cohort of 19 tumors. Prior studies of
mesotheliomas, performed using a single omic platform, have established loss of function mutation or
copy-number loss of BAP1 as a key driver event in both PeM and PM. Our novel contribution to PeM is
that we provide evidence from integrative multi-omics analyses that BAP1 copy number loss (BAP1del)
forms a distinct molecular subtype of PeM. This subtype of PeM is characterized by distinct expression
patterns of genes involved in chromatin remodeling, DNA repair pathway, and immune checkpoint
activation. Moreover, the BAP1del subtype has an inflammatory tumor microenvironment. Our results suggest
that BAP1del tumors might be prioritized for immune checkpoint blockade therapies. Thus BAP1 may
be both a prognostic and predictive biomarker for PeM enabling better disease classification and patient
treatment.
Structural alterations in PeM tumors were found to be highly heterogeneous, and occur at a lower rate
as compared to most other adult solid cancers. The majority of SNVs and CNAs were typically unique
to a patient. However, many of these alterations were non-randomly distributed to critical carcinogenic
pathways. We observed many alterations in genes involved in chromatin remodeling, SWI/SNF complex,
cell cycle and DNA repair pathway. SWI/SNF complex is an ATP-dependent chromatin remodeling
complex known to harbor aberrations in almost one-fifth of all human cancers [84]. Our results show that
the SWI/SNF complex is differentially expressed between PeM subtypes, which further regulates oncogenic
and tumor suppressive pathways. Notably, we also identified another chromatin remodeling complex, the
HDAC complex, which is differentially expressed between PeM subtypes. HDAC, known to be regulated
by BAP1, is a potential therapeutic target for the BAP1del PeM subtype. Recent in-vitro experiments
demonstrated BAP1 loss altered sensitivity of PM as well as uveal melanoma (UM) cells to HDAC
inhibition [95, 146].
Loss of BAP1 is known to alter chromatin architecture exposing the DNA to damage, and also im-
pairing the DNA-repair machinery [81, 210]. Similar to BRCA1/2 deficient breast and ovarian cancers,
BAP1 deficient PeM tumors most likely depend on PARP1 for survival. This rationale can be utilized to
test PARP inhibitors in BAP1del PeM subtype. The DNA repair defects thus drive genomic instability and
dysregulate tumor microenvironment [121]. DNA repair deficiency leads to the increased secretion of cy-
tokines, including interferons that promote tumor-antigen presentation, and trigger recruitment of both T
and B lymphocytes to destroy tumor cells. As a response, tumor cells evade this immune-surveillance by
increased expression of immune checkpoint receptors. The results presented here also indicate that PeM
tumors are infiltrated with immune-cells from the tumor microenvironment. Moreover, the BAP1del sub-
type displays elevated levels of immune checkpoint receptor expression which strongly suggests the use
of immune checkpoint inhibitors to treat this subtype of PeM. However, in a small subset of PM tumors
in the TCGA dataset, the loss of BAP1 did not elevate expression of immune checkpoint marker genes. This
warrants further investigation into the characteristics of these groups of PM tumors. Furthermore, recently,
BAP1 loss has been defined as a distinct molecular subtype of clear cell renal cell carcinoma (ccRCC)
and UM [33, 132, 142]. These studies showed that, similar to BAP1del PeM subtype, BAP1del tumors
from both ccRCC and UM also have dysregulated chromatin modifiers, impaired DNA repair pathway,
and immune checkpoint receptor activation. More recent studies in ccRCC [116] and melanoma [127]
demonstrated that inactivation of PBRM1 (or PBAF complex) predicts response to immune checkpoint
blocking therapies. Similarly, DNA repair defects have also been shown to be predictive of response to
immune checkpoint blocking therapies [60, 97, 98]. This strongly indicates a pan-cancer mechanism of
oncogenesis shared among tumors with BAP1 copy-number loss.
The main challenge in mesothelioma treatment is that all current efforts made towards testing new
therapy options are limited to using therapies that have been proven successful in other cancer types,
without a good knowledge of underlying molecular mechanisms of the disease. As a result of sheer
desperation, some patients have been treated even though no targeted therapy for mesothelioma has
been proven effective as yet. For example, a number of clinical trials exploring the use of immune
checkpoint inhibitors (anti-PD1/PD-L1 or anti-CTLA4) in PM and/or PeM patients that progressed under
chemotherapy and are positive for immune checkpoint markers are currently in progress. The results
of the first few clinical trials report either very low response rate or no benefit to the patients [9, 25, 26,
110]. Notably, BAP1 copy-number or mutation status was not assessed in these studies. We believe that
response rates for immune checkpoint blockade therapies in clinical trials for PeM will improve when
patients are segregated by their BAP1 copy-number status.
4.5 Methods
4.5.1 Clinical samples and pathology evaluation
Primary untreated PeM tumors and matched benign samples were obtained from cancer patients under-
going cytoreductive surgeries following protocols approved by the Clinical Research Ethics Board of the
Vancouver General Hospital (Vancouver, BC, Canada), Mount Sinai Hospital (Toronto, ON, Canada),
and Moores Cancer Centre (San Diego, CA, USA). This study was approved by the Institutional Review
Board of the University of British Columbia and Vancouver Coastal Health (REB No. H15-00902 and
V15-00902). All patients signed a formal consent form approved by the respective institutional ethics
board. Histologic parameters and pathological scoring of tumors confirming PeM were established by
three independent pathologists. H&E and immunostained Formalin-Fixed Paraffin-Embedded (FFPE)
slides were reviewed by at least two specialized pathologists to diagnose PeM and its subtype. Hema-
toxylin and eosin (H&E) staining was used to determine the highest tumor cellularity (≥ 75%) from
sections for sequencing. The surgical resections were snap frozen and processed at respective institu-
tions. The tumors have a companion normal tissue specimen (either adjacent normal tissue or peripheral
blood previously extracted for germline DNA control). Each tumor specimen was approximately 1 cm³ in
size and weighed between 100-300 mg. Specimens were shipped overnight on dry ice that maintained an
average temperature of less than -80°C. Upon receipt, the tissues were sectioned into 5 slices for DNA,
RNA, and protein extraction as well as construction of TMA.
4.5.2 Construction of tissue microarrays (TMAs)
FFPE tissue blocks were retrieved from the archives of the Department of Pathology, Vancouver General
Hospital (Vancouver, Canada). H&E stained slides from each block were reviewed by two pathologists
to identify tumor areas. TMAs were constructed with 1 mm diameter tissue cores from representative
tumor areas from FFPE blocks. Cores were transferred to a paraffin block using a semi-automated tissue
array instrument (Pathology Devices TMArrayer, San Diego, CA). Duplicate tissue cores were taken
from each specimen, resulting in a composite TMA block. Reactive mesothelial tissues from pleura
were also included as benign controls. Following construction, 4µm thick sections were cut for H&E
and immunohistochemical staining.
4.5.3 Immunohistochemistry and Histopathology
Freshly cut TMA sections were analyzed for immunoexpression using Ventana Discovery Ultra
autostainer (Ventana Medical Systems, Tucson, Arizona). In brief, tissue sections were incubated in
Tris-EDTA buffer (CC1) at 37°C to retrieve antigenicity, followed by incubation with respective primary
antibodies at room temperature or 37°C for 60-120 min. For primary antibodies, mouse monoclonal
antibodies against CD8 (Leica, NCL-L-CD8-4B11, 1:100), CK5/Cytokeratin 5 (Abcam, ab17130, 1:100), BAP1
(SantaCruz, clone C4, sc-28383, 1:50), rabbit monoclonal antibody against CD3 (Abcam, ab16669,
1:100), and rabbit polyclonal antibodies against CALB2/Calretinin (LifeSpan BioSciences, LS-B4220,
1:20 dilution) were used. Bound primary antibodies were incubated with Ventana Ultra HRP kit or Ven-
tana universal secondary antibody and visualized using Ventana ChromoMap or DAB Map detection kit,
respectively. All stained slides were digitalized with the SL801 autoloader and Leica SCN400 scanning
system (Leica Microsystems; Concord, Ontario, Canada) at magnification equivalent to x20. The im-
ages were subsequently stored in the SlidePath digital imaging hub (DIH; Leica Microsystems) of the
Vancouver Prostate Centre. Representative tissue cores were manually identified by two pathologists.
4.5.4 Whole exome sequencing
DNA was isolated from snap-frozen tumors with 0.2 mg/mL Proteinase K (Roche) in cell lysis solution
using Wizard Genomic DNA Purification Kit (Promega Corporation, USA). Digestion was carried out
overnight at 55°C before incubation with RNase solution at 37°C for 30 minutes and treatment with
protein precipitation solution followed by isopropanol precipitation of the DNA. The amount of DNA was
quantified on the NanoDrop 1000 Spectrophotometer, and an additional quality check was done by reviewing
the 260/280 ratios. A further quality check was done on the extracted DNA by running the samples on a 0.8%
agarose/TBE gel with ethidium bromide.
For Ion AmpliSeq™ Exome Sequencing, 100 ng of DNA based on Qubit® dsDNA HS Assay (Thermo
Fisher Scientific) quantitation was used as input for Ion AmpliSeq™ Exome RDY Library Preparation.
This is a Polymerase Chain Reaction (PCR) based sequencing approach using 294,000 primer pairs
(amplicon size range 225-275 bp), and covers >97% of Consensus CDS (CCDS; Release 12), >19,000
coding genes and >198,000 coding exons. Libraries were prepared, quantified by Quantitative
Polymerase Chain Reaction (QPCR) and sequenced according to the manufacturer’s instructions (Thermo
Fisher Scientific). Samples were sequenced on the Ion Proton System using the Ion PI™ Hi-Q™
Sequencing 200 Kit and Ion PI™ v3 chip. Two libraries were run per chip for a projected coverage of
40M reads per sample.
4.5.5 Somatic variant calling
Torrent Server (Thermo Fisher Scientific) was used for signal processing, base calling, read alignment,
and generation of results files. Specifically, following sequencing, reads were mapped against the hu-
man reference genome hg19 using Torrent Mapping Alignment Program. The mean target coverage
ranges from 78.62 to 226.44, thus sequencing depth ranges from 78 to 226X. Variants were identified
by using Torrent Variant Caller plugin with the optimized parameters for AmpliSeq exome-sequencing
recommended by Thermo Fisher. The Variant Call Format (VCF) files from all samples were
combined using GATK (3.2-2) [47] and all variants were annotated using ANNOVAR [197]. Only non-silent
exonic variants including non-synonymous SNVs, stop-codon gain SNVs, stop-codon loss SNVs, splice
site SNVs and In-Dels in coding regions were kept if they were supported by more than 10 reads and
had allele frequency higher than 10%. To obtain somatic variants, we filtered against dbSNP build 138
(non-flagged only) and the matched adjacent benign or blood samples sequenced in this study. Puta-
tive variants were manually scrutinized on the Binary Alignment Map (BAM) files through Integrative
Genomics Viewer (IGV) version 2.3.25 [179].
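The filtering rules above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (more than 10 supporting reads, allele frequency above 10%, non-silent effect, absent from dbSNP and from the matched normal), not the pipeline that was actually run. The `Variant` record, the effect labels, and the variant key format are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical effect labels standing in for the annotation categories kept in the text.
NON_SILENT = {"nonsynonymous_SNV", "stopgain", "stoploss", "splicing", "indel"}


@dataclass(frozen=True)
class Variant:
    key: str                 # hypothetical identifier, e.g. "chr:pos:ref>alt"
    effect: str              # annotated effect label
    supporting_reads: int    # reads supporting the alternate allele
    allele_frequency: float  # variant allele frequency in [0, 1]


def is_candidate_somatic(v, dbsnp_keys, matched_normal_keys):
    """Apply the read-support, allele-frequency and germline filters described above."""
    return (
        v.effect in NON_SILENT
        and v.supporting_reads > 10
        and v.allele_frequency > 0.10
        and v.key not in dbsnp_keys           # filtered against dbSNP
        and v.key not in matched_normal_keys  # filtered against matched benign/blood sample
    )
```

Variants passing this check would still need the manual inspection in IGV that the text describes.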
4.5.6 Copy number aberration (CNA) calls
Copy number changes were assessed using Nexus Copy Number Discovery Edition Version 8.0 (BioDis-
covery, Inc., El Segundo, CA). Nexus NGS functionality (BAM ng CGH) with the FASST2 Segmentation
algorithm was used to make copy number calls (a Circular Binary Segmentation/Hidden Markov Model
approach). The significance threshold for segmentation was set at 5 × 10⁻⁶, also requiring a minimum of
3 probes per segment and a maximum probe spacing of 1000 between adjacent probes before breaking a
segment. The log ratio thresholds for single copy gain and single copy loss were set at +0.2 and −0.2,
respectively. The log ratio thresholds for gain of 2 or more copies and for a homozygous loss were set
at +0.6 and −1.0, respectively. Tumor sample BAM files were processed with corresponding normal
tissue BAM files. Reference reads per CN point (window size) was set at 8000. Probes were normalized
to median. Relative copy number profiles from exome sequencing data were determined by normalizing
tumor exome coverage to values from whole blood controls. The germline exome sequences were used to
obtain allele-specific copy number profiles and to generate segmented copy number profiles. The GISTIC
module on Nexus identifies significantly amplified or deleted regions across the genome. The amplitude
of each aberration is assigned a G-score as well as a frequency of occurrence for multiple samples. False
Discovery Rate q-values for the aberrant regions have a threshold of 0.15. For each significant region, a
“peak region” is identified, which is the part of the aberrant region with greatest amplitude and frequency
of alteration. In addition, a “wide peak” is determined using a leave-one-out algorithm to allow for er-
rors in the boundaries in a single sample. The “wide peak” boundaries are more robust for identifying
the most likely gene targets in the region. Each significantly aberrant region is also tested to determine
whether it results primarily from broad events (longer than half a chromosome arm), focal events, or
significant levels of both. The GISTIC module reports the genomic locations and calculated q-values for
the aberrant regions. It identifies the samples that exhibit each significant amplification or deletion, and
it lists genes found in each “wide peak” region.
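As a small illustration of the log-ratio thresholds given in this section (+0.2 and −0.2 for single-copy gain and loss, +0.6 and −1.0 for gain of two or more copies and homozygous loss), a segment's mean log ratio could be mapped to a call as follows. The function name and the category labels are hypothetical, and treating the thresholds as inclusive is an assumption; Nexus may handle the boundary values differently.

```python
def classify_log_ratio(log_ratio):
    """Map a segment mean log ratio to a copy-number call using the stated thresholds."""
    if log_ratio >= 0.6:
        return "high_gain"        # gain of 2 or more copies
    if log_ratio >= 0.2:
        return "gain"             # single copy gain
    if log_ratio <= -1.0:
        return "homozygous_loss"
    if log_ratio <= -0.2:
        return "loss"             # single copy loss
    return "neutral"
```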
4.5.7 Transcriptome sequencing (RNA-seq)
Total RNA from 100µm sections of snap-frozen tissue was isolated using the mirVana Isolation Kit from
Ambion (AM-1560). Strand specific RNA sequencing was performed on quality controlled high RIN
value (>7) RNA samples (Bioanalyzer Agilent Technologies) before processing at the high throughput
sequencing facility core at BGI Genomics Co., Ltd. (The Children’s Hospital of Philadelphia, Penn-
sylvania, USA). In brief, 200ng of total RNA was first treated to remove the ribosomal RNA (rRNA)
and then purified using the Agencourt RNA Clean XP Kit (Beckman Coulter) prior to analysis with the
Agilent RNA 6000 Pico Chip to confirm rRNA removal. Next, the rRNA-depleted RNA was fragmented
and converted to cDNA. Subsequent steps include end repair, addition of an ‘A’ overhang at the 3’ end,
ligation of the indexing-specific adaptor, followed by purification with Agencourt Ampure XP beads.
The strand specific RNA library prepared using TruSeq (Illumina Catalogue No. RS-122-2201) was
amplified and purified with Ampure XP beads. Size and yield of the barcoded libraries were assessed
on the LabChip GX (Caliper), with an expected distribution around 260 base pairs. Concentration of
each library was measured with real-time PCR. Pools of indexed library were then prepared for cluster
generation and PE100 sequencing on Illumina HiSeq 4000.
4.5.8 Transcriptome (RNA-seq) quantification
Using splice-aware aligner STAR (2.3.1z) [50], RNA-seq reads ( 200MB in size) were aligned onto
the human genome reference (GRCh38) and exon-exon junctions, according to the known gene model
annotation from the Ensembl release 80 (http://www.ensembl.org). Apart from protein coding genes,
non-coding RNA types and pseudogenes are further annotated and classified. Based on the alignment
and by using gene annotation (Ensembl release 80), gene expression profiles were calculated. Only reads
unique to one gene and which corresponded exactly to one gene structure, were assigned to the corre-
sponding genes by using the python tool HTSeq [11]. Normalization of read counts was conducted by R
package DESeq [10], which was designed for gene expression analysis of RNA-seq data across different
samples.
4.5.9 Identification of fusion transcripts and validation
We used the deFuse algorithm [114] to predict rearrangements in RNA sequence libraries. The deFuse
fusion transcript prediction calls were further filtered using the following criteria. A fusion gene candidate
(1) must be predicted to have arisen from genome rearrangement, rather than via a readthrough event; (2)
must be predicted in no more than two sequence libraries; (3) must map unambiguously on both sides
of the predicted breakpoints (that is, no multi-mapping reads); (4) must not map entirely to repetitive
elements; (5) must be detected in >5 reads (either split or spanning); and (6) must have at least one of the
fusion partner transcripts expressed.
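The six criteria above amount to a conjunction of simple predicates on each predicted fusion. A minimal sketch follows, assuming each deFuse call has already been summarized into the fields shown; the `FusionCall` record and its field names are hypothetical and do not correspond to deFuse output columns.

```python
from dataclasses import dataclass


@dataclass
class FusionCall:
    name: str
    is_readthrough: bool       # criterion 1: must be False
    n_libraries: int           # criterion 2: number of libraries with this prediction
    multi_mapping: bool        # criterion 3: ambiguous mapping at either breakpoint
    entirely_repetitive: bool  # criterion 4: maps entirely to repetitive elements
    supporting_reads: int      # criterion 5: split or spanning reads
    partner_expressed: bool    # criterion 6: at least one partner transcript expressed


def passes_filters(f):
    """Return True only if the call satisfies all six criteria."""
    return (
        not f.is_readthrough
        and f.n_libraries <= 2
        and not f.multi_mapping
        and not f.entirely_repetitive
        and f.supporting_reads > 5
        and f.partner_expressed
    )
```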
Prioritized putative gene fusions were verified by designing PCR primers around the predicted fusion
sites. Specifically, Reverse Transcription PCR (RT-PCR) was used to amplify the predicted fusion gene
junctions from the same starting RNA material (100ng) as was used for RNA-seq. Two primers (20-22
bp nucleotides) spanning the exon boundary of fused genes were designed using Primer3 (v. 0.4.0) [186].
PCR was performed in 20 µl reactions using Q5 buffer (NEB), 0.2 mM dNTPs, 0.4 µM each primer, 0.12
units Q5 High-Fidelity DNA Polymerase (NEB) and 2 µl of the RT reaction. The PCR reaction was
carried out with the following program: 95°C for 30 seconds, followed by 30 cycles of 95°C for 10 seconds,
57°C for 20 seconds and 72°C for 10 seconds. Resulting PCR products, ranging in size from 150bp
to 250bp, were purified using AMPure beads (Agencourt) and sequenced using Sanger sequencing to
verify fusion junctions.
4.5.10 Proteomics analysis using mass spectrometry
Fresh frozen samples dissected from tumor and adjacent normal were individually lysed in 50mM of
HEPES pH 8.5, 1% SDS, and the chromatin content was degraded with benzonase. The tumor lysates
were sonicated (Bioruptor Pico, Diagenode, New Jersey, USA), and disulfide bonds were reduced with
DTT and capped with iodoacetamide. Proteins were cleaned up using the SP3 method [78, 79] (Single
Pot, Solid Phase, Sample Prep), then digested overnight with trypsin in HEPES pH 8. Peptide concentration
was determined by Nanodrop (Thermo) and adjusted to an equal level. A pooled internal standard control
was generated comprising equal volumes of every sample (10 µl of each of the 100 µl total digests)
and split into 3 equal aliquots. The labeling reactions were run as three TMT 10-plex panels (9+IS), then
desalted and each panel divided into 48 fractions by reverse phase HPLC at pH 10 with an Agilent 1100
LC system. The 48 fractions were concatenated into 12 superfractions per panel by pooling every 4th
fraction eluted, resulting in a total of 36 samples.
These samples were analyzed with an Orbitrap Fusion Tribrid Mass Spectrometer (Thermo Fisher
Scientific) coupled to EasyNanoLC 1000 using a data-dependent method with synchronous precursor
selection MS3 scanning for TMT tags. A short description follows; a more detailed overview is in [79].
Briefly, an in house packed reverse phase column run with a 2 hour low pH acetonitrile gradient (5-40%
with 0.1% formic acid) was used to separate and introduce peptides into the MS. Survey scans covering
m/z 350-1500 were acquired in profile mode at a resolution of 120,000 (at m/z 200) with S-Lens RF
Level of 60%, a maximum fill time of 50 milliseconds, and Automatic Gain Control (AGC) target of
4 × 10⁵. For MS2, monoisotopic precursor selection was enabled with triggering charge state limited to
2-5, threshold 5 × 10³ and 10 ppm dynamic exclusion for 60 seconds. Centroided MS2 scans were acquired
in the ion trap in Rapid mode after CID fragmentation with a maximum fill time of 20 milliseconds and
1 m/z quadrupole isolation window, collision energy of 30%, activation Q of 0.25, injection for
all available parallelizable time turned ON, and an AGC target value of 1 × 10⁴. For MS3, fragment ions
were isolated from a 400-1200 m/z precursor range, ion exclusion of 20 m/z low and 5 m/z high, isobaric
tag loss exclusion for TMT, with a top 10 precursor selection. Acquisition was in profile mode with the
Orbitrap after HCD fragmentation (NCE 60%) with a maximum fill time of 90 milliseconds, 50,000 m/z
resolution, 120-750 m/z scan range, an AGC target value of 1 × 10⁵, and injection for all available parallelizable time turned ON.
The total allowable cycle time was set to 4 seconds.
4.5.11 Peptide identification and protein quantification
Qualitative and quantitative proteomics analysis was done using ProteomeDiscoverer 2.1.1.21 (Thermo
Fisher Scientific). To maintain consistency with transcriptome annotation, we used Ensembl GRCh38.87
human reference proteome sequence database for proteome annotation. Sequest HT 1.3 was used for
Peptide Spectral Matches (PSM), with parameters specified as trypsin enzyme, two missed cleavages
allowed, minimum peptide length of 6, precursor mass tolerance 10 ppm, and a fragment mass toler-
ance of 0.6 Da. We allowed up to 4 variable modifications per peptide from the following categories:
acetylation at protein terminus, methionine oxidation, and TMT label at N-terminal residues and the side
chains of lysine residues. In addition, carbamidomethylation of cysteine was set as a fixed modification.
PSM results were filtered using q-value cut off of 0.05 to control for FDR determined by Percolator.
Identified peptides from both high and medium-confidence level after FDR-filtering were included in
the final stage to provide protein identification and quantification results. Reporter ions from MS3 scans
were quantified with an integration tolerance of 20ppm with the most confident centroid. Proteins were
further filtered to include only those found with minimum one peak in all samples. Proteome Discoverer
processed data was exported for further statistical analysis.
4.5.12 Mutational signature analysis
We used deconstructSigs [143], a multiple regression approach, to statistically quantify the contribution
of each mutational signature to each tumor. The mutational signatures were obtained from the COSMIC
mutational signature database [8]. Both silent and non-silent somatic mutations were used together
to obtain the mutational signatures. Only mutational signatures with a weight more than 0.06 were
considered for analysis.
4.5.13 Prioritization of driver genes using HIT’nDRIVE
Non-silent somatic mutation calls, CNA gain or loss, and gene-fusion calls were collapsed into a gene-patient
alteration matrix with binary labels. Gene-expression values were used to derive an expression-outlier
gene-patient matrix using the GESD test. The STRING ver10 [167] protein-interaction network was used to
compute pairwise influence values between the nodes in the interaction network. We integrated these
genome and transcriptome data using the HIT’nDRIVE algorithm [155]. The following parameters were used:
α=0.9, β=0.6, and γ=0.8. We used IBM-CPLEX as the ILP solver.
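To make the first step concrete, the sketch below collapses pooled alteration calls into a binary gene-by-patient matrix. It illustrates the data structure only; the outlier detection, influence computation and ILP steps of HIT’nDRIVE are not shown. The function name and the `(gene, patient)` input format are assumptions made for the example.

```python
def build_alteration_matrix(calls):
    """Collapse (gene, patient) alteration calls into a binary matrix.

    `calls` pools non-silent mutations, CNA gains/losses and fusions; a gene
    altered by several event types in one patient still yields a single 1.
    """
    calls = list(calls)
    genes = sorted({gene for gene, _ in calls})
    patients = sorted({patient for _, patient in calls})
    altered = set(calls)
    matrix = [
        [1 if (gene, patient) in altered else 0 for patient in patients]
        for gene in genes
    ]
    return genes, patients, matrix
```

For example, calls of BAP1 in two patients and PBRM1 in one of them would give a 2 × 2 matrix with three ones and one zero.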
4.5.14 Consensus clustering
We used ConsensusClusterPlus [199] R-package to perform consensus clustering. We used the following
parameters: maximum cluster number to evaluate: 10, number of subsamples: 10000, proportion of items
to sample: 0.8, proportion of features to sample: 1, cluster algorithm: hierarchical, distance: pearson.
4.5.15 Protein attenuation analysis
For every gene/protein profiled for CNA (segment mean), RNA-seq (normalized log2 expression), and
MS (normalized log2 expression), we performed the following analysis. For every gene/protein, the
Pearson correlation coefficients were calculated for CNA-mRNA expression ($R_{\mathrm{CNA:mRNA}}$) and
CNA-protein expression ($R_{\mathrm{CNA:protein}}$). The 75th percentile of the difference between the above two correlation
coefficients, i.e. $R_{\mathrm{diff}} = R_{\mathrm{CNA:mRNA}} - R_{\mathrm{CNA:protein}}$, was found to be approximately 0.45. Therefore, those
proteins with $R_{\mathrm{diff}} \geq 0.45$ were considered as attenuated proteins.
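The attenuation rule can be sketched as follows: for each gene, compute the Pearson correlation of CNA with mRNA and of CNA with protein across samples, take the difference, and flag genes whose difference reaches 0.45. The dictionary inputs (gene name mapped to per-sample values in a shared sample order) and the function names are assumptions for illustration, and the sketch assumes no profile is constant across samples.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def attenuated_proteins(cna, mrna, protein, threshold=0.45):
    """Return {gene: R_diff} for genes with R_CNA:mRNA - R_CNA:protein >= threshold."""
    result = {}
    for gene in cna:
        r_diff = pearson(cna[gene], mrna[gene]) - pearson(cna[gene], protein[gene])
        if r_diff >= threshold:
            result[gene] = r_diff
    return result
```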
4.5.16 Pathway enrichment analysis
The selected set of genes were tested for enrichment against gene sets of pathways present in Molecular
Signature Database (MSigDB) v6.0 [162]. A hypergeometric test based gene set enrichment analysis
was used for this purpose (https://github.com/raunakms/GSEAFisher). A cut-off threshold of FDR <
0.01 was used to obtain the significantly enriched pathways. Only pathways that are enriched with at
least three differentially expressed genes were considered for further analysis. To calculate the pathway
activity score, the expression dataset was transformed into standard normal distribution using ‘inverse
normal transformation’ method. This step is necessary for fair comparison between the expression-
values of different genes. For each sample, the pathway activity score is the mean expression level of the
differentially expressed genes linked to the enriched pathway.
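For readers unfamiliar with the test, a hypergeometric enrichment p-value is the probability of seeing at least the observed overlap between the selected genes and a pathway by chance. The sketch below is a generic standard-library illustration of that calculation and is not taken from the GSEAFisher code; the function and parameter names are hypothetical, and multiple-testing correction is not shown.

```python
from math import comb


def hypergeom_pvalue(universe, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    favourable = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, selected)
```

With a universe of 10 genes, a 3-gene pathway and 3 selected genes that all fall in the pathway, the p-value is 1/120.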
4.5.17 Stromal and immune score
We used two sets of 141 genes (one each for stromal and immune gene signatures) as described in [206].
We used ‘inverse normal transformation’ method to transform the distribution of expression data into the
standard normal distribution. The stromal and immune scores were calculated, for each sample, using
the summation of standard normal deviates of each gene in the given set.
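The scoring step can be illustrated with a rank-based inverse normal transformation followed by a per-sample sum over the signature genes. The text does not state which variant of the transformation was used, so the `(rank - 0.5) / n` offset, the per-gene transformation across samples, and the arbitrary handling of ties are all assumptions of this sketch, as are the function names.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        transformed[index] = NormalDist().inv_cdf((rank - 0.5) / n)
    return transformed


def signature_scores(expression, signature_genes):
    """Sum the standard normal deviates of the signature genes for each sample.

    `expression` maps gene -> list of per-sample values (same sample order).
    Genes missing from `expression` are skipped.
    """
    deviates = [
        inverse_normal_transform(expression[gene])
        for gene in signature_genes
        if gene in expression
    ]
    n_samples = len(deviates[0])
    return [sum(d[s] for d in deviates) for s in range(n_samples)]
```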
4.5.18 Enumeration of tissue-resident immune cell types using mRNA expression profiles
CIBERSORT algorithm [124] was applied to the RNA-seq gene-expression data to estimate the propor-
tions of 22 immune cell types (B cells naive, B cells memory, Plasma cells, T cells CD8, T cells CD4
naive, T cells CD4 memory resting, T cells CD4 memory activated, T cells follicular helper, T cells
gamma delta, T cells regulatory (Tregs), NK cells resting, NK cells activated, Monocytes, Macrophages
M0, Macrophages M1, Macrophages M2, Dendritic cells resting, Dendritic cells activated, Mast cells
resting, Mast cells activated, Eosinophils, and Neutrophils) using LM22 dataset provided by CIBER-
SORT platform. Genes not expressed in any of the PeM tumor samples were removed from the LM22
dataset. The analysis was performed using 1000 permutations. The 22 immune cell types were later
aggregated into 11 distinct groups.
4.5.19 External datasets
TCGA datasets for 16 different cancer-types used in this study were downloaded from the National
Cancer Institute-Genomic Data Commons (NCI-GDC; https://portal.gdc.cancer.gov/) in February 2017.
For somatic mutation data, non-silent variant calls that were identified by at least three out of four
different tools (MuSE, MuTect2, SomaticSniper and VarScan2) were considered. CNA segmented data
were further processed using Nexus Copy Number Discovery Edition Version 9.0 (BioDiscovery, Inc.,
El Segundo, CA) to identify aberrant regions in the genome. In case of the RNA-seq expression data,
HTSeq-FPKM-UQ normalized data were used.
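The "at least three out of four tools" rule for the TCGA mutation calls can be expressed as a simple vote count. In this sketch the input format (tool name mapped to a collection of variant identifiers) and the function name are assumptions; the GDC files would first need to be parsed into that form.

```python
from collections import Counter


def consensus_variants(calls_by_tool, min_tools=3):
    """Keep variants reported by at least `min_tools` of the callers."""
    votes = Counter()
    for variants in calls_by_tool.values():
        votes.update(set(variants))  # one vote per tool, even if a variant is listed twice
    return {variant for variant, n in votes.items() if n >= min_tools}
```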
Figure 4.1: Landscape of somatic mutations in PeM tumors. (A) Comparison of somatic mutation rate in protein-coding regions of PeM with different adult cancers obtained from TCGA. (B) Mutational signature present in PeM (top panel). Proportional contribution of different COSMIC mutational signatures per tumor sample. (C) Somatic alterations identified in PeM tumors grouped by important cancer-pathways. LUSC: Lung Squamous Cell Carcinoma, LUAD: Lung adenocarcinoma, BLCA: Urothelial Bladder Carcinoma, COAD: Colorectal carcinoma, UCEC: Uterine Corpus Endometrial Carcinoma, OV: Ovarian cancer, KIRP: Kidney renal papillary cell carcinoma, KIRC: Kidney Renal Clear Cell Carcinoma, UCS: Uterine Carcinosarcoma, GBM: Glioblastoma Multiforme, BRCA: Breast Invasive Carcinoma, MESO-PM: Malignant Pleural Mesothelioma, MESO-PeM: Malignant Peritoneal Mesothelioma, PAAD: Pancreatic Adenocarcinoma, PRAD: Prostate Adenocarcinoma, KICH: Kidney Chromophobe, TGCT: Testicular Germ Cell Tumor.
Figure 4.2: Landscape of copy number aberrations in PeM tumors. (A) Aggregate copy-number alterations by chromosome regions in PeM tumors. Important genes with copy-number changes are highlighted. (B) Sample-wise view of copy-number alterations in PeM tumors. (C) Comparison of copy-number burden (considering protein-coding regions only) in PeM with respect to other adult cancers obtained from TCGA. (D) Highly aberrant genomic regions in PeM prioritized by GISTIC. (E) mRNA expression pattern of BAP1 across all PeM samples. The Wilcoxon signed-rank test p-value for BAP1 mRNA expression compared between the PeM subtypes is indicated in the box. (F) Detection of BAP1 nuclear protein expression in PeM tumors by immunohistochemistry (Photomicrographs magnification - 20x). (G) Unsupervised consensus clustering of tumor samples based on copy-number segmentation mean values of the 3349 most variable genes.
Figure 4.3: Gene fusions in PeM. (A-B) Circos plot showing the gene fusion events identified in PeM tumors. (A) BAP1intact subtype (B) BAP1del subtype. (C-F) Few selected gene fusion events identified in PeM tumors. The top and middle panel shows the chromosome and the transcripts involved in the gene fusion event. The bottom panel shows the RNA-seq read counts detected for the respective transcripts. (C) KANSL1-ARL17B fusion, (D) PBRM1-ADGB fusion, (E) SETD2-CHP1 fusion, and (F) PHF7-PBRM1 fusion. (G-J) The chromatogram showing the Sanger sequencing validation of the fusion-junction point.
Figure 4.4: Transcriptome and proteome profile of PeM. (A-B) Principal component analysis of PeM tumors using (A) transcriptome profiles and (B) proteome profiles. (C) Effects of CNA on transcriptome and proteome. In the scatterplot, each dot represents a gene/protein. The horizontal and vertical axes represent Pearson correlation coefficient between CNA-transcriptome and CNA-proteome respectively. Key cancer genes that undergo protein attenuation have been highlighted. (D) Geneset enrichment analysis of attenuated proteins against gene ontologies (left panel) and Reactome pathways (right panel). (E-G) CORUM core protein complexes regulated by PBRM1 and/or BAP1. The nodes represent individual protein subunits of the respective complex. The node color represents correlation of mRNA expression of respective gene with BAP1. The border color of the node indicates whether the respective protein is attenuated or not. The edge represents interaction between the protein subunits. The edge information was extracted from STRING v10 PPI network. The edge color (and edge thickness) represents correlation of protein expression between the respective interaction partners. (E) SWI/SNF complex B (PBAF), (F) SWI/SNF complex A (BAF), and (G) HDAC complex. (H-I) mRNA and protein expression level differences between PeM subtypes. (H) SWI/SNF complex and (I) HDAC complex. The expression levels are log2 transformed and mean normalized.
Figure 4.5: Immune cell infiltration in PeM tumors. (a-b) Pathways enrichment of top-500 differentially expressed genes between PeM subtypes obtained using (a) mRNA expression and (b) protein expression. (c-d) Correlation between immune score and stromal score derived for each tumor sample using (c) mRNA expression and (d) protein expression. (e) Estimated relative mRNA fractions of leukocytes infiltrated in PeM tumors based on CIBERSORT analysis. (f) CD3 and CD8 immunohistochemistry showing immune cell infiltration on BAP1del PeM tumor (Photomicrographs magnification - 20x). (g) mRNA expression differences in immune checkpoint receptors between PeM subtypes. The bar plot on the right represents negative log10 of Wilcoxon signed-rank test p-value computed between PeM subtypes. (h) Correlation between immune score and mRNA expression of immune checkpoint receptors. The expression levels are log2 transformed and mean normalized.
Chapter 5
Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks
5.1 Introduction
Recent large scale pan-cancer sequencing projects have revealed a multitude of somatic genomic,
transcriptomic, proteomic and epigenomic alterations across cancer types. However, a tumor is likely driven
by a select few alterations that provide an evolutionary advantage to the tumor, hence called “driver”
alterations [195]. Distinguishing driver alterations from functionally inconsequential random “passenger”
alterations is critical for therapeutic development and cancer treatment.
It is well established that, except for a few cases, cancers are often driven by multiple driver genes [12,
155]. Whereas the emergence of alterations is likely a consequence of endogenous or exogenous mutagen
exposures [7], their evolutionary selection depends on the functional role of the affected genes [195] and
synergistic combinations of different alterations. For example, the TMPRSS2-ERG gene fusion is considered
an early driver event in almost half of prostate cancer cases, and it often co-exists with copy-number
deletions of PTEN as well as NKX3-1 to drive cancer progression [31, 90, 93]. Recently, concomitant
deletion of four cancer genes (BAP1, SETD2, PBRM1, and SMARCC1) in chromosome locus 3p21 has
been identified as a driver event in a fraction of clear cell renal cell carcinoma (ccRCC) [33], uveal
melanoma [142], and mesotheliomas [208]. These genes are involved in the chromatin remodeling process,
and their loss further impairs the DNA damage repair pathway in the aberrant tumors [142].
Co-occurring alterations might be evolutionarily selected because an alteration in one gene might
enhance the deleterious effect of the other [28]. Such co-selected genes are often part of a functionally
interacting driver subnetwork (or pathway); they are observed together in the same tumor and define
its phenotype. In fact, as demonstrated by the pan-cancer and other large scale sequencing efforts,
co-occurring genomic and transcriptomic alterations in specific tumor types are commonly shared across a
large fraction of patients. Thus, efficient computational methods that can identify large subsets of
functionally interacting (genomic or transcriptomic) alterations, highly conserved across specific tumor types,
are in high demand.
5.2 Literature Review
Recently, a number of computational methods have been developed to identify recurrent genomic (as well
as transcriptomic) alteration patterns across tumor samples. Some of these methods have been designed
to identify multiple gene alterations simultaneously, based on their co-occurrence or mutual exclusivity
relationships in a tumor cohort, without any reference to a molecular interaction network [45, 89, 118].
Other approaches have been developed with the aim of identifying a specific subnetwork within a molec-
ular interaction network, either through (i) a combinatorial formulation, with the goal of maximizing the
total weight of the subnetwork in a molecular interaction network with node (and possibly edge) weights
[53, 108], or (ii) a network diffusion process to derive specific mutated pathways [102, 189]. A direction
particularly relevant to our study is motivated by [6, 88, 185, 189], and explored by Bomersbach et al.
[18], who proposed an alternative formulation for finding a subnetwork of a given size k with the goal
of maximizing h, the number of samples for which at least one gene of the subnetwork is in an altered
state. (A similar formulation where the goal is to maximize a weighted difference of k and h, for varying
size k, can be found in [76].) Although the above combinatorial problems are typically NP-hard, they
have become manageable through the use of state-of-the-art ILP solvers or greedy heuristics, or by the use of
complex preprocessing procedures.
Complementary to the ideas proposed above, there are also several approaches to identify mutually
exclusive (rather than jointly altered) sets of genes and pathways [37, 117, 190]. These approaches utilize
the mutational heterogeneity prevalent in cancer genomes, and are driven by the observation that mutations
acting on the same pathway are often mutually exclusive across tumor samples. Although, from
a methodological point of view, these approaches are very interesting, they are not trivially extendable
to the problem of identifying co-occurring alteration patterns (involving more than two genes) conserved
across many samples.
5.3 Our Contributions
In this chapter, we present a novel computational method, cd-CAP (combinatorial detection of Conserved
Alteration Patterns), that primarily uses an ILP formulation to identify subnetworks of an interaction
network, each with an alteration pattern conserved across (a large subset of) a tumor sample cohort.
Some of the previous methods described above attempt to solve a variant of the problem but do so by
considering only a single network and using binary labeled genes – indicating whether the gene is altered
or not. Unlike these approaches, our method simultaneously identifies more than one subnetwork, and
each gene within each subnetwork has labels specific to the alteration types it harbors. In fact, we allow
a gene to have more than one label, each corresponding to a specific alteration type: somatic mutation,
copy number alteration, or aberrant expression. From this point on we will refer to each distinct alteration
type as a specific “color” of the corresponding node in an interaction network.
The algorithmic framework of cd-CAP consists of two major steps. The first step is an exhaustive
search method (a variant of the a-priori algorithm) that was originally designed for association rule min-
ing [3]. This step computes the set of all “candidate” subnetworks (each with a distinct color assignment)
of size at most k shared among at least t samples (both k and t are user defined parameters). cd-CAP
provides the user the additional options that (i) at least two distinct colors should be present in the col-
oring of a subnetwork, or (ii) each sample network can include up to a fraction δ of nodes whose color
assignment differs from that of the “template”. cd-CAP also gives the user the option to stop at this point and provide
(a) the largest colored subnetwork that appears in at least t samples (we report on some results obtained
with this option), or (b) the colored subnetwork of size k that is shared by the largest number of samples.
Alternatively, the second step solves the Maximum Conserved Subnetwork Coverage problem, which asks to
cover the maximum number of nodes in all samples with at most l colored subnetworks (l is user defined),
obtained in the first step, via ILP.
We have applied cd-CAP - with each of the possible options above, i.e., (i), (ii), (a) and (b) - to
TCGA breast cancer (BRCA), colorectal adenocarcinoma (COAD), and glioblastoma (GBM) datasets,
which collectively include over 1000 tumor samples. cd-CAP identified several connected subnetworks
of interest, each exhibiting specific gene alteration pattern across a large subset of samples.
In particular, cd-CAP results with option (i) demonstrated that many of the largest highly conserved
subnetworks within a tumor type solely consist of genes that have been subject to copy number gain, typ-
ically located on the same chromosomal arm and thus likely a result of a single, large scale amplification.
One of these subnetworks, which cd-CAP observed in about one third of the COAD samples [170], includes 9
genes in chromosomal arm 20q, which corresponds to a known amplification recurrent in colorectal tumors.
Another copy-number gain subnetwork that cd-CAP observed in breast cancer samples corresponds to
a recurrent large scale amplification in chromosome 1 [42]. It is interesting to note that cd-CAP was able
to re-discover these events without specific training.
Several additional subnetworks identified by option (i) solely consist of genes that are aberrantly ex-
pressed. Further analysis with options (ii) and (b) of cd-CAP revealed subnetworks that capture signaling
pathways and processes critical for oncogenesis in a large fraction of tumors. We have also demonstrated
that the subnetworks identified through all three options of cd-CAP are associated with patients’ survival
outcome and hence are clinically important.
In order to assess the statistical significance of subnetworks discovered by cd-CAP - option (a), we
introduce for the first time a model in which likely inter-dependent events, in particular amplification or
deletion of all genes in a single chromosome arm, are considered as a single event. Conventional models
of gene amplification either consider each gene amplification independently [36] (this is the model we
implicitly assume in our combinatorial optimization formulations, giving a lower bound on the true
p-value), or assume each amplification can involve more than one gene (forming a subsequent sequence of
genes) but with the added assumption that the original gene structure is not altered and the duplications
occur in some orthogonal “dimension” [54, 148, 211]. Both models have their assumptions that do not
hold in reality, but inferring evolutionary history of a genome with arbitrary duplications (that convert
one string to another, longer string, by copying arbitrary substrings to arbitrary destinations) is NP-hard
and even hard to approximate [40, 123]. By considering all copy number gain or loss events in the same
chromosomal arm as a single event, we are, for the first time, able to compute an estimate that provides
an empirical upper bound to the statistical significance (p-value) of the subnetworks discovered. (Note
that this is not a true upper bound since a duplication event may involve both arms of a chromosome -
but that would be very rare.) Through this upper bound, together with the lower bound above, we
can sandwich the true p-value and thus the significance of our discovery.
5.4 Algorithmic Framework of cd-CAP
5.4.1 Combinatorial Optimization Formulation
Consider an undirected and node-labeled graph $G = (V, E)$, representing the human gene or protein
interaction network, with n nodes, where each $v_j \in V$ represents a gene and each $e = (v_h, v_j) \in E$
represents an interaction among the genes/proteins. Let us assume that we have m copies of the original
network G, where each copy represents an individual sample $P_i$ in a cohort. In each network
$G_i = (V, E, C_i)$ corresponding to sample $P_i$, each node $v_{i,j}$ (as a copy of $v_j$) is colored with one
or more possible colors to form the set $C_{i,j}$ (i.e. $C_i$ maps $v_{i,j}$ to a possibly empty subset of
colors $C_{i,j}$). Each color represents a distinct type
of alteration harbored by a gene/protein, in particular somatic mutation (single nucleotide alteration or
short indel), copy number gain, copy number loss or significant alteration in expression (which can be
trivially expanded to include genic structural alteration - micro-inversion or duplication, gene fusion,
alternative splicing, methylation alteration, non-coding sequence alteration) observed in the gene and
the protein product. Without loss of generality, $C_{i,j} = \emptyset$ implies none of the possible alteration events are
observed at $v_{i,j}$, and two nodes $v_{i,j}, v_{i',j}$ corresponding to each other in two distinct samples have at least
one matching color if $C_{i,j} \cap C_{i',j} \neq \emptyset$.
The main goal of cd-CAP is to identify conserved patterns of (i.e. identically colored) connected
subnetworks across a subset of sample networks $G_i$. Consider a connected subnetwork $T = (V_T, E_T)$ of
the original interaction network G, where each node $v_j \in V_T$ is assigned exactly one color $c_j$. Such a
colored subnetwork is said to be shared by a collection of sample networks $G_i$ ($i \in I$) if each node of the
subnetwork harbors the same color in every sample network, i.e. $c_j \in \bigcap_{i \in I} C_{i,j}$ for each
$v_j \in V_T$. A colored
node in a sample network is said to be covered by a subnetwork if the subnetwork is shared by the node’s
sample network (Fig. 5.1). Intuitively, a colored subnetwork represents a conserved pattern or a network
motif.
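The definition of a shared colored subnetwork can be made concrete with a minimal Python sketch. The data layout (one dictionary per sample mapping a gene to its color set $C_{i,j}$), the function name `sharing_samples`, and the toy genes `g1`, `g2` are assumptions made for illustration and are not part of cd-CAP; connectivity of the template in the interaction network is not checked here.

```python
from typing import Dict, List, Set

# One dictionary per sample: gene -> set of colors C_{i,j} observed in that sample.
SampleColors = Dict[str, Set[str]]


def sharing_samples(template: Dict[str, str], samples: List[SampleColors]) -> List[int]:
    """Indices of samples in which every template node carries its assigned color c_j."""
    return [
        i for i, colors in enumerate(samples)
        if all(color in colors.get(gene, set()) for gene, color in template.items())
    ]


# Toy cohort of three samples over two hypothetical genes.
samples = [
    {"g1": {"gain", "over"}, "g2": {"over"}},
    {"g1": {"gain"}, "g2": {"over", "mut"}},
    {"g1": {"loss"}, "g2": {"over"}},
]
template = {"g1": "gain", "g2": "over"}   # one color per node of the subnetwork
shared = sharing_samples(template, samples)
depth = len(shared)                       # number of samples sharing the template
```

In this toy cohort the third sample does not share the template because `g1` carries a different color there.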
cd-CAP combinatorially formulates the problem of identifying conserved patterns of subnetworks
as the Maximum Conserved colored Subnetwork Identification problem (MCSI). Here the goal is
to find the largest connected subnetwork S of the interaction network G that occurs in exactly t (a user
specified number) samples $\mathcal{P}$, such that each node in S has the same color in each sample $P_i \in \mathcal{P}$. Note
that this formulation is orthogonal to that used in [18] and [76], where the goal is to maximize the number
of samples that share a fixed size subnetwork. The advantage of formulating the problem as MCSI is that
it naturally admits a generalization of the a-priori algorithm. We also note that our formulation considers
distinct types of mutations (as colors) in the conserved alteration patterns, another key improvement over
the formulations used in [18, 76].
cd-CAP also supports simultaneous identification of multiple conserved subnetworks that are altered
in a large number of samples. In one potential formulation of the problem one may aim to cover all
nodes $v_{i,j}$ in all m input sample networks $G_i$ with the smallest number of subnetworks $T = (V_T, E_T) \in \mathcal{T}$,
each shared by at least one sample network. We refer to this combinatorial optimization problem as the
Minimum Subgraph Cover Problem for (Node) Colored Interaction Networks (MSC-NCI).
One advantage of the MSC-NCI problem is that it is parameter-free. However, in a realistic multi-omics
cancer dataset, the number of genes far exceeds the number of samples represented. Under such
conditions, the solution to the MSC-NCI problem will primarily include subnetworks that are large
connected components that are shared by only one sample network. To account for this situation, we
introduce the following parameters/constraints akin to those for the MCSI formulation: (1) we require that
the nodes in each subnetwork have the same color shared by at least t samples (in the remainder of the
discussion, t is referred to as the depth of a subnetwork); and (2) we require that each subnetwork returned
contains at most k nodes. Note that this variant of the problem is infeasible for certain cohorts (consider
a particular node which has a unique color for a particular sample; clearly requirement (1) cannot be
satisfied if t > 1). Even if there is a feasible solution, the requirement that each subnetwork in $\mathcal{T}$ is of size
at most k makes the problem NP-hard (the reduction is from the problem of determining whether G can
be exactly partitioned into connected subnetworks, each with k nodes [52]). As a result, (3) we introduce
one additional parameter, l, the maximum number of subnetworks (each of size at most k, and which
are color-conserved in at least t samples) with the objective of covering the maximum number of nodes
across all samples. We call the problem of identifying at most l subnetworks of size at most k, whose
colors are conserved across at least t samples, so as to maximize the total number of nodes in all these
samples covered by these subnetworks, the Maximum Conserved Subnetwork Coverage problem (MCSC).
5.4.2 Algorithmic Framework for solving MCSC
We formulate the MCSC problem (as well as the MSC-NCI problem) as an ILP. A straightforward application
of available ILP solvers can only handle relatively small instances of the MSC-NCI problem. This
is because the number of variables and the number of constraints for the MSC-NCI ILP formulation are
$O(n^2 m^2)$ and $O(n^2 m^3)$ respectively, both very large for a typical problem instance. Fortunately, in all
instances of interest, only a limited number of genes are colored in comparison to the total number of
nodes nm. This enables us to apply an exhaustive search method that is designed for association rule
mining [3] to build a list of all candidate subnetworks exactly and efficiently (e.g. in comparison to the
ILP or heuristic solutions in [18, 76]) and then solve the MCSC on the set of candidate subnetworks¹.
Generating Conserved Subnetworks
We generate the complete list of candidate subnetworks with minimum depth t by the use of “anti-
monotone property” [103]: if any subnetwork S has depth < t, then the depth of all of its supergraphs
S′ ⊃ S must be < t. This makes it possible to grow the set S of valid subnetworks comprehensively
but without repetition (as described as “optimal order of enumeration” in [113]) through the following
breadth-first network growth strategy.
1. For every colored node $v_{i,j}$ and each of its colors $c_\ell$, we create a candidate subnetwork of size 1
containing the node with color $c_\ell$. All samples in which the node is colored $c_\ell$ naturally share this
trivial subnetwork.
2. We inductively consider all candidate subnetworks of size s with the goal of growing them to
subnetworks of size s+1 as follows. For a given subnetwork T of size s, consider each neighboring
node u. For each possible color $c'_\ell$ of u, we create a new candidate subnetwork of size s+1 by
extending T with u, with color $c'_\ell$. We maintain this subnetwork for the next inductive step only
if the number of samples sharing this new subnetwork is at least t; otherwise, we discard it.
During the extension of T above, if the new node u does not reduce the number of samples sharing it, T
becomes redundant and is not considered in the ILP formulation.
¹ Note that our exhaustive search method is an extension of the a-priori algorithm with the difference that we require the
candidate subnetworks to maintain connectivity as they grow.
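The breadth-first growth strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cd-CAP implementation: the adjacency dictionary `adj`, the per-sample color dictionaries, and the function name `grow_candidates` are hypothetical, and the redundancy filter applied before the ILP is omitted for brevity.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pattern = FrozenSet[Tuple[str, str]]   # a colored subnetwork: {(gene, color), ...}
SampleColors = Dict[str, Set[str]]     # gene -> colors observed in one sample


def grow_candidates(adj: Dict[str, Set[str]], samples: List[SampleColors],
                    t: int, k: int) -> Dict[Pattern, Set[int]]:
    """Enumerate connected colored subnetworks of size <= k shared by >= t samples."""
    # Size-1 candidates: one per (gene, color), with the samples that carry it.
    level: Dict[Pattern, Set[int]] = {}
    for i, colors in enumerate(samples):
        for gene, cs in colors.items():
            for c in cs:
                level.setdefault(frozenset({(gene, c)}), set()).add(i)
    level = {p: s for p, s in level.items() if len(s) >= t}
    found = dict(level)

    for _ in range(k - 1):
        nxt: Dict[Pattern, Set[int]] = {}
        for pattern, shared in level.items():
            genes = {g for g, _ in pattern}
            # Only neighbors of the current pattern are considered, so connectivity is kept.
            frontier = {u for g in genes for u in adj.get(g, set())} - genes
            for u in frontier:
                # Colors of u seen in at least one sample that already shares the pattern.
                palette = set().union(*(samples[i].get(u, set()) for i in shared))
                for c in palette:
                    grown = pattern | {(u, c)}
                    if grown in nxt:          # already reached from another parent
                        continue
                    keep = {i for i in shared if c in samples[i].get(u, set())}
                    if len(keep) >= t:        # anti-monotone pruning on depth
                        nxt[grown] = keep
        found.update(nxt)
        level = nxt
    return found


# Toy path network a - b - c with three samples.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
samples = [
    {"a": {"gain"}, "b": {"gain"}, "c": {"over"}},
    {"a": {"gain"}, "b": {"gain"}, "c": {"over"}},
    {"a": {"gain"}, "b": {"loss"}},
]
candidates = grow_candidates(adj, samples, t=2, k=3)
largest = max(candidates, key=len)
```

Because a pattern that falls below depth t is never extended, none of its supergraphs is ever generated, which is the pruning that the anti-monotone property permits.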
Solving MCSC
Given the universe $U = \{ v_{i,j} \mid C_{i,j} \neq \emptyset,\ i = 1, \dots, m;\ j = 1, \dots, n \}$, containing all the colored nodes in
all the sample networks, and the collection of all subnetworks
$\mathcal{S} = \{ T_i \mid T_i \text{ is shared by at least } t \text{ samples and contains at most } k \text{ nodes} \}$,
our goal is to identify up to l subnetworks from the collection $\mathcal{S}$ whose union contains the maximum
possible number of elements of the universe U.
After the list of all candidate subnetworks $\mathcal{S}$ is constructed (as described in the previous subsection),
we represent the MCSC problem with the following ILP and solve it using IBM-CPLEX or Gurobi. A
binary variable $C[i, j]$ corresponds to whether colored node $v_{i,j}$ was covered by at least one chosen
subnetwork, and binary variable $X[i]$ corresponds to whether colored candidate subnetwork $T_i$ was one of
those chosen. Let $\mathcal{S}_{i,j}$ represent the set of all subnetworks of $\mathcal{S}$ which contain node $v_{i,j}$ properly colored
in them.
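$$
\begin{aligned}
\text{Maximize} \quad & \sum_{v_{i,j} \in U} C[i,j] \\
\text{s.t.} \quad & \sum_{T_p \in \mathcal{S}_{i,j}} X[p] \ge C[i,j] && (\forall v_{i,j} \in U) \\
& \sum_{T_i \in \mathcal{S}} X[i] \le l
\end{aligned}
$$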
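To make the objective of this ILP concrete, the following Python sketch solves a tiny MCSC instance by brute-force enumeration. It is a stand-in for illustration only: the text solves the ILP with IBM-CPLEX or Gurobi, whereas this sketch simply tries every choice of at most l candidates, which is viable only for toy inputs. The names `covered_nodes` and `best_cover` and the example candidates are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Pattern = FrozenSet[Tuple[str, str]]   # a colored subnetwork: {(gene, color), ...}


def covered_nodes(pattern: Pattern, shared: Set[int]) -> Set[Tuple[int, str]]:
    """Colored nodes v_{i,j} covered by a pattern: one (sample, gene) pair per sharing sample."""
    return {(i, gene) for i in shared for gene, _ in pattern}


def best_cover(candidates: Dict[Pattern, Set[int]], l: int):
    """Exhaustively pick at most l candidates covering the most colored nodes."""
    best, best_cov = (), set()
    for r in range(1, l + 1):
        for choice in combinations(list(candidates), r):
            cov = set().union(*(covered_nodes(p, candidates[p]) for p in choice))
            if len(cov) > len(best_cov):   # same objective as sum of C[i, j]
                best, best_cov = choice, cov
    return best, best_cov


# Hypothetical candidates with the samples that share them.
p1 = frozenset({("a", "gain"), ("b", "gain")})
p2 = frozenset({("b", "gain"), ("c", "over")})
p3 = frozenset({("d", "loss")})
candidates = {p1: {0, 1}, p2: {0, 1}, p3: {2, 3, 4}}
chosen, cov = best_cover(candidates, l=2)
```

Here `p1` and `p2` overlap on gene `b` in samples 0 and 1, so choosing both covers only 6 colored nodes, while pairing either with `p3` covers 7. This overlap is exactly what the coverage constraint on $C[i,j]$ accounts for.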
Special Types of Conserved Subnetworks.
In addition to the exactly-conserved colored subnetworks obtained through the general MCSC formula-
tion, we also consider two important variants.
1. Colorful Conserved Subnetworks. A colorful subnetwork T is one that has at least two distinct
colors represented in the coloring of its nodes, i.e. there exist
$c_\ell, c_h \in \bigcup_{v_j \in V_T} \{c_j\}$ with $c_\ell \neq c_h$. In some of the
datasets that we analyzed, certain colors were dominant in the input to such extent that all subnet-
works identified by our method had all nodes colored the same. By restricting focus to colorful
subnetworks, it is possible, e.g., to capture conserved patterns of potential driver alterations and
their impact on their vicinity in the interaction network, in the form of expression alterations. In
order to identify the maximal colorful conserved subnetwork of a given depth t in the tumor samples,
we only need to keep track of the colorful subnetworks in each iteration - since any colorful
network must contain a connected colorful subnetwork.
2. Subnetworks Conserved within error rate δ . In order to reduce the sensitivity of our method to
noise (or lack of precision in generating the data) in the input when detecting conserved patterns,
we extend our formulation to allow some “errors” in identifying conserved subnetworks. We
define δ , the error rate of a colored subnetwork T as the maximum allowable fraction of nodes of
T without an assigned color in any sample Pi that shares T . For tolerating an error rate of δ , we
extend our algorithm to generate candidate subnetworks S for the MCSC problem by performing
a post-processing step in which the list of samples sharing subnetwork T is increased by including
all samples that share T with an error rate of δ . (Note that our notion of error is restricted to nodes
that do not have a color, i.e. an observed alteration, in each specific sample.)
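The error-tolerant notion of sharing can be sketched as follows. This is one possible reading of the definition above and should be treated as an assumption: a sample shares T with error rate δ if at most a fraction δ of the nodes of T carry no color at all in that sample, while a node that is colored but lacks the template color still disqualifies the sample. The function name and the toy genes are hypothetical.

```python
from typing import Dict, Set


def shares_with_error(template: Dict[str, str], colors: Dict[str, Set[str]],
                      delta: float) -> bool:
    """True if the sample matches the template up to a fraction delta of uncolored nodes."""
    missing = 0
    for gene, color in template.items():
        observed = colors.get(gene, set())
        if color in observed:
            continue                      # node carries the template color
        if observed:
            return False                  # colored differently: not counted as an "error"
        missing += 1                      # uncolored node: tolerated up to delta
    # Small epsilon guards against floating point round-off in delta * |T|.
    return missing <= delta * len(template) + 1e-9


template = {g: "gain" for g in ("g1", "g2", "g3", "g4", "g5")}
one_missing = {g: {"gain"} for g in ("g1", "g2", "g3", "g4")}
two_missing = {g: {"gain"} for g in ("g1", "g2", "g3")}
conflicting = {**one_missing, "g5": {"loss"}}

ok = shares_with_error(template, one_missing, delta=0.2)
too_many = shares_with_error(template, two_missing, delta=0.2)
mismatch = shares_with_error(template, conflicting, delta=0.2)
```

With δ = 0.2 and a five-node template, exactly one uncolored node is tolerated, so the sample list of T would grow by the first sample only.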
5.5 Results
5.5.1 Dataset Used
We obtained somatic mutation, copy number aberration and RNA-seq based gene-expression data from
three distinct cancer types - glioblastoma multiforme (GBM) [175], breast adenocarcinoma (BRCA)
[177], and colon adenocarcinoma (COAD) [170] from The Cancer Genome Atlas (TCGA) datasets.
In addition, we distinguish four commonly observed molecular subtypes (i.e. Luminal A, Luminal B,
Triple-negative/basal-like and HER2-enriched) from the BRCA cohort. For each sample, we obtained
the list of genes which harbor somatic mutations, copy number aberrations, or are expression outliers as
per below.
Somatic Mutations. All non-silent variant calls that were identified by at least one tool among MUSE,
MuTect2, SomaticSniper and VarScan2 were considered.
Copy Number Aberrations. CNA segmented data from NCI-GDC were further processed using Nexus
Copy Number Discovery Edition Version 9.0 (BioDiscovery, Inc., El Segundo, CA) to identify aberrant
regions in the genome. We restricted our analysis to the most confident CNA calls selecting only those
genes with high copy gain or homozygous copy loss.
Expression outliers. We used HTSeq-FPKM-UQ normalized RNA-seq expression data to which we
applied the GESD test [144]. In particular, we used GESD test to compare the transcriptome profile of
each tumor sample (one at a time) with that from a number of available normal samples. For each gene,
if the tumor sample was identified as the most extremely deviated sample (using critical value α = 0.1),
the corresponding gene was marked as an expression-outlier for that tumor sample. This procedure was
repeated for every tumor sample. Finally, comparing the tumor expression profile of these outlier genes
to the normal samples, their up or down regulation expression patterns were determined.
5.5.2 Maximal Colored Subnetworks Across Cancer Types
We used cd-CAP to solve the maximum conserved colored subnetwork identification problem exactly
in (each one of the four) protein-interaction network(s) on each cancer type - for every feasible value
of network depth. As can be easily observed, the depth and the size of the identified subnetwork are
inversely related. We say that a given value of the network depth is feasible if (i) the depth is at least
10% of the cohort size, (ii) the maximum network size for that depth is at least 3, (iii) the number of
“candidate” subnetworks is at most 2M per iteration when running cd-CAP for that depth.
The number of maximal solutions of cd-CAP as a function of network depth for each cancer type
(COAD, GBM, BRCA Luminal A, and BRCA Luminal B) is shown in figure 5.2A-D on STRING v10
PPI network with high confidence edges. In general, for a fixed network size, the number of distinct
networks of that size decreases as the network depth increases. One can observe the “valleys” in the
colored plots in figure 5.2A-D which correspond to the largest depth that can be obtained for a given
subnetwork size. Throughout the remainder of the paper we focus on the colored subnetworks of each
given size for which the network depth is maximum possible - which correspond to the valleys in the
plots. If for a given subnetwork size and the corresponding maximal depth, cd-CAP returns more than 1
subnetwork, we discard those solutions.
Most of the subnetworks, especially those with large depth, identified for each of the four cancer
types consisted of expression outlier genes (typically all upregulated or all downregulated) only (fig-
ure 5.2A-D). As the network depth decreases, maximal subnetworks that consist only of copy number
variants emerge. One of the most prominent copy-number gain subnetworks of the COAD dataset has
depth 163 out of 463 patients in the cohort. This network forms the core of the larger maximal subnet-
works cd-CAP identifies for lower depth values; it corresponds to a copy number gain of the chromo-
somal arm 20q - a known copy number aberration pattern highly specific to colorectal adenocarcinoma
tumors [170].
Another subnetwork cd-CAP identified in 15% of the 422 BRCA Luminal-A samples corresponds
to a copy number gain on chromosome 1, which is again a known aberration associated with breast
cancer [42]. With increasing depth, the maximal subnetworks cd-CAP identifies in Luminal A cohort
start to consist solely of expression outlier genes. In particular cd-CAP identified a subnetwork of eight
underexpressed genes with network depth 90 (Fig. 5.2E), including the genes EGFR, PRKCA, SPRY2,
and NRG2, known to be involved in EGFR/ERBB2/ERBB4 signaling pathways (Fig. 5.2F). EGFR is an
important driver gene involved in progression of breast tumors to advanced forms [171] and its altered
expression is observed in a number of breast cancer cases [42]. The subnetwork also included MET,
another well-known oncogene [119], and is enriched for members of the Ras signaling pathway, which
is also known for its role in oncogenesis and mediating cancer phenotypes such as over-proliferation
[57].
In order to test for the association between the subnetworks identified by cd-CAP and patient survival
outcomes, we used a risk-score defined as a linear combination of the normalized gene-expression values
of the genes in the subnetwork weighted by their estimated univariate Cox proportional-hazard regression
coefficients (see Methods section for details). Based on the risk-score values, the patients covered by the
subnetwork were stratified into two risk groups. The Luminal A subnetwork was the most significant among
all subnetworks identified in this dataset (Fig. 5.2G). The patients in the high-risk group have poor
overall survival outcome, suggesting the clinical importance of the subnetwork identified by cd-CAP.
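The risk-score computation described above can be illustrated with a short Python sketch. Several points are assumptions, since the details are deferred to the Methods section: the univariate Cox coefficients are taken as given inputs rather than fitted here, the expression values are assumed to be already normalized, and the split into two groups at the median score is a common convention that the text does not confirm. All names and numbers are hypothetical.

```python
from statistics import median
from typing import Dict


def risk_scores(expression: Dict[str, Dict[str, float]],
                cox_coef: Dict[str, float]) -> Dict[str, float]:
    """Risk score per patient: sum over subnetwork genes of coefficient * normalized expression."""
    return {
        patient: sum(cox_coef[g] * values[g] for g in cox_coef)
        for patient, values in expression.items()
    }


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Split patients into two risk groups at the median score (assumed cut-off)."""
    cut = median(scores.values())
    return {p: ("high" if s > cut else "low") for p, s in scores.items()}


# Hypothetical coefficients and normalized expression for two genes and four patients.
cox_coef = {"g1": 0.5, "g2": -1.0}
expression = {
    "p1": {"g1": 2.0, "g2": 0.0},
    "p2": {"g1": 0.0, "g2": 1.0},
    "p3": {"g1": 1.0, "g2": 1.0},
    "p4": {"g1": 2.0, "g2": -1.0},
}
scores = risk_scores(expression, cox_coef)
groups = stratify(scores)
```

The two groups produced this way would then be compared with a standard survival analysis, for example Kaplan-Meier curves with a log-rank test.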
As another example, we identified a colored subnetwork with copy number gain genes that covered
163 patients in the COAD dataset (Fig. 5.2H). The genes in this subnetwork belong to the same chro-
mosome locus 20q13, suggesting that they may comprise a single region of chromosomal amplification.
Intriguingly, all the members also form a linear pathway-like structure at the PPI level. Among them
is a group of functionally related genes consisting of transcription factors and their regulators (genes
CEBPB, NCOA’s, UBE2’s), which are known to be involved in the intracellular receptor signaling pathway
(Fig. 5.2I). CEBPB and UBE2’s are also involved in the regulation of cell cycle [82]. At the other
end of the linear subnetwork, we found MMP9 and SDC4, the established mediators of cancer invasion
and apoptosis [30, 82]. We also confirmed that this set of genes is highly predictive of the patients’
survival outcome (Fig. 5.2J). These results support the functional importance and clinical relevance of
the subnetwork we identified.
5.5.3 Maximal Colorful Subnetworks Across Cancer Types
We next used cd-CAP to solve the maximum conserved colored subnetwork identification problem - with
at least two distinct colors (see Section 5.4.2 for details), in each of the four protein-interaction network(s)
and on each cancer type. Again, cd-CAP was run with every feasible value (as defined above) of network
depth. The number of maximal solutions of cd-CAP as a function of network depth for each cancer type
(COAD, GBM, BRCA Luminal A, and BRCA Luminal B) is shown in figure 5.3A-D on STRING v10
PPI network with high confidence edges. Note that we distinguish here the maximal subnetworks with
one or two sequence-level alterations (i.e. somatic mutations and copy number alterations) – which
are of potential interest since their neighboring expression-level alterations are possibly caused by these
sequence-level alterations (figure 5.3E provides an example) – from all the other cases. As before, we
only focus on the maximal colorful subnetworks of every possible size for which the network depth is
maximum possible and discard the solutions when cd-CAP returns more than 1 colorful subnetwork for
each feasible value of network depth.
One colorful COAD subnetwork of note is composed of overexpressed genes with an additional
copy number gain gene that covers 108 patients (Fig. 5.3E). This subnetwork is mainly enriched for
genes involved in ribosome biogenesis (Fig. 5.3G). Cancer has been long known to have an increased
demand on ribosome biogenesis [120], and increased ribosome generation has been reported to contribute
to cancer development [131]. The biological relevance of this subnetwork is also supported by survival
analysis, which shows a strong differentiation between the high-risk and low-risk groups - see figure
5.3F.
Another colorful subnetwork we observed in 58 BRCA Luminal A samples consists of four copy
number gained genes, an overexpressed gene, and two underexpressed genes, including EGFR (Fig.
5.3H). All copy-number gained genes and the overexpressed gene are located in chromosome 1q, com-
monly reported in breast cancer [42]. The subnetwork involves an interesting combination of the down-
regulation of the cancer gene EGFR and the amplification of a group of genes involved in T-cell receptor
signaling (PTPRC, CD247, and ARPC5; see figure 5.3I). Thus we may surmise that the covered popula-
tion of patients potentially have relatively low cancer proliferation index with higher anti-tumor immune
response, which can be highly relevant indicators with regard to clinical outcome. Indeed, this subnet-
work is significantly associated with patients’ survival (Fig. 5.3J).
5.5.4 Multiple-Subnetwork Analysis Across Cancer Types
We next sought to detect up to 5 subnetworks per cancer type that collectively cover maximum possible
number of colored nodes by solving the MCSC problem on STRING v10.5 network (with experimentally
validated edges). The subnetwork extension error rate was set to 20%, and we restricted the search space
to subnetworks which do not consist only of expression outlier nodes, in order to obtain what we believe
to be more biologically interesting results. Parameter t was chosen for each dataset in a way that made it
possible to construct all candidate subnetworks of maximum possible size while keeping the total number
of candidate subnetworks below $2 \times 10^6$, making the problem solvable in a reasonable amount of time. We
set t to 69 (15% of the patients), 62 (10% of the patients), and 110 (10% of the patients) respectively for
COAD, GBM, and BRCA datasets. Table 5.1 shows the size, per sample depth and the coloring of the
nodes in the resulting subnetworks.
We note that the subnetworks identified in the GBM dataset had the lowest depth (10-15% of the
samples). COAD and BRCA datasets on the other hand have much larger depth (respectively 30-48% and
15-32% of the samples). Smaller subnetworks of the GBM dataset solely consist of copy number gain
genes on chromosome 7q, a known amplification in GBM [22]. The two large subnetworks each contain
a single gene with copy number gain (SEC61G and EGFR, respectively) accompanied by several
overexpressed genes. The BRCA dataset exhibits a similar pattern: each of the four large subnetworks contains
a single copy number gain gene from chromosome 8q, (NSMCE2 in one and MYC in the remaining
three subnetworks). Subnetworks detected in COAD dataset were much more colorful and recurrently
conserved in a larger fraction of samples than those in the other datasets. All genes with copy number
gain are located in chromosome 20q.
We identified a subnetwork with 15 nodes (11 genes with copy number gain, 1 overexpressed and
2 underexpressed genes) in 149 COAD patients (Fig. 5.4A). All 11 copy number gain genes belong to
chromosome 20q. IL6R, PLCG1, PTPN1, and HCK are involved in cytokine/interferon signaling to activate
immune cells to counter proliferating tumor cells [160] (Fig. 5.4B). UBE2I, AURKA, and MAPRE1
are involved in cell cycle processes. This subnetwork was found to be associated with patients’ survival
outcome (Fig. 5.4C).
We identified another subnetwork with 15 nodes (14 overexpressed genes and 1 copy number gain gene)
in 313 breast cancer patients (Fig. 5.4D). Genes in this subnetwork are involved in cell cycle processes
(Fig. 5.4E). In particular, the cell cycle checkpoint processes were dysregulated, which is known to drive
tumor initiation [194]. The subnetwork was found to be associated with patients’ survival
outcome (Fig. 5.4F), demonstrating its clinical relevance.
5.5.5 Empirical P-Value Estimates Confirm the Significance of cd-CAP Identified Networks
To evaluate the significance of cd-CAP’s findings, we performed the permutation test described in
Section 5.7.1 1000 times on each cancer type for each setting of subnetwork constraints. Figure 5.5 shows
the distribution of the empirical p-value estimates. (The lower-bound results look similar to those
presented in the figure and are thus omitted.) In the permutation tests, all cd-CAP identified subnetworks
(without additional constraints) of size 2-5 were composed solely of expression-altered genes; in contrast,
several larger CNV-rich subnetworks were observed in the TCGA COAD dataset and others, further
confirming the significance of our findings. The colorful subnetworks presented in Figure 5.3 are even less
likely to occur at random (we therefore omit empirical p-value estimates for the networks in Figure 5.3).
5.6 Discussion
In this study, we introduced a novel combinatorial framework and an associated tool named cd-CAP,
which can identify (one or more) subnetworks of an interaction network where genes exhibit conserved
alteration patterns across many tumor samples. Compared with state-of-the-art methods (e.g., [6,
76]), cd-CAP differentiates the alteration types associated with each gene (rather than relying on binary
information of a gene being altered or not), and simultaneously detects multiple alteration-type-conserved
subnetworks.
cd-CAP provides the user with two major options. (a) It computes the largest colored subnetwork that
appears in at least t samples. This option exhibits a significant speed advantage over available ILP-based
approaches. Its a-priori based algorithmic formulation allows flexible integration of special constraints
(on maximal subnetworks), which not only simplifies complicated ILP constraints but also further reduces
the number of candidate subnetworks in iteration steps (a good example is the “colorful conserved
subnetworks” introduced in Section 5.4.2). However, the identified subnetworks are required
to be conserved, i.e., each node admits only one alteration type among the samples sharing it (although
we have relaxed constraints that allow each sample to have a few nodes without any alterations, i.e.,
colors). In the future, we may extend the definition of a network to include nodes with color mismatches
(for example, according to the definition in [6] or [185]) by modifying cd-CAP’s candidate subnetwork
generation algorithm. (b) It solves the maximum conserved subnetwork cover (MCSC) problem via ILP
to cover the maximum number of nodes in all samples with at most l colored subnetworks (l is user
defined). In the future, we aim to refine the MCSC formulation with a reduced number of parameters
and hope to develop exact or approximate solutions.
Subnetworks identified by cd-CAP in the COAD, GBM, and BRCA datasets from TCGA are typically
enriched with genes harboring gene-expression alterations or copy-number gain. Notably, we observed
that genes in subnetworks with copy-number amplification are universally located in the same
chromosomal locus. Many of these genes have known interactions and are functionally similar,
demonstrating the ability of cd-CAP to capture functionally active subnetworks conserved across a large
number of tumor samples. These subnetworks seem to overlap with pathways critical for oncogenesis. In the
datasets analyzed, we observed cell cycle, apoptosis, RNA processing, and immune system processes that are
known to be dysregulated in a large fraction of tumors. cd-CAP also captured subnetworks relevant to
EGFR/ERBB2 signaling pathways, which have distinct expression patterns in specific subtypes of breast
cancer [42, 133]. Survival analysis of cd-CAP identified subnetworks also confirmed their substantial
clinical relevance.
5.7 Methods
5.7.1 Significance of the Identified Subnetworks
Under the assumption that each gene is altered independently, it is possible to apply the conventional
permutation test [18, 89, 190] to assess the statistical significance of the subnetworks identified by
cd-CAP as follows. Let $\mathcal{C}_i = \{(v_{i,j}, c) : c \in C_{i,j} \neq \emptyset, v_{i,j} \in V\}$ be a
binary relation representing the existing colors on each node of sample network $G_i$. A permuted copy of
the interaction network $G'_i = (V, E, \mathcal{C}'_i)$ is generated (under the null hypothesis) by randomly
shuffling the range of $\mathcal{C}_i$, such that each node $v_{i,j}$ takes a new set of colors $C'_{i,j}$
with the total number of colors $\sum_j |C_{i,j}|$ in $G_i$ preserved. (In other words,
$\sum_j |C_{i,j}| = \sum_j |C'_{i,j}|$, and a simple implementation assigns $|C'_{i,j}|$ by randomly
shuffling $(|C_{i,j}| : j = 1, 2, \cdots, n)$.) An entire set of permuted sample networks consists of each
randomly generated $G'_i$, and this permutation test is repeated sufficiently many (by default 1000) times.
For a particular size-$k$ subnetwork $T = (V_T, E_T)$ identified by cd-CAP (on $t$ samples), we define $P_1$
as the fraction of these permutation tests in which any subnetwork of size at least $k$ appears in $t$ or
more samples.
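The shuffling step and the resulting empirical fraction can be sketched as follows. This is a minimal illustration under stated assumptions, not the cd-CAP implementation: the function names, the dictionary representation of a sample, and the idea of shuffling whole color sets across nodes are hypothetical choices made here for clarity. The sketch also assumes that the largest subnetwork size found at depth t in each permuted dataset (`permuted_max_sizes`) is obtained separately by re-running the subnetwork search, which is not shown.

```python
import random


def permute_sample_colors(node_colors, rng):
    """Return a permuted copy of one sample's node coloring.

    node_colors maps each node (gene) to its set of colors (alteration
    types); unaltered nodes map to an empty set. Whole color sets are
    shuffled across nodes, so the multiset of set sizes, and therefore
    the total number of colors in the sample, is preserved.
    """
    nodes = list(node_colors)
    color_sets = [frozenset(node_colors[v]) for v in nodes]
    rng.shuffle(color_sets)
    return dict(zip(nodes, color_sets))


def empirical_p_value(observed_size, permuted_max_sizes):
    """Fraction of permutation rounds in which the largest subnetwork
    (at the same depth t) has at least observed_size nodes."""
    hits = sum(1 for size in permuted_max_sizes if size >= observed_size)
    return hits / len(permuted_max_sizes)


# Toy sample with hypothetical alterations (illustration only).
rng = random.Random(0)
sample = {
    "EGFR": {"AMP"},
    "MYC": {"AMP", "EXP-UP"},
    "TP53": set(),
    "AURKA": {"EXP-UP"},
}
permuted = permute_sample_colors(sample, rng)
total_before = sum(len(c) for c in sample.values())
total_after = sum(len(c) for c in permuted.values())
print(total_before == total_after)
```

Shuffling whole sets is only one way to satisfy the stated constraint; any scheme that keeps the per-sample color total fixed would fit the description above.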
In fact, $P_1$ represents a lower bound on the p-value for $T$ since it ignores the inter-dependency of
node colors (gene alteration events). In particular, whole-chromosome or chromosome-arm-level copy
number amplifications/deletions are commonly observed in various cancer types. To address this issue,
we apply the following procedure to calculate $P_2$ as an empirical upper bound for the p-value of $T$,
under the assumption that copy number alterations take place in whole chromosome arms. First, we
identify all genes $v_j \in V$ on the same chromosome arm, $chr(v_j)$, and construct a set of supernodes
$U^{chr}_i = \{chr(v_{i,j}) : \exists c, (v_{i,j}, c) \in \mathcal{C}_i\}$ from the genes on the same
chromosome arm for each sample $P_i$. Let $N_E = |\{(v_{i,j}, E) \in \mathcal{C}_i\}|$ denote the number
of nodes with color $E$ (corresponding to either a copy number gain or loss) in sample $P_i$. Then, each
supernode is assigned the color $E$ independently with probability $N_E / |\mathcal{C}_i|$, which
guarantees that the expected count of $E$ in $P_i$ is preserved. Finally, we randomly assign the
remaining colors to those nodes without a color assignment thus far, to obtain a new randomly permuted
interaction network $G''_i = (V, E, \mathcal{C}''_i)$ towards an empirical p-value (upper bound) estimate.
We again repeat this process sufficiently many (by default 1000) times to generate distinct permuted
datasets and derive $P_2$ by counting the fraction of these datasets in which any subnetwork of size at
least $k$ appears in $t$ or more samples. The true statistical significance is expected to be in the range
$[P_1, P_2]$ provided that chromosome arms form the largest units of alteration.
5.7.2 Pathway enrichment analysis
The set of genes in the subnetwork was tested for enrichment against gene sets of pathways present in the
Molecular Signatures Database (MSigDB) v6.0 [162]. A hypergeometric-test-based gene set enrichment
analysis [162] was used for this purpose. A cut-off threshold of false discovery rate (FDR) ≤ 0.01 was
used to obtain the significantly enriched pathways.
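As a rough illustration of this kind of test, the sketch below computes an upper-tail hypergeometric p-value for the overlap between a subnetwork and a pathway gene set, followed by a multiple-testing adjustment. It is not the pipeline used in the thesis: the function names are hypothetical, the gene counts are placeholders, and the Benjamini-Hochberg procedure is an assumption, since the text above specifies only an FDR threshold and not the method used to estimate it.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, pathway_size, query_size, overlap):
    """Upper-tail probability P(X >= overlap), where X is the number of
    pathway genes among query_size genes drawn without replacement from
    a universe of universe_size genes containing pathway_size pathway genes."""
    upper = min(pathway_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(pathway_size, x) * comb(universe_size - pathway_size, query_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder numbers: a 15-gene subnetwork sharing 5 genes with a
# 100-gene pathway, in a universe of 20000 genes.
p = hypergeom_enrichment_p(20000, 100, 15, 5)
fdr = benjamini_hochberg([p, 0.2, 0.5])
significant = [q <= 0.01 for q in fdr]
```

In practice one p-value is computed per pathway gene set, and the adjustment is applied across all tested gene sets before the threshold is used.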
5.7.3 Association of sub-networks with patients’ survival outcome
In order to assess the association of identified subnetworks with patients’ survival outcome, we used
a risk score based on the (weighted) aggregate expression of the genes in the subnetwork. The risk
score ($S$) of a patient is defined as the sum of the normalized gene-expression values in the subnetwork,
each weighted by the estimated univariate Cox proportional-hazard regression coefficient [15], i.e.,
$S = \sum_{i=1}^{k} \beta_i x_{ij}$. Here $i$ and $j$ represent a gene and a patient, respectively,
$\beta_i$ is the Cox regression coefficient for gene $i$, $x_{ij}$ is the normalized gene expression of
gene $i$ in patient $j$, and $k$ is the number of genes in the subnetwork. The normalized gene-expression
values were fitted against overall survival time with living status as the censored event using univariate
Cox proportional-hazard regression (exact method). Based on the risk-score values, patients were stratified
into two groups: a low-risk group (patients with $S$ < mean of $S$) and a high-risk group (patients with
$S$ ≥ mean of $S$). Note that only those patients that are covered by the subnetwork are considered in the
analysis above.
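The risk-score computation and the mean-based split can be sketched as below. This is an illustrative example only: the function names, gene labels, expression values, and coefficients are made up, and the Cox coefficients are taken as given inputs (fitting them requires a survival-analysis library and is not shown).

```python
def risk_scores(expression, cox_coefficients):
    """Compute S = sum_i beta_i * x_ij for every patient j.

    expression: {patient: {gene: normalized expression value}}
    cox_coefficients: {gene: univariate Cox regression coefficient}
    """
    return {
        patient: sum(beta * values[gene] for gene, beta in cox_coefficients.items())
        for patient, values in expression.items()
    }


def stratify_by_mean(scores):
    """Label patients 'high' if S >= mean of S, otherwise 'low'."""
    mean_s = sum(scores.values()) / len(scores)
    return {p: ("high" if s >= mean_s else "low") for p, s in scores.items()}


# Hypothetical two-gene subnetwork and three covered patients.
betas = {"GENE_A": 0.5, "GENE_B": -1.0}
expr = {
    "patient_1": {"GENE_A": 2.0, "GENE_B": 1.0},
    "patient_2": {"GENE_A": 4.0, "GENE_B": 0.0},
    "patient_3": {"GENE_A": 0.0, "GENE_B": 1.0},
}
scores = risk_scores(expr, betas)
groups = stratify_by_mean(scores)
```

The two groups produced this way are what a Kaplan-Meier comparison such as those in Figures 5.2 to 5.4 would then be drawn from.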
Figure 5.1: Schematic overview of cd-CAP. Multi-omics alteration profiles of a cohort of tumor samples are identified using appropriate bioinformatics tools. The alteration information is combined with gene-level information in the form of a sample-gene alteration matrix. Each alteration type is assigned a distinct color. Using a (signaling) interaction network, cd-CAP identifies subnetworks with conserved alteration patterns.
Table 5.1: Five subnetworks identified by cd-CAP in multi-subnetwork mode for each cancer type: respective columns below depict the subnetwork size, depth, and the number of nodes in the subnetwork with copy number amplification (AMP), expression increase (EXP-UP) or decrease (EXP-DOWN).
Cancer  Network#  Size  Depth  AMP  EXP-UP  EXP-DOWN
COAD    1         6     206    1    5       0
COAD    2         11    152    6    5       0
COAD    3         12    137    7    3       2
COAD    4         15    149    11   1       3
COAD    5         15    223    2    10      3
GBM     1         4     72     4    0       0
GBM     2         4     69     4    0       0
GBM     3         9     67     9    0       0
GBM     4         16    70     1    15      0
GBM     5         36    96     1    32      3
BRCA    1         8     164    7    0       1
BRCA    2         10    332    1    9       0
BRCA    3         11    360    1    10      0
BRCA    4         15    313    1    14      0
BRCA    5         15    335    1    14      0
Figure 5.2: Conserved colored subnetworks. (A-D) Number of maximal solutions and the size of the conserved colored subnetwork obtained using the MCSI formulation, as a function of network depth t, in each of four cancer types analyzed, on STRING v10 (with high confidence nodes) PPI network. The horizontal axis denotes the depth (number of patients) of the network. For the blue plot, the vertical axis denotes the maximum possible network size (in terms of the number of nodes) and thus it is strictly non-increasing by definition. For the plots with different colors, the vertical axis denotes the number of distinct networks with network size equal to that indicated by the blue plot. (E-G) One of the 11 maximal colored subnetworks identified in BRCA Luminal A dataset. (E) The colored subnetwork (with 8 nodes) topology. (F) Pathways dysregulated by alterations harboured by the genes in the subnetwork - these genes are involved in EGFR, ERBB2, and FGFR signaling pathways. (G) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome. (H-J) One of the 10 maximal colored subnetworks identified in COAD dataset. (H) The colored subnetwork (with 9 nodes) topology. (I) Pathways dysregulated by the alterations harboured by the genes in the subnetwork - these genes are involved in signal transduction and apoptotic process. (J) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (73 High Risk vs 83 Low Risk patients).
Figure 5.3: Colorful maximal subnetworks. (A-D) Number of maximal solutions and the size of the conserved colorful subnetwork obtained using the MCSI formulation, as a function of network depth t, in each of four cancer types analyzed on the STRING v10 (high confidence edges) PPI network. The horizontal axis denotes the depth (number of patients) of the network. For the blue plot, the vertical axis denotes the maximum possible network size (in terms of the number of nodes) and thus it is strictly non-increasing by definition. For the plots with different colors, the vertical axis denotes the number of distinct networks with network size equal to that indicated by the blue plot. (E-G) One of the maximal colorful subnetworks identified in the COAD dataset with depth 108 (patients). (E) The colored subnetwork (with 9 nodes) topology - obtained from STRING v10 (with experimentally validated edges) PPI network. (F) Pathways dysregulated by alterations harboured by the genes in the subnetwork. (G) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (59 High Risk vs 47 Low Risk patients). (H-J) One of the maximal colorful subnetworks identified in the Luminal A dataset with no color restrictions, with depth of 58 (patients). (H) The colored subnetwork (with 8 nodes) topology - obtained in the REACTOME PPI network. (I) Pathways dysregulated by the alterations harboured by the genes in the subnetwork. (J) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (30 High Risk vs 30 Low Risk patients).
Figure 5.4: Multiple subnetwork analysis. The two largest of the 15 subnetworks identified across the COAD, GBM and BRCA data sets (5 each) through the MCSC formulation of cd-CAP on STRING v10.5 (with experimentally validated edges) PPI network. The number in parentheses next to each node represents the univariate Cox proportional-hazard regression coefficient estimated for that gene, used as its weight in the risk-score calculation to stratify the patients into two distinct risk groups (see Section 5.7.3 for details). (A-C) The largest of the 5 COAD subnetworks with a network depth of 149 (patients). (A) The subnetwork topology (with 15 nodes). (B) Pathways dysregulated by alterations harboured by the genes in the subnetwork. (C) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (69 High Risk vs 78 Low Risk patients). (D-F) The largest of the 5 BRCA subnetworks with a network depth of 313 (patients). (D) The subnetwork topology (with 15 nodes). (E) Pathways dysregulated by the alterations harboured by the genes in the subnetwork. (F) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (33 High Risk vs 278 Low Risk patients).
Figure 5.5: Empirical p-value estimates for the maximum size subnetworks identified by cd-CAP. Compared with the subnetworks observed in real mutation profiles, those identified by cd-CAP in permutation tests (with identical t values) were much smaller, implying a p-value of < 0.001 for each of the colored subnetworks presented in Figure 5.2.
Chapter 6
Conclusion
In recent years, there has been an unprecedented increase in multi-dimensional high-throughput data
profiling (especially of the genome, transcriptome, proteome, and epigenome) of cancer patients. This has
revealed extensive mutational heterogeneity in cancer (sub)types, yielding a long-tailed
distribution of mutated genes across patients and implying the existence of many rare/private driver
genes. Thus, there is a great need for computational methods to mine these massive datasets and prioritize
clinically actionable driver events to guide treatment in precision oncology.
The primary goal of this thesis was to develop novel computational algorithms to identify and priori-
tize cancer driver genes and provide insight into the heterogeneous biology to guide precision oncology.
We introduced HIT’nDRIVE, a combinatorial algorithm to prioritize cancer driver genes. HIT’nDRIVE
models the information flow connecting the genomic aberrations to the changes in global expression pat-
tern in the transcriptome. HIT’nDRIVE measures the potential impact of genomic aberrations on changes
in the global expression of other genes/proteins which are in close proximity in a gene/protein-interaction
network. HIT’nDRIVE then prioritizes those aberrations with the highest impact as cancer driver genes.
We formulated the driver prioritization problem as a “random-walk facility location” (RWFL) problem,
which differs from the standard facility location problem by its use of “hitting time”, the expected number
of hops for a random walk to reach a “target” gene from a “source” gene, as a distance measure in an
interaction network. HIT’nDRIVE uses “inverse” hitting time as a measure of the influence of a source gene
over a target gene to identify the subset of sequence-wise altered (source) genes whose overall influence
over expression-altered (target) genes is the maximum possible.
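To make the notion of hitting time concrete, the sketch below computes expected hitting times for a simple random walk on a small undirected, unweighted graph and takes the reciprocal as an influence value. It is an illustration under stated assumptions rather than the HIT’nDRIVE implementation: the function names and the toy graph are hypothetical, the graph is assumed to be connected, and the fixed-point iteration is just one convenient way to solve the defining equations h(target) = 0 and h(u) = 1 + average of h over the neighbors of u.

```python
def hitting_times(adjacency, target, tol=1e-12, max_iter=100_000):
    """Expected number of hops for a simple random walk started at each
    node to first reach target.

    adjacency: {node: list of neighbor nodes}, undirected and connected.
    """
    h = {u: 0.0 for u in adjacency}
    for _ in range(max_iter):
        max_change = 0.0
        for u, neighbors in adjacency.items():
            if u == target:
                continue  # h(target) stays 0 by definition
            new_value = 1.0 + sum(h[v] for v in neighbors) / len(neighbors)
            max_change = max(max_change, abs(new_value - h[u]))
            h[u] = new_value
        if max_change < tol:
            break
    return h


def inverse_hitting_time_influence(adjacency, source, target):
    """Influence of source over target as the reciprocal hitting time."""
    h = hitting_times(adjacency, target)[source]
    return 1.0 / h if h > 0 else float("inf")


# Toy path graph a - b - c: the walk from a needs 4 expected hops to
# reach c, and the walk from b needs 3.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h_to_c = hitting_times(path, "c")
influence_a_on_c = inverse_hitting_time_influence(path, "a", "c")
```

Because hitting time depends on every route between two genes and not only the shortest one, two genes that are close in the network but separated by high-degree hubs can still have a large hitting time and hence a low influence.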
We further demonstrated that HIT’nDRIVE accurately predicts patient-specific cancer driver
genes. We also demonstrated that by using HIT’nDRIVE-identified driver genes and associated “network
modules” (sub-networks seeded by driver genes whose aggregate expression profiles correlate well with
the cancer phenotype) as features, it is possible to perform accurate phenotype classification. We also
demonstrated that these driver modules are associated with patients’ survival outcome and accurately
predict drug efficacy in pan-cancer cell lines. Altogether, HIT’nDRIVE may help clinicians contextualize
massive multi-omics data in therapeutic decision making, making widespread implementation of precision
oncology possible.
In chapter 4, we described the first-in-field integrative multi-omics characterization of a cohort of
malignant peritoneal mesothelioma (PeM). To our knowledge, this is the largest cohort of this rare tumor to
be subjected to an integrative multi-omics analysis. We presented the integrated genome, transcriptome,
and proteome landscape. BAP1 loss of function is known to be a key driver event of PeM. However, the
downstream molecular and clinical significance of BAP1 loss had not been investigated in the context of
PeM, and we showed that it is predictive for immunotherapy. We found that BAP1 loss forms a distinct
molecular subtype characterized by dysregulated gene expression patterns of chromatin remodeling, DNA
repair pathways, and immune checkpoint receptor activation. We further demonstrated that this subtype is
correlated with an inflammatory tumor microenvironment and is thus a candidate for immune checkpoint
blockade therapies. Thus, BAP1 is a biomarker for PeM immunotherapy in 50% of the cases we studied.
This is of critical importance because PeM is a rare and understudied cancer for which chemotherapy
and targeted therapies have proven ineffective. BAP1 stratification may improve drug response rates in
ongoing phase I and II clinical trials exploring the use of immune checkpoint blockade therapies in PeM.
In these trials, BAP1 status is not currently taken into account.
Further, we resolved the discordance between mRNA and protein expression patterns in this cohort,
and this may apply to other studies incorporating mass spectrometry. The discordance between mRNA
and protein levels was found to be due to multimeric protein complexes of chromatin remodeling genes,
the majority of which are direct protein-interaction partners of BAP1. This discordance is most likely
due to the ubiquitination and degradation of proteins in these BAP1-regulated complexes to maintain
functional stoichiometry.
Finally, in chapter 5, we introduced cd-CAP, a combinatorial algorithm to identify sub-networks
with conserved molecular alteration patterns across a large subset of a tumor sample cohort. cd-CAP
simultaneously identifies more than one subnetwork, and each gene within each subnetwork has labels
specific to the alteration types it harbors. Notably, we demonstrated that many of the largest highly
conserved subnetworks within a tumor type consist solely of genes that have been subject to copy number
gain, typically located on the same chromosomal arm and thus likely a result of a single, large-scale
amplification. We have also demonstrated that the subnetworks identified using cd-CAP are associated
with patients’ survival outcome and hence are clinically important.
6.1 Future Perspective
Continuous development and validation of novel algorithms for the identification and prioritization of
cancer driver genes, especially rare driver genes, will be essential given the exponential growth in the
number of sequenced tumors. Many studies over the past decade have focused on driver genomic aberrations
in the protein-coding regions of genes. Driver genes harbouring aberrations in the non-coding regions are
emerging. With the rise of multi-omics data profiled for a given tumor, efficient computational algorithms
to integrate meaningful information from these multi-omics data together with curated knowledge of
signaling pathways/networks will be necessary. Inclusion of epigenome (DNA methylation) and 3D genome
interaction (Hi-C) data together with genome, transcriptome, and proteome data would be necessary to
understand cancer initiation and progression. Furthermore, I believe that as regulatory interaction
networks covering the non-coding genome become available in the near future, they will trigger the next
wave of algorithms combining the different types of data mentioned above.
Inference of sub-clonal population structure and identification of sub-clonal driver genes is another
avenue that is necessary for the correct identification of driver genes. However, the shallow sequencing
depth of the available tumor whole-genome sequences has remained a bottleneck for correctly estimating
the sub-clonal population structure of tumor samples. Thus, I believe that as the cost of high-throughput
sequencing shrinks further and ultra-high-coverage genomes become more common, efficient computational
algorithms will be able to correctly identify sub-clonal driver genes, providing further insight
into tumor evolution-guided clinically actionable targets.
In the recent past, single-cell sequencing technology (single-cell DNA-seq, RNA-seq, and methylation
profiling) has surged as a promising technique to study molecular changes at single-cell resolution. I
believe that advances in the development of computational tools to analyze single-cell sequencing data for
resolving intra-tumor heterogeneity, spatial heterogeneity, and reconstructing sub-clonal population
structure in tumors will provide new insights in oncology research.
On the other hand, deep neural networks (also known as deep learning) have been recognized as an
efficient approach for learning the functional relationships between different types of related data.
Although still in its infancy, one deep neural network based method, IBM Watson Oncology
(https://www.ibm.com/watson/), is being tested for its utility in cancer therapeutics across research
centres around the world. Similarly, Google’s DeepMind Health (https://deepmind.com) is being tested to
mine patients’ medical reports to predict appropriate treatment in the UK. I believe that combining the
algorithmic approaches described in this thesis with deep neural network approaches would help such
computational tools become more powerful and robust, which is critical for precision oncology.
Bibliography
[1] AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer discovery, 7(8):818–831, 2017. ISSN 2159-8290. doi:10.1158/2159-8290.CD-17-0151. URL http://www.ncbi.nlm.nih.gov/pubmed/28572459. → pages 53
[2] I. Adzhubei, S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov, and S. R. Sunyaev. A method and server for predicting damaging missense mutations. Nature methods, 7(4):248–249, 2010. ISSN 1548-7091. doi:10.1038/nmeth0410-248. URL https://www.ncbi.nlm.nih.gov/pubmed/20354512. → pages 3
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. ISBN 1-55860-153-8. → pages 77, 80
[4] U. D. Akavia, O. Litvin, J. Kim, F. Sanchez-Garcia, D. Kotliar, et al. An integrated approach to uncover drivers of cancer. Cell, 143(6):1005–17, Dec. 2010. ISSN 1097-4172. doi:10.1016/j.cell.2010.11.013. → pages 4
[5] H. Alakus, S. E. Yost, B. Woo, R. French, G. Y. Lin, K. Jepsen, K. A. Frazer, A. M. Lowy, and O. Harismendy. BAP1 mutation is a frequent somatic event in peritoneal malignant mesothelioma. Journal of translational medicine, 13(1):122, 2015. ISSN 1479-5876. doi:10.1186/s12967-015-0485-1. URL https://www.ncbi.nlm.nih.gov/pubmed/25889843. → pages 51, 53
[6] N. Alcaraz, T. Friedrich, T. Kotzing, A. Krohmer, J. Muller, J. Pauling, and J. Baumbach. Efficient key pathway mining: combining networks and OMICS data. Integrative biology : quantitative biosciences from nano to macro, 4(7):756–64, jul 2012. ISSN 1757-9708. doi:10.1039/c2ib00133k. URL http://www.ncbi.nlm.nih.gov/pubmed/22353882. → pages 76, 87
[7] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, S. a. J. R. Aparicio, S. Behjati, A. V. Biankin, G. R. Bignell, N. Bolli, A. Borg, A.-L. Børresen-Dale, S. Boyault, B. Burkhardt, A. P. Butler, C. Caldas, H. R. Davies, C. Desmedt, R. Eils, J. E. Eyfjord, J. a. Foekens, M. Greaves, F. Hosoda, B. Hutter, T. Ilicic, S. Imbeaud, M. Imielinski, M. Imielinsk, N. Jager, D. T. W. Jones, D. Jones, S. Knappskog, M. Kool, S. R. Lakhani, C. Lopez-Otin, S. Martin, N. C. Munshi, H. Nakamura, P. a. Northcott, M. Pajic, E. Papaemmanuil, A. Paradiso, J. V. Pearson, X. S. Puente, K. Raine, M. Ramakrishna, A. L. Richardson, J. Richter, P. Rosenstiel, M. Schlesner, T. N. Schumacher, P. N. Span, J. W. Teague, Y. Totoki, A. N. J. Tutt, R. Valdes-Mas, M. M. van Buuren, L. van ’t Veer, A. Vincent-Salomon, N. Waddell, L. R. Yates, Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium, ICGC MMML-Seq Consortium, ICGC PedBrain, J. Zucman-Rossi, P. A. Futreal, U. McDermott, P. Lichter, M. Meyerson, S. M. Grimmond, R. Siebert, E. Campo, T. Shibata, S. M. Pfister, P. J. Campbell, and M. R. Stratton. Signatures of mutational processes in human cancer. Nature, 500(7463):415–21, aug 2013. ISSN 1476-4687. doi:10.1038/nature12477. URL https://www.ncbi.nlm.nih.gov/pubmed/23945592. → pages 52, 75
[8] L. B. Alexandrov, P. H. Jones, D. C. Wedge, J. E. Sale, P. J. Campbell, S. Nik-Zainal, and M. R. Stratton. Clock-like mutational processes in human somatic cells. Nature Genetics, 47(12):1402–1407, 2015. ISSN 15461718. doi:10.1038/ng.3441. URL https://www.ncbi.nlm.nih.gov/pubmed/26551669. → pages 67
[9] E. W. Alley, J. Lopez, A. Santoro, A. Morosky, S. Saraf, B. Piperdi, and E. van Brummelen. Clinical safety and activity of pembrolizumab in patients with malignant pleural mesothelioma (KEYNOTE-028): preliminary results from a non-randomised, open-label, phase 1b trial. The Lancet Oncology, 18(5):623–630, 2017. ISSN 14745488. doi:10.1016/S1470-2045(17)30169-9. URL https://www.ncbi.nlm.nih.gov/pubmed/28291584. → pages 61
[10] S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome biology, 11(10):R106, 2010. ISSN 1474-760X. doi:10.1186/gb-2010-11-10-r106. URL http://www.ncbi.nlm.nih.gov/pubmed/209796212. → pages 65
[11] S. Anders, P. T. Pyl, and W. Huber. HTSeq-A Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2):166–169, 2015. ISSN 14602059. doi:10.1093/bioinformatics/btu638. URL https://www.ncbi.nlm.nih.gov/pubmed/25260700. → pages 65
[12] M. H. Bailey, C. Tokheim, E. Porta-Pardo, S. Sengupta, D. Bertrand, A. Weerasinghe, A. Colaprico, M. C. Wendl, J. Kim, B. Reardon, P. K.-S. Ng, K. J. Jeong, S. Cao, Z. Wang, J. Gao, Q. Gao, F. Wang, E. M. Liu, L. Mularoni, C. Rubio-Perez, N. Nagarajan, I. Cortes-Ciriano, D. C. Zhou, W.-W. Liang, J. M. Hess, V. D. Yellapantula, D. Tamborero, A. Gonzalez-Perez, C. Suphavilai, J. Y. Ko, E. Khurana, P. J. Park, E. M. Van Allen, H. Liang, MC3 Working Group, Cancer Genome Atlas Research Network, M. S. Lawrence, A. Godzik, N. Lopez-Bigas, J. Stuart, D. Wheeler, G. Getz, K. Chen, A. J. Lazar, G. B. Mills, R. Karchin, and L. Ding. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell, 173(2):371–385.e18, apr 2018. ISSN 1097-4172. doi:10.1016/j.cell.2018.02.060. URL http://www.ncbi.nlm.nih.gov/pubmed/29625053. → pages 34, 75
[13] C. E. Barbieri, S. C. Baca, M. S. Lawrence, F. Demichelis, M. Blattner, J.-P. Theurillat, T. a. White, P. Stojanov, E. Van Allen, N. Stransky, E. Nickerson, S.-S. Chae, G. Boysen, D. Auclair, R. C. Onofrio, K. Park, N. Kitabayashi, T. Y. MacDonald, K. Sheikh, T. Vuong, C. Guiducci, K. Cibulskis, A. Sivachenko, S. L. Carter, G. Saksena, D. Voet, W. M. Hussain, A. H. Ramos, W. Winckler, M. C. Redman, K. Ardlie, A. K. Tewari, J. M. Mosquera, N. Rupp, P. J. Wild, H. Moch, C. Morrissey, P. S. Nelson, P. W. Kantoff, S. B. Gabriel, T. R. Golub, M. Meyerson, E. S. Lander, G. Getz, M. a. Rubin, and L. a. Garraway. Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nature genetics, 44(6):685–9, jun 2012. ISSN 1546-1718. doi:10.1038/ng.2279. URL https://www.ncbi.nlm.nih.gov/pubmed/22610119. → pages 36
[14] A. Bashashati, G. Haffari, J. Ding, G. Ha, K. Lui, et al. DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome biology, 13(12):R124, Dec. 2012. ISSN 1465-6914. doi:10.1186/gb-2012-13-12-r124. → pages 6, 19
[15] D. G. Beer, S. L. R. Kardia, C.-C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. G. Taylor, M. D. Iannettoni, M. B. Orringer, and S. Hanash. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature medicine, 8(8):816–24, aug 2002. ISSN 1078-8956. doi:10.1038/nm733. → pages 44, 89
[16] H. Beltran, K. Eng, J. M. Mosquera, A. Sigaras, A. Romanel, H. Rennert, M. Kossai, C. Pauli, B. Faltas, J. Fontugne, K. Park, J. Banfelder, D. Prandi, N. Madhukar, T. Zhang, J. Padilla, N. Greco, T. J. McNary, E. Herrscher, D. Wilkes, T. Y. MacDonald, H. Xue, V. Vacic, A.-K. Emde, D. Oschwald, A. Y. Tan, Z. Chen, C. Collins, M. E. Gleave, Y. Wang, D. Chakravarty, M. Schiffman, R. Kim, F. Campagne, B. D. Robinson, D. M. Nanus, S. T. Tagawa, J. Z. Xiang, A. Smogorzewska, F. Demichelis, D. S. Rickman, A. Sboner, O. Elemento, and M. a. Rubin. Whole-Exome Sequencing of Metastatic Cancer and Biomarkers of Treatment Response. JAMA Oncology, 10021, 2015. ISSN 2374-2437. doi:10.1001/jamaoncol.2015.1313. URL https://www.ncbi.nlm.nih.gov/pubmed/26181256. → pages 43
[17] G. Bianchini, J. M. Balko, I. A. Mayer, M. E. Sanders, and L. Gianni. Triple-negative breast cancer: challenges and opportunities of a heterogeneous disease. Nature reviews. Clinical oncology, may 2016. ISSN 1759-4782. doi:10.1038/nrclinonc.2016.66. URL http://www.ncbi.nlm.nih.gov/pubmed/27184417. → pages 39
[18] A. Bomersbach, M. Chiarandini, and F. Vandin. An Efficient Branch and Cut Algorithm to Find Frequently MutatedSubnetworks in Cancer. In M. Frith and C. N. and Storm Pedersen, editors, Algorithms in Bioinformatics, pages 27–39,Cham, 2016. Springer International Publishing. ISBN 978-3-319-43681-4. doi:10.1007/978-3-319-43681-4 3. URLhttp://link.springer.com/10.1007/978-3-642-33122-0. → pages 76, 79, 80, 88
[19] M. Bott, M. Brevet, B. S. Taylor, S. Shimizu, T. Ito, L. Wang, J. Creaney, R. a. Lake, M. F. Zakowski, B. Reva,C. Sander, R. Delsite, S. Powell, Q. Zhou, R. Shen, A. Olshen, V. Rusch, and M. Ladanyi. The nuclear deubiquitinaseBAP1 is commonly inactivated by somatic mutations and 3p21.1 losses in malignant pleural mesothelioma. Naturegenetics, 43(7):668–672, 2011. ISSN 1061-4036. doi:10.1038/ng.855. URLhttps://www.ncbi.nlm.nih.gov/pubmed/21642991. → pages 53
[20] N. J. Bowen, L. D. Walker, L. V. Matyunina, S. Logani, K. a. Totten, B. B. Benigno, and J. F. McDonald. Geneexpression profiling supports the hypothesis that human ovarian surface epithelia are multipotent and capable of servingas ovarian cancer initiating cells. BMC medical genomics, 2:71, 2009. ISSN 1755-8794. doi:10.1186/1755-8794-2-71.→ pages 24
[21] S. E. Bowyer, A. D. Rao, M. Lyle, S. Sandhu, G. V. Long, G. a. McArthur, J. M. Raleigh, R. J. Hicks, and M. Millward.Activity of trametinib in K601E and L597Q BRAF mutation-positive metastatic melanoma. Melanoma research, 24(5):504–8, 2014. ISSN 1473-5636. doi:10.1097/CMR.0000000000000099. → pages 38
[22] C. W. Brennan, R. G. W. Verhaak, A. McKenna, B. Campos, H. Noushmehr, S. R. Salama, S. Zheng, D. Chakravarty, J. Z. Sanborn, S. H. Berman, R. Beroukhim, B. Bernard, C.-J. Wu, G. Genovese, I. Shmulevich, J. Barnholtz-Sloan, L. Zou, R. Vegesna, S. A. Shukla, G. Ciriello, W. K. Yung, W. Zhang, C. Sougnez, T. Mikkelsen, K. Aldape, D. D. Bigner, E. G. Van Meir, M. Prados, A. Sloan, K. L. Black, J. Eschbacher, G. Finocchiaro, W. Friedman, D. W. Andrews, A. Guha, M. Iacocca, B. P. O’Neill, G. Foltz, J. Myers, D. J. Weisenberger, R. Penny, R. Kucherlapati, C. M. Perou, D. N. Hayes, R. Gibbs, M. Marra, G. B. Mills, E. Lander, P. Spellman, R. Wilson, C. Sander, J. Weinstein, M. Meyerson, S. Gabriel, P. W. Laird, D. Haussler, G. Getz, L. Chin, and TCGA Research Network. The somatic genomic landscape of glioblastoma. Cell, 155(2):462–77, oct 2013. ISSN 1097-4172. doi:10.1016/j.cell.2013.09.034. → pages 86
[23] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, apr 1998. ISSN 01697552. doi:10.1016/S0169-7552(98)00110-X. URL http://dx.doi.org/10.1016/S0169-7552(98)00110-X. → pages 6
[24] R. Bueno, E. W. Stawiski, L. D. Goldstein, S. Durinck, A. De Rienzo, Z. Modrusan, F. Gnad, T. T. Nguyen, B. S. Jaiswal, L. R. Chirieac, D. Sciaranghella, N. Dao, C. E. Gustafson, K. J. Munir, J. A. Hackney, A. Chaudhuri, R. Gupta, J. Guillory, K. Toy, C. Ha, Y.-J. Chen, J. Stinson, S. Chaudhuri, N. Zhang, T. D. Wu, D. J. Sugarbaker, F. J. de Sauvage, W. G. Richards, and S. Seshagiri. Comprehensive genomic analysis of malignant pleural mesothelioma identifies recurrent mutations, gene fusions and splicing alterations. Nature Genetics, 48(October 2015):1–13, 2016. ISSN 1061-4036. doi:10.1038/ng.3520. URL http://www.ncbi.nlm.nih.gov/pubmed/26928227. → pages 53, 55
[25] L. Calabro, A. Morra, E. Fonsatti, O. Cutaia, G. Amato, D. Giannarelli, A. M. Di Giacomo, R. Danielli, M. Altomonte, L. Mutti, and M. Maio. Tremelimumab for patients with chemotherapy-resistant advanced malignant mesothelioma: An open-label, single-arm, phase 2 trial. The Lancet Oncology, 14(11):1104–1111, 2013. ISSN 14702045. doi:10.1016/S1470-2045(13)70381-4. URL https://www.ncbi.nlm.nih.gov/pubmed/24035405. → pages 51, 61
[26] L. Calabro, A. Morra, E. Fonsatti, O. Cutaia, C. Fazio, D. Annesi, M. Lenoci, G. Amato, R. Danielli, M. Altomonte, D. Giannarelli, A. M. Di Giacomo, and M. Maio. Efficacy and safety of an intensified schedule of tremelimumab for chemotherapy-resistant malignant mesothelioma: An open-label, single-arm, phase 2 study. The Lancet Respiratory Medicine, 3(4):301–309, 2015. ISSN 22132619. doi:10.1016/S2213-2600(15)00092-2. URL https://www.ncbi.nlm.nih.gov/pubmed/25819643. → pages 61
[27] L. Calabro, A. Morra, D. Giannarelli, G. Amato, A. D’Incecco, A. Covre, A. Lewis, M. C. Rebelatto, R. Danielli, M. Altomonte, A. M. Di Giacomo, and M. Maio. Tremelimumab combined with durvalumab in patients with mesothelioma (NIBIT-MESO-1): an open-label, non-randomised, phase 2 study. The Lancet. Respiratory medicine, 2600(18):1–10, may 2018. ISSN 2213-2619. doi:10.1016/S2213-2600(18)30151-6. URL http://www.ncbi.nlm.nih.gov/pubmed/29773326. → pages 51
[28] P. J. Campbell. Cliques and Schisms of Cancer Genes. Cancer Cell, 32(2):129–130, 2017. ISSN 18783686. doi:10.1016/j.ccell.2017.07.009. URL http://dx.doi.org/10.1016/j.ccell.2017.07.009. → pages 76
[29] H. Carter, S. Chen, L. Isik, S. Tyekucheva, V. E. Velculescu, K. W. Kinzler, B. Vogelstein, and R. Karchin. Cancer-specific high-throughput annotation of somatic mutations: Computational prediction of driver missense mutations. Cancer Research, 69(16):6660–6667, 2009. ISSN 00085472. doi:10.1158/0008-5472.CAN-09-1133. URL https://www.ncbi.nlm.nih.gov/pubmed/19654296. → pages 3
[30] L. Carvallo, R. Munoz, F. Bustos, N. Escobedo, H. Carrasco, G. Olivares, and J. Larraín. Non-canonical Wnt signaling induces ubiquitination and degradation of Syndecan4. The Journal of biological chemistry, 285(38):29546–55, sep 2010. ISSN 1083-351X. doi:10.1074/jbc.M110.155812. URL http://www.ncbi.nlm.nih.gov/pubmed/20639201. → pages 84
[31] B. S. Carver, J. Tran, A. Gopalan, Z. Chen, S. Shaikh, A. Carracedo, A. Alimonti, C. Nardella, S. Varmeh, P. T. Scardino, C. Cordon-Cardo, W. Gerald, and P. P. Pandolfi. Aberrant ERG expression cooperates with loss of PTEN to promote cancer progression in the prostate. Nature Genetics, 41(5):619–624, 2009. ISSN 1061-4036. doi:10.1038/ng.370. URL https://www.ncbi.nlm.nih.gov/pubmed/19396168. → pages 75
[32] E. Cerami, E. Demir, N. Schultz, B. S. Taylor, and C. Sander. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE, 5(2), 2010. ISSN 19326203. doi:10.1371/journal.pone.0008918. → pages 4
[33] F. Chen, Y. Zhang, Y. Senbabaoglu, G. Ciriello, L. Yang, E. Reznik, B. Shuch, G. Micevic, G. De Velasco, E. Shinbrot, M. S. Noble, Y. Lu, K. R. Covington, L. Xi, J. A. Drummond, D. Muzny, H. Kang, J. Lee, P. Tamboli, V. Reuter, C. S. Shelley, B. A. Kaipparettu, D. P. Bottaro, A. K. Godwin, R. A. Gibbs, G. Getz, R. Kucherlapati, P. J. Park, C. Sander, E. P. Henske, J. H. Zhou, D. J. Kwiatkowski, T. H. Ho, T. K. Choueiri, J. J. Hsieh, R. Akbani, G. B. Mills, A. A. Hakimi, D. A. Wheeler, and C. J. Creighton. Multilevel Genomics-Based Taxonomy of Renal Cell Carcinoma. Cell Reports, 14(10):2476–2489, 2016. ISSN 22111247. doi:10.1016/j.celrep.2016.02.024. URL https://www.ncbi.nlm.nih.gov/pubmed/26947078. → pages 60, 75
[34] P. Chirac, D. Maillet, F. Lepretre, S. Isaac, O. Glehen, M. Figeac, L. Villeneuve, J. Peron, F. Gibson, F. Galateau-Salle, F. N. Gilly, and M. Brevet. Genomic copy number alterations in 33 malignant peritoneal mesothelioma analyzed by comparative genomic hybridization array. Human Pathology, 55:72–82, 2016. ISSN 15328392. doi:10.1016/j.humpath.2016.04.015. URL https://www.ncbi.nlm.nih.gov/pubmed/27184482. → pages 51
[35] D.-Y. Cho, Y.-A. Kim, and T. M. Przytycka. Chapter 5: Network Biology Approach to Complex Diseases. PLoS Computational Biology, 8(12):e1002820, Dec. 2012. ISSN 1553-7358. doi:10.1371/journal.pcbi.1002820. URL http://dx.plos.org/10.1371/journal.pcbi.1002820. → pages 5
[36] S. A. Chowdhury, S. E. Shackney, K. Heselmeyer-Haddad, T. Ried, A. A. Schaffer, and R. Schwartz. Algorithms to Model Single Gene, Single Chromosome, and Whole Genome Copy Number Changes Jointly in Tumor Phylogenetics. PLoS Computational Biology, 10(7), 2014. ISSN 15537358. doi:10.1371/journal.pcbi.1003740. → pages 78
[37] G. Ciriello, E. Cerami, C. Sander, and N. Schultz. Mutual exclusivity analysis identifies oncogenic network modules. Genome research, 22(2):398–406, Feb. 2012. ISSN 1549-5469. doi:10.1101/gr.125567.111. → pages 4, 76
[38] G. Ciriello, M. L. Miller, B. A. Aksoy, Y. Senbabaoglu, N. Schultz, and C. Sander. Emerging landscape of oncogenic signatures across human cancers. Nature genetics, 45(10):1127–1133, sep 2013. ISSN 1546-1718. doi:10.1038/ng.2762. → pages 36
[39] S. Condamin, O. Benichou, V. Tejedor, R. Voituriez, and J. Klafter. First-passage times in complex scale-invariant media. Nature, 450(7166):77–80, 2007. ISSN 0028-0836. doi:10.1038/nature06201. → pages 7, 12
[40] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin. Communication complexity of document exchange. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, pages 197–206, Philadelphia, 2000. Society for Industrial and Applied Mathematics. ISBN 0-89871-453-2. URL http://portal.acm.org/citation.cfm?id=338219.338252. → pages 78
[41] L. Cowen, T. Ideker, B. J. Raphael, and R. Sharan. Network propagation: a universal amplifier of genetic associations. Nature reviews. Genetics, 18(9):551–562, 2017. ISSN 1471-0064. doi:10.1038/nrg.2017.38. URL http://www.ncbi.nlm.nih.gov/pubmed/28607512. → pages 5
[42] C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Graf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, A. Langerød, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Børresen-Dale, J. D. Brenton, S. Tavare, C. Caldas, and S. Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–52, jun 2012. ISSN 1476-4687. doi:10.1038/nature10983. URL http://www.ncbi.nlm.nih.gov/pubmed/22522925. → pages 24, 39, 78, 84, 85, 88
[43] K. B. Dahlman, J. Xia, K. Hutchinson, C. Ng, D. Hucks, P. Jia, M. Atefi, Z. Su, S. Branch, P. L. Lyle, D. J. Hicks, V. Bozon, J. A. Glaspy, N. Rosen, D. B. Solit, J. L. Netterville, C. L. Vnencak-Jones, J. A. Sosman, A. Ribas, Z. Zhao, and W. Pao. BRAF L597 mutations in melanoma are associated with sensitivity to MEK inhibitors. Cancer Discovery, 2(9):791–797, 2012. ISSN 21598274. doi:10.1158/2159-8290.CD-12-0097. → pages 38
[44] P. Dao, K. Wang, C. Collins, M. Ester, A. Lapuk, and S. C. Sahinalp. Optimally discriminative subnetwork markers predict response to chemotherapy. Bioinformatics, 27(13), Jul 2011. → pages 14
[45] P. Dao, Y.-A. Kim, D. Wojtowicz, S. Madan, R. Sharan, and T. M. Przytycka. BeWith: A Between-Within method to discover relationships between cancer modules via integrated analysis of mutual exclusivity, co-occurrence and functional interactions. PLoS computational biology, 13(10):e1005695, oct 2017. ISSN 1553-7358. doi:10.1371/journal.pcbi.1005695. → pages 76
[46] N. D. Dees, Q. Zhang, C. Kandoth, M. C. Wendl, W. Schierding, D. C. Koboldt, T. B. Mooney, M. B. Callaway, D. Dooling, E. R. Mardis, R. K. Wilson, and L. Ding. MuSiC: Identifying mutational significance in cancer genomes. Genome Research, 22(8):1589–1598, 2012. ISSN 10889051. doi:10.1101/gr.134635.111. URL https://www.ncbi.nlm.nih.gov/pubmed/22759861. → pages 2
[47] M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M. Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, and M. J. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5):491–498, 2011. ISSN 1061-4036. doi:10.1038/ng.806. URL https://www.ncbi.nlm.nih.gov/pubmed/21478889. → pages 63
[48] J. Ding, M. K. McConechy, H. M. Horlings, G. Ha, F. Chun Chan, T. Funnell, S. C. Mullaly, J. Reimand, A. Bashashati, G. D. Bader, D. Huntsman, S. Aparicio, A. Condon, and S. P. Shah. Systematic analysis of somatic mutations impacting gene expression in 12 tumour types. Nature Communications, 6(1):8554, dec 2015. ISSN 2041-1723. doi:10.1038/ncomms9554. URL http://www.ncbi.nlm.nih.gov/pubmed/26436532. → pages 4
[49] L. Ding, T. J. Ley, D. E. Larson, C. A. Miller, D. C. Koboldt, et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–10, Jan. 2012. ISSN 1476-4687. doi:10.1038/nature10738. URL https://www.ncbi.nlm.nih.gov/pubmed/22237025. → pages 3
[50] A. Dobin, C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. ISSN 13674803. doi:10.1093/bioinformatics/bts635. URL https://www.ncbi.nlm.nih.gov/pubmed/23104886. → pages 64
[51] B. Dutta, L. Pusztai, Y. Qi, F. Andre, V. Lazar, G. Bianchini, N. Ueno, R. Agarwal, B. Wang, C. Y. Shiang, G. N. Hortobagyi, G. B. Mills, W. F. Symmans, and G. Balazsi. A network-based, integrative study to identify core biological pathways that drive breast cancer clinical subtypes. British journal of cancer, 106(6):1107–16, mar 2012. ISSN 1532-1827. doi:10.1038/bjc.2011.584. URL http://www.ncbi.nlm.nih.gov/pubmed/22343619. → pages 39
[52] M. Dyer and A. Frieze. On the complexity of partitioning graphs into connected subgraphs. Discrete Applied Mathematics, 10(2):139–153, 1985. ISSN 0166-218X. doi:10.1016/0166-218X(85)90008-3. URL http://www.sciencedirect.com/science/article/pii/0166218X85900083. → pages 80
[53] M. El-Kebir and G. W. Klau. Solving the Maximum-Weight Connected Subgraph Problem to Optimality. arXiv, pages 1–32, sep 2014. URL http://arxiv.org/abs/1409.5308. → pages 76
[54] M. El-Kebir, B. J. Raphael, R. Shamir, R. Sharan, S. Zaccaria, M. Zehavi, and R. Zeira. Copy-Number Evolution Problems: Complexity and Algorithms. In M. Frith and C. N. Storm Pedersen, editors, Algorithms in Bioinformatics, pages 137–149, Cham, 2016. Springer International Publishing. ISBN 978-3-319-43681-4. doi:10.1007/978-3-319-43681-4_11. URL http://link.springer.com/10.1007/978-3-642-33122-0. → pages 78
[55] A. Fabregat, K. Sidiropoulos, P. Garapati, M. Gillespie, K. Hausmann, R. Haw, B. Jassal, S. Jupe, F. Korninger, S. McKay, L. Matthews, B. May, M. Milacic, K. Rothfels, V. Shamovsky, M. Webber, J. Weiser, M. Williams, G. Wu, L. Stein, H. Hermjakob, and P. D’Eustachio. The Reactome pathway Knowledgebase. Nucleic acids research, 44(D1):D481–7, jan 2016. ISSN 1362-4962. doi:10.1093/nar/gkv1351. URL http://www.ncbi.nlm.nih.gov/pubmed/24243840. → pages 24
[56] D. A. Fennell, E. Kirkpatrick, K. Cozens, M. Nye, J. Lester, G. Hanna, N. Steele, P. Szlosarek, S. Danson, J. Lord, C. Ottensmeier, D. Barnes, S. Hill, M. Kalevras, T. Maishman, and G. Griffiths. CONFIRM: a double-blind, placebo-controlled phase III clinical trial investigating the effect of nivolumab in patients with relapsed mesothelioma: study protocol for a randomised controlled trial. Trials, 19(1):233, apr 2018. ISSN 1745-6215. doi:10.1186/s13063-018-2602-y. URL http://www.ncbi.nlm.nih.gov/pubmed/29669604. → pages 51
[57] A. Fernandez-Medarde and E. Santos. Ras in cancer and developmental diseases. Genes & cancer, 2(3):344–58, mar 2011. ISSN 1947-6027. doi:10.1177/1947601911411084. URL http://www.ncbi.nlm.nih.gov/pubmed/21779504. → pages 84
[58] S. A. Forbes, D. Beare, H. Boutselakis, S. Bamford, N. Bindal, J. Tate, C. G. Cole, S. Ward, E. Dawson, L. Ponting, R. Stefancsik, B. Harsha, C. YinKok, M. Jia, H. Jubb, Z. Sondka, S. Thompson, T. De, and P. J. Campbell. COSMIC: Somatic cancer genetics at high-resolution. Nucleic Acids Research, 45(D1):D777–D783, 2017. ISSN 13624962. doi:10.1093/nar/gkw1121. → pages 35, 52
[59] P. A. Futreal, L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman, and M. R. Stratton. A census of human cancer genes. Nature reviews. Cancer, 4(3):177–83, mar 2004. ISSN 1474-175X. doi:10.1038/nrc1299. → pages 16, 34, 35
[60] G. Germano, S. Lamba, G. Rospo, L. Barault, A. Magrì, F. Maione, M. Russo, G. Crisafulli, A. Bartolini, G. Lerda, G. Siravegna, B. Mussolin, R. Frapolli, M. Montone, F. Morano, F. de Braud, N. Amirouchene-Angelozzi, S. Marsoni, M. D’Incalci, A. Orlandi, E. Giraudo, A. Sartore-Bianchi, S. Siena, F. Pietrantonio, F. Di Nicolantonio, and A. Bardelli. Inactivation of DNA repair triggers neoantigen generation and impairs tumour growth. Nature, 2017. ISSN 0028-0836. doi:10.1038/nature24673. URL https://www.ncbi.nlm.nih.gov/pubmed/29186113. → pages 60
[61] E. E. Gill, L. S. Chan, G. L. Winsor, N. Dobson, R. Lo, S. J. Ho Sui, B. K. Dhillon, P. K. Taylor, R. Shrestha, C. Spencer, R. E. W. Hancock, P. J. Unrau, and F. S. L. Brinkman. High-throughput detection of RNA processing in bacteria. BMC genomics, 19(1):223, 2018. ISSN 1471-2164. doi:10.1186/s12864-018-4538-8. URL http://www.ncbi.nlm.nih.gov/pubmed/29587634. → pages 9
[62] E. Goncalves, A. Fragoulis, L. Garcia-Alonso, T. Cramer, J. Saez-Rodriguez, and P. Beltrao. Widespread Post-transcriptional Attenuation of Genomic Copy-Number Variation in Cancer. Cell Systems, 0(0):1–13, 2017. ISSN 24054712. doi:10.1016/j.cels.2017.08.013. URL https://www.ncbi.nlm.nih.gov/pubmed/29032074. → pages 57, 58
[63] A. Gonzalez-Perez and N. Lopez-Bigas. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. American Journal of Human Genetics, 88(4):440–449, 2011. ISSN 00029297. doi:10.1016/j.ajhg.2011.03.004. URL https://www.ncbi.nlm.nih.gov/pubmed/21457909. → pages 3
[64] A. Gonzalez-Perez and N. Lopez-Bigas. Functional impact bias reveals cancer drivers. Nucleic Acids Research, 40(21):1–10, 2012. ISSN 03051048. doi:10.1093/nar/gks743. URL https://www.ncbi.nlm.nih.gov/pubmed/22904074. → pages 3
[65] A. Gonzalez-Perez, J. Deu-Pons, and N. Lopez-Bigas. Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation. Genome medicine, 4(11):89, 2012. ISSN 1756-994X. doi:10.1186/gm390. URL https://www.ncbi.nlm.nih.gov/pubmed/23181723. → pages 3
[66] C. S. Grasso, Y.-M. Wu, D. R. Robinson, X. Cao, S. M. Dhanasekaran, A. P. Khan, M. J. Quist, X. Jing, R. J. Lonigro, J. C. Brenner, I. A. Asangani, B. Ateeq, S. Y. Chun, J. Siddiqui, L. Sam, M. Anstett, R. Mehra, J. R. Prensner, N. Palanisamy, G. A. Ryslik, F. Vandin, B. J. Raphael, L. P. Kunju, D. R. Rhodes, K. J. Pienta, A. M. Chinnaiyan, and S. A. Tomlins. The mutational landscape of lethal castration-resistant prostate cancer. Nature, 487(7406):239–43, jul 2012. ISSN 1476-4687. doi:10.1038/nature11125. URL https://www.ncbi.nlm.nih.gov/pubmed/22722839. → pages 24
[67] M. Greaves and C. C. Maley. Clonal evolution in cancer. Nature, 481(7381):306–13, Jan. 2012. ISSN 1476-4687. doi:10.1038/nature10762. URL https://www.ncbi.nlm.nih.gov/pubmed/22258609. → pages 1, 3
[68] C. Greenman, R. Wooster, P. A. Futreal, M. R. Stratton, and D. F. Easton. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics, 173(4):2187–98, Aug. 2006. ISSN 0016-6731. doi:10.1534/genetics.105.044677. URL https://www.ncbi.nlm.nih.gov/pubmed/16783027. → pages 2
[69] C. Greenman, P. Stephens, R. Smith, G. L. Dalgliesh, C. Hunter, et al. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–8, Mar. 2007. ISSN 1476-4687. doi:10.1038/nature05610. URL https://www.ncbi.nlm.nih.gov/pubmed/17344846. → pages 1, 10
[70] M. Griffith, O. L. Griffith, A. C. Coffman, J. V. Weible, J. F. McMichael, N. C. Spies, J. Koval, I. Das, M. B. Callaway, J. M. Eldred, C. A. Miller, J. Subramanian, R. Govindan, R. D. Kumar, R. Bose, L. Ding, J. R. Walker, D. E. Larson, D. J. Dooling, S. M. Smith, T. J. Ley, E. R. Mardis, and R. K. Wilson. DGIdb: mining the druggable genome. Nature methods, 10(12):1209–10, 2013. ISSN 1548-7105. doi:10.1038/nmeth.2689. → pages 35
[71] A. Gupta, M. M. Hossain, N. Miller, M. Kerin, G. Callagy, and S. Gupta. NCOA3 coactivator is a transcriptional target of XBP1 and regulates PERK-eIF2α-ATF4 signalling in breast cancer. Oncogene, 35(October 2015):1–12, apr 2016. ISSN 1476-5594. doi:10.1038/onc.2016.121. URL http://www.ncbi.nlm.nih.gov/pubmed/27109102. → pages 40
[72] D. Hanahan and R. A. Weinberg. Hallmarks of cancer: the next generation. Cell, 144(5):646–74, mar 2011. ISSN 1097-4172. doi:10.1016/j.cell.2011.02.013. URL http://www.ncbi.nlm.nih.gov/pubmed/21376230. → pages 1
[73] E. Hodzic, R. Shrestha, K. Zhu, K. Cheng, C. C. Collins, and S. C. Sahinalp. Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks. bioRxiv, 2018. doi:10.1101/369850. URL https://doi.org/10.1101/369850. → pages
[74] J. Hopcroft and D. Sheldon. Manipulation-resistant reputations using hitting time. In Algorithms and Models for the Web-Graph, pages 68–81. Springer, 2007. → pages 22
[75] F. Hormozdiari, C. Alkan, E. E. Eichler, and S. C. Sahinalp. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome research, 19(7):1270–1278, July 2009. → pages 13
[76] B. H. Hristov and M. Singh. Network-based coverage of mutational profiles reveals cancer genes. Cell Systems, 5(3):221–229.e4, 2017. ISSN 16113349. doi:10.1016/j.cels.2017.09.003. URL http://arxiv.org/abs/1704.08544. → pages 76, 79, 80, 87
[77] X. Hua, H. Xu, Y. Yang, J. Zhu, P. Liu, and Y. Lu. DrGaP: A powerful tool for identifying driver genes and pathways in cancer sequencing studies. American Journal of Human Genetics, 93(3):439–451, 2013. ISSN 00029297. doi:10.1016/j.ajhg.2013.07.003. URL https://www.ncbi.nlm.nih.gov/pubmed/23954162. → pages 2
[78] C. S. Hughes, S. Foehr, D. A. Garfield, E. E. Furlong, L. M. Steinmetz, and J. Krijgsveld. Ultrasensitive proteome analysis using paramagnetic bead technology. Molecular Systems Biology, 10(10):757–757, 2014. ISSN 1744-4292. doi:10.15252/msb.20145625. URL http://www.ncbi.nlm.nih.gov/pubmed/25358341. → pages 65
[79] C. S. Hughes, M. K. McConechy, D. R. Cochrane, T. Nazeran, A. N. Karnezis, D. G. Huntsman, and G. B. Morin. Quantitative Profiling of Single Formalin Fixed Tumour Sections: proteomics for translational research. Scientific Reports, 6(1):34949, 2016. ISSN 2045-2322. doi:10.1038/srep34949. URL http://www.ncbi.nlm.nih.gov/pubmed/27713570. → pages 65, 66
[80] F. Iorio, T. A. Knijnenburg, D. J. Vis, G. R. Bignell, M. P. Menden, M. Schubert, N. Aben, E. Goncalves, S. Barthorpe, H. Lightfoot, T. Cokelaer, P. Greninger, E. van Dyk, H. Chang, H. de Silva, H. Heyn, X. Deng, R. K. Egan, Q. Liu, T. Mironenko, X. Mitropoulos, L. Richardson, J. Wang, T. Zhang, S. Moran, S. Sayols, M. Soleimani, D. Tamborero, N. Lopez-Bigas, P. Ross-Macdonald, M. Esteller, N. S. Gray, D. A. Haber, M. R. Stratton, C. H. Benes, L. F. A. Wessels, J. Saez-Rodriguez, U. McDermott, and M. J. Garnett. A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 166(3):740–54, jul 2016. ISSN 1097-4172. doi:10.1016/j.cell.2016.06.017. URL http://www.ncbi.nlm.nih.gov/pubmed/27397505. → pages 41, 44
[81] I. H. Ismail, R. Davidson, J.-P. Gagne, Z. Z. Xu, G. G. Poirier, and M. J. Hendzel. Germline mutations in BAP1 impair its function in DNA double-strand break repair. Cancer research, 74(16):4282–94, aug 2014. ISSN 1538-7445. doi:10.1158/0008-5472.CAN-13-3109. URL http://www.ncbi.nlm.nih.gov/pubmed/24894717. → pages 60
[82] P. F. Johnson. Molecular stop signs: regulation of cell-cycle arrest by C/EBP transcription factors. Journal of cell science, 118(Pt 12):2545–55, jun 2005. ISSN 0021-9533. doi:10.1242/jcs.02459. URL http://www.ncbi.nlm.nih.gov/pubmed/15944395. → pages 84
[83] N. M. Joseph, Y.-y. Chen, A. Nasr, I. Yeh, E. Talevich, C. Onodera, B. C. Bastian, J. T. Rabban, K. Garg, C. Zaloudek, and D. A. Solomon. Genomic profiling of malignant peritoneal mesothelioma reveals recurrent alterations in epigenetic regulatory genes BAP1, SETD2, and DDX3X. Modern pathology: an official journal of the United States and Canadian Academy of Pathology, Inc, 30(2):246–254, 2017. ISSN 1530-0285. doi:10.1038/modpathol.2016.188. URL http://www.ncbi.nlm.nih.gov/pubmed/27813512. → pages 51
[84] C. Kadoch and G. R. Crabtree. Mammalian SWI/SNF chromatin remodeling complexes and cancer: Mechanistic insights gained from human genomics. Science Advances, 1(5):e1500447–e1500447, 2015. ISSN 2375-2548. doi:10.1126/sciadv.1500447. URL http://www.ncbi.nlm.nih.gov/pubmed/26601204. → pages 60
[85] S. Kato, B. N. Tomson, T. P. H. Buys, S. K. Elkin, J. L. Carter, and R. Kurzrock. Genomic Landscape of Malignant Mesotheliomas. Molecular Cancer Therapeutics, 15(10):2498–2507, 2016. ISSN 1535-7163. doi:10.1158/1535-7163.MCT-16-0229. URL https://www.ncbi.nlm.nih.gov/pubmed/27507853. → pages 51, 53
[86] E. Khurana, Y. Fu, V. Colonna, X. J. Mu, H. M. Kang, T. Lappalainen, A. Sboner, L. Lochovsky, J. Chen, A. Harmanci, J. Das, A. Abyzov, S. Balasubramanian, K. Beal, D. Chakravarty, D. Challis, Y. Chen, D. Clarke, L. Clarke, F. Cunningham, U. S. Evani, P. Flicek, R. Fragoza, E. Garrison, R. Gibbs, Z. H. Gumus, J. Herrero, N. Kitabayashi, Y. Kong, K. Lage, V. Liluashvili, S. M. Lipkin, D. G. MacArthur, G. Marth, D. Muzny, T. H. Pers, G. R. S. Ritchie, J. A. Rosenfeld, C. Sisu, X. Wei, M. Wilson, Y. Xue, F. Yu, E. T. Dermitzakis, H. Yu, M. A. Rubin, C. Tyler-Smith, and M. Gerstein. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science (New York, N.Y.), 342(6154):1235587, 2013. ISSN 1095-9203. doi:10.1126/science.1235587. URL http://www.ncbi.nlm.nih.gov/pubmed/24092746. → pages 3
[87] Y.-A. Kim, S. Wuchty, and T. M. Przytycka. Identifying causal genes and dysregulated pathways in complex diseases. PLoS computational biology, 7(3):e1001095, Mar. 2011. ISSN 1553-7358. doi:10.1371/journal.pcbi.1001095. → pages 5
[88] Y.-A. Kim, R. Salari, S. Wuchty, and T. M. Przytycka. Module cover - a new approach to genotype-phenotype studies. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pages 135–46, 2013. ISSN 2335-6936. URL http://www.ncbi.nlm.nih.gov/pubmed/23424119. → pages 76
[89] Y.-A. Kim, D.-Y. Cho, P. Dao, and T. M. Przytycka. MEMCover: integrated analysis of mutual exclusivity and functional network reveals dysregulated pathways across multiple cancer types. Bioinformatics (Oxford, England), 31(12):i284–92, jun 2015. ISSN 1367-4811. doi:10.1093/bioinformatics/btv247. URL http://www.ncbi.nlm.nih.gov/pubmed/26072494. → pages 76, 88
[90] J. C. King, J. Xu, J. Wongvipat, H. Hieronymus, B. S. Carver, D. H. Leung, B. S. Taylor, C. Sander, R. D. Cardiff, S. S. Couto, W. L. Gerald, and C. L. Sawyers. Cooperativity of TMPRSS2-ERG with PI3-kinase pathway activation in prostate oncogenesis. Nature Genetics, 41(5):524–526, 2009. ISSN 1061-4036. doi:10.1038/ng.371. URL https://www.ncbi.nlm.nih.gov/pubmed/19396167. → pages 75
[91] S. Kohler, S. Bauer, D. Horn, and P. N. Robinson. Walking the Interactome for Prioritization of Candidate Disease Genes. American Journal of Human Genetics, 82(4):949–958, 2008. ISSN 00029297. doi:10.1016/j.ajhg.2008.02.013. URL http://www.cell.com/AJHG/abstract/S0002-9297(08)00172-9. → pages 6
[92] R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML ’02, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1-55860-873-7. URL http://dl.acm.org/citation.cfm?id=645531.655996. → pages 6
[93] K. J. Kron, A. Murison, S. Zhou, V. Huang, T. N. Yamaguchi, Y.-J. Shiah, M. Fraser, T. van der Kwast, P. C. Boutros, R. G. Bristow, and M. Lupien. TMPRSS2-ERG fusion co-opts master transcription factors and activates NOTCH signaling in primary prostate cancer. Nature Genetics, 49(9):1336–1345, 2017. ISSN 1061-4036. doi:10.1038/ng.3930. URL https://www.ncbi.nlm.nih.gov/pubmed/28783165. → pages 75
[94] A. Lan, I. Y. Smoly, G. Rapaport, S. Lindquist, E. Fraenkel, and E. Yeger-Lotem. ResponseNet: Revealing signaling and regulatory networks linking genetic and transcriptomic screening data. Nucleic Acids Research, 39(SUPPL. 2):424–429, 2011. ISSN 03051048. doi:10.1093/nar/gkr359. → pages 6
[95] S. Landreville, O. A. Agapova, K. A. Matatall, Z. T. Kneass, M. D. Onken, R. S. Lee, A. M. Bowcock, and J. W. Harbour. Histone deacetylase inhibitors induce growth arrest and differentiation in uveal melanoma. Clinical Cancer Research, 18(2):408–416, 2012. ISSN 10780432. doi:10.1158/1078-0432.CCR-11-0946. URL https://www.ncbi.nlm.nih.gov/pubmed/22038994. → pages 60
[96] M. S. Lawrence, P. Stojanov, P. Polak, G. V. Kryukov, K. Cibulskis, A. Sivachenko, S. L. Carter, C. Stewart, C. H. Mermel, S. A. Roberts, A. Kiezun, P. S. Hammerman, A. McKenna, Y. Drier, L. Zou, A. H. Ramos, T. J. Pugh, N. Stransky, E. Helman, J. Kim, C. Sougnez, L. Ambrogio, E. Nickerson, E. Shefler, M. L. Cortes, D. Auclair, G. Saksena, D. Voet, M. Noble, D. DiCara, P. Lin, L. Lichtenstein, D. I. Heiman, T. Fennell, M. Imielinski, B. Hernandez, E. Hodis, S. Baca, A. M. Dulak, J. Lohr, D.-A. Landau, C. J. Wu, J. Melendez-Zajgla, A. Hidalgo-Miranda, A. Koren, S. A. McCarroll, J. Mora, R. S. Lee, B. Crompton, R. Onofrio, M. Parkin, W. Winckler, K. Ardlie, S. B. Gabriel, C. W. M. Roberts, J. A. Biegel, K. Stegmaier, A. J. Bass, L. A. Garraway, M. Meyerson, T. R. Golub, D. A. Gordenin, S. Sunyaev, E. S. Lander, and G. Getz. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499(7457):214–8, July 2013. ISSN 1476-4687. doi:10.1038/nature12213. URL http://www.ncbi.nlm.nih.gov/pubmed/23770567. → pages 2
[97] D. T. Le, J. N. Uram, H. Wang, B. R. Bartlett, H. Kemberling, A. D. Eyring, A. D. Skora, B. S. Luber, N. S. Azad, D. Laheru, B. Biedrzycki, R. C. Donehower, A. Zaheer, G. A. Fisher, T. S. Crocenzi, J. J. Lee, S. M. Duffy, R. M. Goldberg, A. de la Chapelle, M. Koshiji, F. Bhaijee, T. Huebner, R. H. Hruban, L. D. Wood, N. Cuka, D. M. Pardoll, N. Papadopoulos, K. W. Kinzler, S. Zhou, T. C. Cornish, J. M. Taube, R. A. Anders, J. R. Eshleman, B. Vogelstein, and L. A. Diaz. PD-1 Blockade in Tumors with Mismatch-Repair Deficiency. New England Journal of Medicine, 372(26):2509–2520, 2015. ISSN 0028-4793. doi:10.1056/NEJMoa1500596. URL https://www.ncbi.nlm.nih.gov/pubmed/26028255. → pages 60
[98] D. T. Le, J. N. Durham, K. N. Smith, H. Wang, B. R. Bartlett, L. K. Aulakh, S. Lu, H. Kemberling, C. Wilt, B. S. Luber, F. Wong, N. S. Azad, A. A. Rucki, D. Laheru, R. Donehower, A. Zaheer, G. A. Fisher, T. S. Crocenzi, J. J. Lee, T. F. Greten, A. G. Duffy, K. K. Ciombor, A. D. Eyring, B. H. Lam, A. Joe, S. P. Kang, M. Holdhoff, L. Danilova, L. Cope, C. Meyer, S. Zhou, R. M. Goldberg, D. K. Armstrong, K. M. Bever, A. N. Fader, J. Taube, F. Housseau, D. Spetzler, N. Xiao, D. M. Pardoll, N. Papadopoulos, K. W. Kinzler, J. R. Eshleman, B. Vogelstein, R. A. Anders, and L. A. Diaz. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade. Science (New York, N.Y.), 357(6349):409–413, 2017. ISSN 1095-9203. doi:10.1126/science.aan6733. URL http://www.ncbi.nlm.nih.gov/pubmed/28596308. → pages 60
[99] N. Leblay, F. Lepretre, N. Le Stang, A. Gautier-Stein, L. Villeneuve, S. Isaac, D. Maillet, F. Galateau-Salle, C. Villenet, S. Sebda, A. Goracci, G. Byrnes, J. D. McKay, M. Figeac, O. Glehen, F. N. Gilly, M. Foll, L. Fernandez-Cuesta, and M. Brevet. BAP1 Is Altered by Copy Number Loss, Mutation, and/or Loss of Protein Expression in More Than 70% of Malignant Peritoneal Mesotheliomas. Journal of Thoracic Oncology, 12(4):724–733, 2017. ISSN 15561380. doi:10.1016/j.jtho.2016.12.019. URL https://www.ncbi.nlm.nih.gov/pubmed/28034829. → pages 51
[100] B. D. Lehmann, J. A. Bauer, X. Chen, M. E. Sanders, A. B. Chakravarthy, Y. Shyr, and J. A. Pietenpol. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. The Journal of clinical investigation, 121(7):2750–67, jul 2011. ISSN 1558-8238. doi:10.1172/JCI45014. → pages 41
[101] M. D. M. Leiserson, D. Blokh, R. Sharan, and B. J. Raphael. Simultaneous identification of multiple driver pathways in cancer. PLoS computational biology, 9(5):e1003054, May 2013. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003054. → pages 4
[102] M. D. M. Leiserson, F. Vandin, H.-T. Wu, J. R. Dobson, J. V. Eldridge, J. L. Thomas, A. Papoutsaki, Y. Kim, B. Niu, M. McLellan, M. S. Lawrence, A. Gonzalez-Perez, D. Tamborero, Y. Cheng, G. A. Ryslik, N. Lopez-Bigas, G. Getz, L. Ding, and B. J. Raphael. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nature Genetics, 47(2):106–114, 2014. ISSN 1061-4036. doi:10.1038/ng.3168. URL http://www.ncbi.nlm.nih.gov/pubmed/25501392. → pages 76
[103] C. K.-S. Leung. Anti-monotone Constraints, pages 98–98. Springer US, Boston, MA, 2009. ISBN 978-0-387-39940-9. doi:10.1007/978-0-387-39940-9_5046. URL https://doi.org/10.1007/978-0-387-39940-9_5046. → pages 80
[104] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008. → pages 12
[105] W. Li, J. Cooper, L. Zhou, C. Yang, H. Erdjument-Bromage, D. Zagzag, M. Snuderl, M. Ladanyi, C. O. Hanemann, P. Zhou, M. A. Karajannis, and F. G. Giancotti. Merlin/NF2 loss-driven tumorigenesis linked to CRL4(DCAF1)-mediated inhibition of the hippo pathway kinases Lats1 and 2 in the nucleus. Cancer cell, 26(1):48–60, jul 2014. ISSN 1878-3686. doi:10.1016/j.ccr.2014.05.001. URL https://www.ncbi.nlm.nih.gov/pubmed/25026211. → pages 55
[106] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007. ISSN 15322882. doi:10.1002/asi.20591. → pages 7, 12
[107] F. Lin, M. C. De Gooijer, E. M. Roig, L. C. M. Buil, S. M. Christner, J. H. Beumer, T. Würdinger, J. H. Beijnen, and O. Van Tellingen. ABCB1, ABCG2, and PTEN determine the response of glioblastoma to temozolomide and ABT-888 therapy. Clinical Cancer Research, 20(10):2703–2713, 2014. ISSN 15573265. doi:10.1158/1078-0432.CCR-14-0084. → pages 38
[108] A. A. Loboda, M. N. Artyomov, and A. A. Sergushichev. Solving Generalized Maximum-Weight Connected Subgraph Problem for Network Enrichment Analysis. In M. Frith and C. N. Storm Pedersen, editors, Algorithms in Bioinformatics, pages 210–221, Cham, 2016. Springer International Publishing. ISBN 978-3-319-43681-4. doi:10.1007/978-3-319-43681-4_17. URL http://link.springer.com/10.1007/978-3-642-33122-0. → pages 76
[109] I. S. U. Luk, R. Shrestha, H. Xue, Y. Wang, F. Zhang, D. Lin, A. Haegert, R. Wu, X. Dong, C. C. Collins, A. Zoubeidi, M. E. Gleave, P. W. Gout, and Y. Wang. BIRC6 Targeting as Potential Therapy for Advanced, Enzalutamide-Resistant Prostate Cancer. Clinical cancer research, 23(6):1542–1551, mar 2017. ISSN 1078-0432. doi:10.1158/1078-0432.CCR-16-0718. URL http://www.ncbi.nlm.nih.gov/pubmed/27663589. → pages 9
[110] M. Maio, A. Scherpereel, L. Calabro, J. Aerts, S. C. Perez, A. Bearz, K. Nackaerts, D. A. Fennell, D. Kowalski, A. S. Tsao, P. Taylor, F. Grosso, S. J. Antonia, A. K. Nowak, M. Taboada, M. Puglisi, P. K. Stockman, and H. L. Kindler. Tremelimumab as second-line or third-line treatment in relapsed malignant mesothelioma (DETERMINE): a multicentre, international, randomised, double-blind, placebo-controlled phase 2b trial. The Lancet Oncology, pages 1–13, 2017. ISSN 14702045. doi:10.1016/S1470-2045(17)30446-1. URL https://www.ncbi.nlm.nih.gov/pubmed/28729154. → pages 51, 61
[111] J. Marquart, E. Y. Chen, and V. Prasad. Estimation of The Percentage of US Patients With Cancer Who Benefit From Genome-Driven Oncology. JAMA Oncology, 97239:1–7, apr 2018. ISSN 2374-2437. doi:10.1001/jamaoncol.2018.1660. URL http://dx.doi.org/10.1001/jamaoncol.2018.1660. → pages 43
[112] D. L. Masica and R. Karchin. Correlation of somatic mutation and expression identifies genes important in human glioblastoma progression and survival. Cancer research, 71(13):4550–61, July 2011. ISSN 1538-7445. doi:10.1158/0008-5472.CAN-11-0180. → pages 4
[113] S. Maxwell, M. R. Chance, and M. Koyuturk. Efficiently Enumerating All Connected Induced Subgraphs of a Large Molecular Network. In A.-H. Dediu, C. Martín-Vide, and B. Truthe, editors, Algorithms for Computational Biology, pages 171–182, Cham, 2014. Springer International Publishing. ISBN 978-3-319-07953-0. doi:10.1007/978-3-319-07953-0_14. URL http://link.springer.com/10.1007/978-3-319-07953-0_14. → pages 80
[114] A. McPherson, F. Hormozdiari, A. Zayed, R. Giuliany, G. Ha, M. G. F. Sun, M. Griffith, A. Heravi Moussavi, J. Senz, N. Melnyk, M. Pacheco, M. A. Marra, M. Hirst, T. O. Nielsen, S. C. Sahinalp, D. Huntsman, and S. P. Shah. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS computational biology, 7(5):e1001138, may 2011. ISSN 1553-7358. doi:10.1371/journal.pcbi.1001138. URL http://www.ncbi.nlm.nih.gov/pubmed/21625565. → pages 55, 65
[115] C. H. Mermel, S. E. Schumacher, B. Hill, M. L. Meyerson, R. Beroukhim, and G. Getz. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biology, 12(4):R41, 2011. ISSN 1465-6906. doi:10.1186/gb-2011-12-4-r41. URL https://www.ncbi.nlm.nih.gov/pubmed/21527027. → pages 3, 54
[116] D. Miao, C. A. Margolis, W. Gao, M. H. Voss, W. Li, D. J. Martini, C. Norton, D. Bosse, S. M. Wankowicz, D. Cullen, C. Horak, M. Wind-Rotolo, A. Tracy, M. Giannakis, F. S. Hodi, C. G. Drake, M. W. Ball, M. E. Allaf, A. Snyder, M. D. Hellmann, T. Ho, R. J. Motzer, S. Signoretti, W. G. Kaelin, T. K. Choueiri, and E. M. Van Allen. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science (New York, N.Y.), 5951(January):1–11, jan 2018. ISSN 1095-9203. doi:10.1126/science.aan5951. URL http://www.ncbi.nlm.nih.gov/pubmed/29301960. → pages 60
[117] C. A. Miller, S. H. Settle, E. P. Sulman, K. D. Aldape, and A. Milosavljevic. Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors. BMC medical genomics, 4(1):34, 2011. ISSN 1755-8794. doi:10.1186/1755-8794-4-34. URL http://www.biomedcentral.com/1755-8794/4/34. → pages 4, 76
[118] M. Mina, F. Raynaud, D. Tavernari, E. Battistello, S. Sungalee, S. Saghafinia, T. Laessle, F. Sanchez-Vega, N. Schultz, E. Oricchio, and G. Ciriello. Conditional Selection of Genomic Alterations Dictates Cancer Evolution and Oncogenic Dependencies. Cancer cell, 29(0):723–736, jul 2017. ISSN 1878-3686. doi:10.1016/j.ccell.2017.06.010. → pages 76
[119] G. Minuti and L. Landi. MET deregulation in breast cancer. Annals of translational medicine, 3(13):181, aug 2015. ISSN 2305-5839. doi:10.3978/j.issn.2305-5839.2015.06.22. URL http://www.ncbi.nlm.nih.gov/pubmed/26366398. → pages 84
[120] L. Montanaro, D. Trere, and M. Derenzini. Nucleolus, ribosomes, and cancer. The American journal of pathology, 173(2):301–10, aug 2008. ISSN 1525-2191. doi:10.2353/ajpath.2008.070752. URL http://www.ncbi.nlm.nih.gov/pubmed/18583314. → pages 85
[121] K. W. Mouw, M. S. Goldberg, P. A. Konstantinopoulos, and A. D. D’Andrea. DNA Damage and Repair Biomarkers of Immunotherapy Response. Cancer discovery, 7(7):675–693, 2017. ISSN 2159-8290. doi:10.1158/2159-8290.CD-17-0226. URL http://www.ncbi.nlm.nih.gov/pubmed/28630051. → pages 60
[122] A. Murat, E. Migliavacca, T. Gorlia, W. L. Lambiv, T. Shay, M.-F. Hamou, N. de Tribolet, L. Regli, W. Wick, M. C. M. Kouwenhoven, J. A. Hainfellner, F. L. Heppner, P.-Y. Dietrich, Y. Zimmer, J. G. Cairncross, R.-C. Janzer, E. Domany, M. Delorenzi, R. Stupp, and M. E. Hegi. Stem cell-related “self-renewal” signature and high epidermal growth factor receptor expression associated with resistance to concomitant chemoradiotherapy in glioblastoma. Journal of clinical oncology, 26(18):3015–24, jun 2008. ISSN 1527-7755. doi:10.1200/JCO.2007.15.7164. URL http://www.ncbi.nlm.nih.gov/pubmed/18565887. → pages 24
[123] S. Muthukrishnan and S. C. Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. In Proceedings of the Thirty-second Annual ACM Symposium on Theory of Computing, pages 416–424, New York, 2000. ACM. ISBN 1581131844. doi:10.1145/335305.335353. → pages 78
[124] A. M. Newman, C. L. Liu, M. R. Green, A. J. Gentles, W. Feng, Y. Xu, C. D. Hoang, M. Diehn, and A. A. Alizadeh. Robust enumeration of cell subsets from tissue expression profiles. Nature methods, 12(5):453–7, may 2015. ISSN 1548-7105. doi:10.1038/nmeth.3337. URL http://www.ncbi.nlm.nih.gov/pubmed/25822800. → pages 58, 68
[125] S. Ng, E. A. Collisson, A. Sokolov, T. Goldstein, A. Gonzalez-Perez, N. Lopez-Bigas, C. Benz, D. Haussler, and J. M. Stuart. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics, 28(18):640–646, 2012. ISSN 13674803. doi:10.1093/bioinformatics/bts402. → pages 4
[126] C. K. Osborne, V. Bardou, T. A. Hopp, G. C. Chamness, S. G. Hilsenbeck, S. A. W. Fuqua, J. Wong, D. C. Allred, G. M. Clark, and R. Schiff. Role of the estrogen receptor coactivator AIB1 (SRC-3) and HER-2/neu in tamoxifen resistance in breast cancer. Journal of the National Cancer Institute, 95(5):353–61, mar 2003. ISSN 0027-8874. doi:10.1017/CBO9781107415324.004. URL http://www.ncbi.nlm.nih.gov/pubmed/12618500. → pages 40
[127] D. Pan, A. Kobayashi, P. Jiang, L. Ferrari de Andrade, R. E. Tay, A. Luoma, D. Tsoucas, X. Qiu, K. Lim, P. Rao, H. W. Long, G.-C. Yuan, J. Doench, M. Brown, S. Liu, and K. W. Wucherpfennig. A major chromatin regulator determines resistance of tumor cells to T cell-mediated killing. Science (New York, N.Y.), 1710(January):1–12, jan 2018. ISSN 1095-9203. doi:10.1126/science.aao1710. URL http://www.ncbi.nlm.nih.gov/pubmed/29301958. → pages 60
[128] D. W. Parsons, S. Jones, X. Zhang, J. C.-H. Lin, R. J. Leary, P. Angenendt, et al. An integrated genomic analysis of human glioblastoma multiforme. Science (New York, N.Y.), 321(5897):1807–12, Sept. 2008. ISSN 1095-9203. doi:10.1126/science.1164382. URL https://www.ncbi.nlm.nih.gov/pubmed/18772396. → pages 3, 35
[129] A.-M. Patch, E. L. Christie, D. Etemadmoghadam, D. W. Garsed, J. George, S. Fereday, K. Nones, P. Cowin, K. Alsop, P. J. Bailey, K. S. Kassahn, F. Newell, M. C. J. Quinn, S. Kazakoff, K. Quek, C. Wilhelm-Benartzi, E. Curry, H. S. Leong, A. Hamilton, L. Mileshkin, G. Au-Yeung, C. Kennedy, J. Hung, Y.-E. Chiew, P. Harnett, M. Friedlander, M. Quinn, J. Pyman, S. Cordner, P. O’Brien, J. Leditschke, G. Young, K. Strachan, P. Waring, W. Azar, C. Mitchell, N. Traficante, J. Hendley, H. Thorne, M. Shackleton, D. K. Miller, G. M. Arnau, R. W. Tothill, T. P. Holloway, T. Semple, I. Harliwong, C. Nourse, E. Nourbakhsh, S. Manning, S. Idrisoglu, T. J. C. Bruxner, A. N. Christ, B. Poudel, O. Holmes, M. Anderson, C. Leonard, A. Lonie, N. Hall, S. Wood, D. F. Taylor, Q. Xu, J. L. Fink, N. Waddell, R. Drapkin, E. Stronach, H. Gabra, R. Brown, A. Jewell, S. H. Nagaraj, E. Markham, P. J. Wilson, J. Ellul, O. McNally, M. A. Doyle, R. Vedururu, C. Stewart, E. Lengyel, J. V. Pearson, N. Waddell, A. DeFazio, S. M. Grimmond, and D. D. L. Bowtell. Whole-genome characterization of chemoresistant ovarian cancer. Nature, 521(7553):489–494, 2015. ISSN 0028-0836. doi:10.1038/nature14410. → pages 36
[130] E. O. Paull, D. E. Carlin, M. Niepel, P. K. Sorger, D. Haussler, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics (Oxford, England), pages 1–8, Sept. 2013. ISSN 1367-4811. doi:10.1093/bioinformatics/btt471. → pages 6
[131] J. Pelletier, G. Thomas, and S. Volarevic. Ribosome biogenesis in cancer: new players and therapeutic avenues. Nature reviews. Cancer, 18(1):51–63, jan 2018. ISSN 1474-1768. doi:10.1038/nrc.2017.104. URL http://www.ncbi.nlm.nih.gov/pubmed/29192214. → pages 85
[132] S. Pena-Llopis, S. Vega-Rubín-de Celis, A. Liao, N. Leng, A. Pavía-Jimenez, S. Wang, T. Yamasaki, L. Zhrebker, S. Sivanand, P. Spence, L. Kinch, T. Hambuch, S. Jain, Y. Lotan, V. Margulis, A. I. Sagalowsky, P. B. Summerour, W. Kabbani, S. W. W. Wong, N. Grishin, M. Laurent, X.-J. Xie, C. D. Haudenschild, M. T. Ross, D. R. Bentley, P. Kapur, and J. Brugarolas. BAP1 loss defines a new class of renal cell carcinoma. Nature Genetics, 44(7):751–759, 2012. ISSN 1061-4036. doi:10.1038/ng.2323. URL https://www.ncbi.nlm.nih.gov/pubmed/22683710. → pages 60
[133] C. M. Perou, T. Sørlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lønning, A. L. Børresen-Dale, P. O. Brown, and D. Botstein. Molecular portraits of human breast tumours. Nature, 406(6797):747–52, aug 2000. ISSN 0028-0836. doi:10.1038/35021093. → pages 88
[134] T. S. K. Prasad, K. Kandasamy, and A. Pandey. Human Protein Reference Database and Human Proteinpedia as discovery tools for systems biology. Methods in molecular biology (Clifton, N.J.), 577:67–79, jan 2009. ISSN 1940-6029. doi:10.1007/978-1-60761-232-2_6. URL http://www.ncbi.nlm.nih.gov/pubmed/19718509. → pages 24
[135] V. Prasad. Perspective: The precision-oncology illusion. Nature, 537(7619):S63–S63, Sep 2016. ISSN 0028-0836. URL http://dx.doi.org/10.1038/537S63a. Outlook. → pages 43
[136] Y. Qi, Y. Suhail, Y.-y. Lin, J. D. Boeke, and J. S. Bader. Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome research, 18(12):1991–2004, Dec. 2008. ISSN 1088-9051. doi:10.1101/gr.077693.108. → pages 6
[137] J. Reimand and G. D. Bader. Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Molecular systems biology, 9(637):637, 2013. ISSN 1744-4292. doi:10.1038/msb.2012.68. URL https://www.ncbi.nlm.nih.gov/pubmed/23340843. → pages 3
[138] S. Ren, G.-H. Wei, D. Liu, L. Wang, Y. Hou, S. Zhu, L. Peng, Q. Zhang, Y. Cheng, H. Su, X. Zhou, J. Zhang, F. Li, H. Zheng, Z. Zhao, C. Yin, Z. He, X. Gao, H. E. Zhau, C.-Y. Chu, J. B. Wu, C. Collins, S. V. Volik, R. Bell, J. Huang, K. Wu, D. Xu, D. Ye, Y. Yu, L. Zhu, M. Qiao, H.-M. Lee, Y. Yang, Y. Zhu, X. Shi, R. Chen, Y. Wang, W. Xu, Y. Cheng, C. Xu, X. Gao, T. Zhou, B. Yang, J. Hou, L. Liu, Z. Zhang, Y. Zhu, C. Qin, P. Shao, J. Pang, L. W. Chung, J. Xu, C.-L. Wu, W. Zhong, X. Xu, Y. Li, X. Zhang, J. Wang, H. Yang, J. Wang, H. Huang, and Y. Sun. Whole-genome and Transcriptome Sequencing of Prostate Cancer Identify New Genetic Alterations Driving Disease Progression. European Urology, 73(3):322–339, mar 2018. ISSN 03022838. doi:10.1016/j.eururo.2017.08.027. URL http://www.ncbi.nlm.nih.gov/pubmed/28927585. → pages 24
[139] B. Reva, Y. Antipin, and C. Sander. Predicting the functional impact of protein mutations: Application to cancer genomics. Nucleic Acids Research, 39(17):37–43, 2011. ISSN 03051048. doi:10.1093/nar/gkr407. URL https://www.ncbi.nlm.nih.gov/pubmed/21727090. → pages 3
[140] A. L. Richardson, Z. C. Wang, A. De Nicolo, X. Lu, M. Brown, A. Miron, X. Liao, J. D. Iglehart, D. M. Livingston, and S. Ganesan. X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell, 9(2):121–132, 2006. ISSN 15356108. doi:10.1016/j.ccr.2006.01.013. → pages 24
[141] D. S. Rickman, T. D. Soong, B. Moss, J. M. Mosquera, J. Dlabal, S. Terry, T. MacDonald, K. Bunting, F. Demichelis, A. Melnick, O. Elemento, and M. A. Rubin. Oncogene-mediated alterations in chromatin conformation. Proceedings of the National Academy of Sciences of the United States of America, 109(23):9083–9088, 2012. ISSN 0008-5472. → pages 24
[142] A. Robertson, J. Shih, C. Yau, E. Gibb, J. Oba, K. Mungall, J. Hess, V. Uzunangelov, V. Walter, L. Danilova, T. Lichtenberg, M. Kucherlapati, P. Kimes, M. Tang, A. Penson, O. Babur, R. Akbani, C. Bristow, K. Hoadley, L. Iype, M. Chang, M. Abdel-Rahman, R. Akbani, A. Ally, J. Auman, O. Babur, M. Balasundaram, S. Balu, C. Benz, R. Beroukhim, I. Birol, T. Bodenheimer, J. Bowen, R. Bowlby, C. Bristow, D. Brooks, R. Carlsen, C. Cebulla, M. Chang, A. Cherniack, L. Chin, J. Cho, E. Chuah, S. Chudamani, C. Cibulskis, K. Cibulskis, L. Cope, S. Coupland, L. Danilova, T. Defreitas, J. Demchok, L. Desjardins, N. Dhalla, B. Esmaeli, I. Felau, M. Ferguson, S. Frazer, S. Gabriel, J. Gastier-Foster, N. Gehlenborg, M. Gerken, J. Gershenwald, G. Getz, E. Gibb, K. Griewank, E. Grimm, D. Hayes, A. Hegde, D. Heiman, C. Helsel, J. Hess, K. Hoadley, S. Hobensack, R. Holt, A. Hoyle, X. Hu, C. Hutter, M. Jager, S. Jefferys, C. Jones, S. Jones, C. Kandoth, K. Kasaian, J. Kim, P. Kimes, M. Kucherlapati, R. Kucherlapati, E. Lander, M. Lawrence, A. Lazar, S. Lee, K. Leraas, T. Lichtenberg, P. Lin, J. Liu, W. Liu, L. Lolla, Y. Lu, L. Iype, Y. Ma, H. Mahadeshwar, O. Mariani, M. Marra, M. Mayo, S. Meier, S. Meng, M. Meyerson, P. Mieczkowski, G. Mills, R. Moore, L. Mose, A. Mungall, K. Mungall, B. Murray, R. Naresh, M. Noble, J. Oba, A. Pantazi, M. Parfenov, P. Park, J. Parker, A. Penson, C. Perou, T. Pihl, R. Pilarski, A. Protopopov, A. Radenbaugh, K. Rai, N. Ramirez, X. Ren, S. Reynolds, J. Roach, A. Robertson, S. Roman-Roman, J. Roszik, S. Sadeghi, G. Saksena, X. Sastre, D. Schadendorf, J. Schein, L. Schoenfield, S. Schumacher, J. Seidman, S. Seth, G. Sethi, M. Sheth, Y. Shi, C. Shields, J. Shih, I. Shmulevich, J. Simons, A. Singh, P. Sipahimalani, T. Skelly, H. Sofia, M. Soloway, X. Song, M.-H. Stern, J. Stuart, Q. Sun, H. Sun, A. Tam, D. Tan, M. Tang, J. Tang, R. Tarnuzzer, B. Taylor, N. Thiessen, V. Thorsson, K. Tse, V. Uzunangelov, U. Veluvolu, R. Verhaak, D. Voet, V. Walter, Y. Wan, Z. Wang, J. Weinstein, M. Wilkerson, M. Williams, L. Wise, S. Woodman, T. Wong, Y. Wu, L. Yang, L. Yang, C. Yau, J. Zenklusen, J. Zhang, H. Zhang, E. Zmuda, A. Cherniack, C. Benz, G. Mills, R. Verhaak, K. Griewank, I. Felau, J. Zenklusen, J. Gershenwald, L. Schoenfield, A. Lazar, M. Abdel-Rahman, S. Roman-Roman, M.-H. Stern, C. Cebulla, M. Williams, M. Jager, S. Coupland, B. Esmaeli, C. Kandoth, and S. Woodman. Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma. Cancer Cell, 32(2):204–220, 2017. ISSN 18783686. doi:10.1016/j.ccell.2017.07.003. URL https://www.ncbi.nlm.nih.gov/pubmed/28810145. → pages 60, 75
[143] R. Rosenthal, N. McGranahan, J. Herrero, B. S. Taylor, and C. Swanton. deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1):31, 2016. ISSN 1474-760X. doi:10.1186/s13059-016-0893-4. URL https://www.ncbi.nlm.nih.gov/pubmed/26899170. → pages 52, 67
[144] B. Rosner. Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics, 25(2):165–172, 1983. → pages 24, 83
[145] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fobo, G. Frishman, C. Montrone, and H. W. Mewes. CORUM: The comprehensive resource of mammalian protein complexes-2009. Nucleic Acids Research, 38(SUPPL.1):497–501, 2009. ISSN 03051048. doi:10.1093/nar/gkp914. URL http://www.ncbi.nlm.nih.gov/pubmed/19884131. → pages 57
[146] J. J. Sacco, J. Kenyani, Z. Butt, R. Carter, H. Y. Chew, L. P. Cheeseman, S. Darling, M. Denny, S. Urbe, M. J. Clague, and J. M. Coulson. Loss of the deubiquitylase BAP1 alters class I histone deacetylase expression and sensitivity of mesothelioma cells to HDAC inhibitors. Oncotarget, 6(15):13757–71, 2015. ISSN 1949-2553. doi:10.18632/oncotarget.3765. URL http://www.ncbi.nlm.nih.gov/pubmed/25970771. → pages 60
[147] F. Sanchez-Garcia, U. D. Akavia, E. Mozes, and D. Pe’er. JISTIC: identification of significant targets in cancer. BMC bioinformatics, 11:189, 2010. ISSN 1471-2105. doi:10.1186/1471-2105-11-189. URL https://www.ncbi.nlm.nih.gov/pubmed/20398270. → pages 3
[148] R. F. Schwarz, A. Trinh, B. Sipos, J. D. Brenton, N. Goldman, and F. Markowetz. Phylogenetic quantification of intra-tumour heterogeneity. PLoS computational biology, 10(4):e1003535, apr 2014. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003535. URL http://www.ncbi.nlm.nih.gov/pubmed/24743184. → pages 78
[149] H. Sharifi-Noghabi, Y. Liu, N. Erho, R. Shrestha, M. Alshalalfa, E. Davicioni, C. C. Collins, and M. Ester. Deep genomic signature for early metastasis prediction in prostate cancer. bioRxiv, 2018. doi:10.1101/276055. URL https://doi.org/10.1101/276055. → pages 9
[150] N. L. Sharma, C. E. Massie, A. Ramos-Montoya, V. Zecchini, H. E. Scott, A. D. Lamb, S. MacArthur, R. Stark, A. Y. Warren, I. G. Mills, and D. E. Neal. The Androgen Receptor Induces a Distinct Transcriptional Program in Castration-Resistant Prostate Cancer in Man. Cancer Cell, 23(1):35–47, 2013. ISSN 15356108. doi:10.1016/j.ccr.2012.11.010. → pages 24
[151] B. S. Sheffield, A. V. Tinker, Y. Shen, H. Hwang, H. H. Li-Chang, E. Pleasance, C. Ch’ng, A. Lum, J. Lorette, Y. J. McConnell, S. Sun, S. J. Jones, A. M. Gown, D. G. Huntsman, D. F. Schaeffer, A. Churg, S. Yip, J. Laskin, and M. A. Marra. Personalized oncogenomics: Clinical experience with malignant peritoneal mesothelioma using whole genome sequencing. PLoS ONE, 10(3):1–12, 2015. ISSN 19326203. doi:10.1371/journal.pone.0119689. URL https://www.ncbi.nlm.nih.gov/pubmed/25798586. → pages 51
[152] M. F. Shlesinger. Mathematical physics: first encounters. Nature, 450(7166):40–41, 2007. ISSN 0028-0836. doi:10.1038/450040a. → pages 7
[153] I. Shmulevich, E. R. Dougherty, and W. Zhang. Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics (Oxford, England), 18(10):1319–1331, 2002. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/18.10.1319. → pages 7
[154] R. Shrestha, E. Hodzic, J. Yeung, K. Wang, T. Sauerwald, P. Dao, S. Anderson, H. Beltran, M. A. Rubin, C. C. Collins, G. Haffari, and S. C. Sahinalp. HIT’nDRIVE: Multi-driver gene prioritization based on hitting time. Research in Computational Molecular Biology: 18th Annual International Conference, RECOMB 2014, Pittsburgh, PA, USA, April 2-5, 2014, Proceedings, pages 293–306, 2014. doi:10.1007/978-3-319-05269-4_23. URL http://dx.doi.org/10.1007/978-3-319-05269-4_23. → pages 7, 11
[155] R. Shrestha, E. Hodzic, T. Sauerwald, P. Dao, K. Wang, J. Yeung, S. Anderson, F. Vandin, G. Haffari, C. C. Collins, and S. C. Sahinalp. HIT’nDRIVE: patient-specific multidriver gene prioritization for precision oncology. Genome research, 27(9):1573–1588, sep 2017. ISSN 1549-5469. doi:10.1101/gr.221218.117. URL https://www.ncbi.nlm.nih.gov/pubmed/28768687. → pages 7, 11, 52, 67, 75
[156] R. Shrestha, N. Nabavi, Y.-Y. Lin, F. Mo, S. Anderson, S. Volik, H. H. Adomat, D. Lin, H. Xue, X. Dong, R. Shukin, R. H. Bell, B. McConeghy, A. Haegert, S. Brahmbhatt, E. Li, H. Z. Oo, A. Hurtado-Coll, L. Fazli, J. Zhou, Y. McConnell, A. McCart, A. Lowy, G. B. Morin, M. Daugaard, S. C. Sahinalp, F. Hach, S. Le Bihan, M. E. Gleave, Y. Wang, A. Churg, and C. C. Collins. Integrated Multi-omics Molecular Subtyping Predicts Therapeutic Vulnerability in Malignant Peritoneal Mesothelioma. bioRxiv, 2018. doi:10.1101/243477. URL https://doi.org/10.1101/243477. → pages 8, 51
[157] N.-L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider, and P. C. Ng. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic acids research, 40(Web Server issue):W452–7, 2012. ISSN 1362-4962. doi:10.1093/nar/gks539. URL https://www.ncbi.nlm.nih.gov/pubmed/22689647. → pages 3
[158] A. D. Singhi, A. M. Krasinskas, H. A. Choudry, D. L. Bartlett, J. F. Pingpank, H. J. Zeh, A. Luvison, K. Fuhrer, N. Bahary, R. R. Seethala, and S. Dacic. The prognostic significance of BAP1, NF2, and CDKN2A in malignant peritoneal mesothelioma. Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc, 29(1):14–24, 2016. ISSN 1530-0285. doi:10.1038/modpathol.2015.121. URL http://www.ncbi.nlm.nih.gov/pubmed/26493618. → pages 51
[159] T. Sjoblom, L. D. Wood, D. W. Parsons, J. Lin, T. D. Barber, D. Mandelker, R. J. Leary, J. Ptak, N. Silliman, S. Szabo, P. Buckhaults, C. Farrell, P. Meeh, S. D. Markowitz, J. Willis, D. Dawson, J. K. V. Willson, A. F. Gazdar, J. Hartigan, L. Wu, C. Liu, G. Parmigiani, B. H. Park, and K. E. Bachman. The Consensus Coding Sequences of Human Breast and Colorectal Cancers. Science, 314(October):268–274, 2006. ISSN 0036-8075, 1095-9203. doi:10.1126/science.1133427. URL https://www.ncbi.nlm.nih.gov/pubmed/16959974. → pages 2
[160] M. R. Spalinger, R. Manzini, L. Hering, J. B. Riggs, C. Gottier, S. Lang, K. Atrott, A. Fettelschoss, F. Olomski, T. M. Kundig, M. Fried, D. F. McCole, G. Rogler, and M. Scharl. PTPN2 Regulates Inflammasome Activation and Controls Onset of Intestinal Inflammation and Colon Cancer. Cell reports, 22(7):1835–1848, feb 2018. ISSN 2211-1247. doi:10.1016/j.celrep.2018.01.052. → pages 86
[161] M. R. Stratton, P. J. Campbell, and P. A. Futreal. The cancer genome. Nature, 458(7239):719–24, Apr. 2009. ISSN 1476-4687. doi:10.1038/nature07943. URL https://www.ncbi.nlm.nih.gov/pubmed/19360079. → pages 1, 10
[162] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–50, oct 2005. ISSN 0027-8424. doi:10.1073/pnas.0506580102. → pages 44, 68, 89
[163] P. H. Sugarbaker and D. Chang. Long-term regional chemotherapy for patients with epithelial malignant peritoneal mesothelioma results in improved survival. European Journal of Surgical Oncology, 43(7):1228–1235, 2017. ISSN 15322157. doi:10.1016/j.ejso.2017.01.009. URL http://dx.doi.org/10.1016/j.ejso.2017.01.009. → pages 50
[164] L. Sun, A. M. Hui, Q. Su, A. Vortmeyer, Y. Kotliarov, S. Pastorino, A. Passaniti, J. Menon, J. Walling, R. Bailey, M. Rosenblum, T. Mikkelsen, and H. A. Fine. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell, 9(4):287–300, 2006. ISSN 15356108. doi:10.1016/j.ccr.2006.03.003. → pages 24
[165] C. Suo, O. Hrydziuszko, D. Lee, S. Pramana, D. Saputra, H. Joshi, S. Calza, and Y. Pawitan. Integration of somatic mutation, expression and functional data reveals potential driver genes predictive of breast cancer survival. Bioinformatics, 31(16):2607, Mar. 2015. doi:10.1093/bioinformatics/btv164. → pages 4
[166] S. Suthram, A. Beyer, R. M. Karp, Y. Eldar, and T. Ideker. eQED: an efficient method for interpreting eQTL associations using protein networks. Molecular systems biology, 4(162):162, 2008. ISSN 1744-4292. doi:10.1038/msb.2008.4. → pages 5
[167] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452, 2014. ISSN 0305-1048. doi:10.1093/nar/gku1003. URL http://www.ncbi.nlm.nih.gov/pubmed/25352553. → pages 67
[168] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452, 2015. doi:10.1093/nar/gku1003. → pages 23
[169] B. S. Taylor, N. Schultz, H. Hieronymus, A. Gopalan, Y. Xiao, B. S. Carver, V. K. Arora, P. Kaushik, E. Cerami, B. Reva, Y. Antipin, N. Mitsiades, T. Landers, I. Dolgalev, J. E. Major, M. Wilson, N. D. Socci, A. E. Lash, A. Heguy, J. A. Eastham, H. I. Scher, V. E. Reuter, P. T. Scardino, C. Sander, C. L. Sawyers, and W. L. Gerald. Integrative genomic profiling of human prostate cancer. Cancer cell, 18(1):11–22, jul 2010. ISSN 1878-3686. doi:10.1016/j.ccr.2010.05.026. → pages 24
[170] TCGA. Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407):330–7, July 2012. ISSN 1476-4687. doi:10.1038/nature11252. URL https://www.ncbi.nlm.nih.gov/pubmed/22810696. → pages 3, 77, 82, 84
[171] N. Tebbutt, M. W. Pedersen, and T. G. Johns. Targeting the ERBB family in cancer: couples therapy. Nature Reviews Cancer, 13(9):663–673, 2013. ISSN 1474-175X. doi:10.1038/nrc3559. URL http://www.ncbi.nlm.nih.gov/pubmed/23949426. → pages 84
[172] J. R. Testa. Asbestos and Mesothelioma. Current Cancer Research. Springer International Publishing, 2017. ISBN 978-3-319-53558-6. doi:10.1007/978-3-319-53560-9. URL http://link.springer.com/10.1007/978-3-319-53560-9. → pages 50
[173] P. Tetali. Design of on-line algorithms using hitting times. SIAM J. Comput., 28(4):1232–1246, 1999. → pages 13
[174] B. Thapa, A. Salcedo, X. Lin, M. Walkiewicz, C. Murone, M. Ameratunga, K. Asadi, S. Deb, S. A. Barnett, S. Knight, P. Mitchell, D. N. Watkins, P. C. Boutros, and T. John. The Immune Microenvironment, Genome-wide Copy Number Aberrations, and Survival in Mesothelioma. Journal of Thoracic Oncology, 12(5):850–859, 2017. ISSN 15561380. doi:10.1016/j.jtho.2017.02.013. URL http://dx.doi.org/10.1016/j.jtho.2017.02.013. → pages 51
[175] The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–8, Oct. 2008. ISSN 1476-4687. doi:10.1038/nature07385. → pages 15, 22, 35, 44, 82
[176] The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609–15, June 2011. ISSN 1476-4687. doi:10.1038/nature10166. → pages 15, 22, 35, 36, 44
[177] The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, Oct. 2012. ISSN 0028-0836. doi:10.1038/nature11412. → pages 15, 22, 35, 44, 82
[178] The Cancer Genome Atlas Research Network. The Molecular Taxonomy of Primary Prostate Cancer. Cell, 163(4):1011–25, nov 2015. ISSN 1097-4172. doi:10.1016/j.cell.2015.10.025. → pages 15, 22, 35, 36, 44
[179] H. Thorvaldsdottir, J. T. Robinson, and J. P. Mesirov. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2):178–192, 2013. ISSN 14675463. doi:10.1093/bib/bbs017. URL https://www.ncbi.nlm.nih.gov/pubmed/22517427. → pages 63
[180] S. A. Tomlins, D. R. Rhodes, S. Perner, S. M. Dhanasekaran, R. Mehra, X.-W. Sun, S. Varambally, X. Cao, J. Tchinda, R. Kuefer, C. Lee, J. E. Montie, R. B. Shah, K. J. Pienta, M. A. Rubin, and A. M. Chinnaiyan. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science (New York, N.Y.), 310(5748):644–648, 2005. ISSN 0036-8075. doi:10.1126/science.1117679. → pages 36
[181] M. Torchala, P. Chelminiak, and P. A. Bates. Mean first-passage time calculations: Comparison of the deterministic Hill’s algorithm with Monte Carlo simulations. European Physical Journal B, 85(4), 2012. ISSN 14346028. doi:10.1140/epjb/e2012-20760-8. → pages 7
[182] M. Torchala, P. Chelminiak, M. Kurzynski, and P. A. Bates. RaTrav: a tool for calculating mean first-passage times on biochemical networks. BMC systems biology, 7:130, 2013. ISSN 1752-0509. doi:10.1186/1752-0509-7-130. URL http://www.ncbi.nlm.nih.gov/pubmed/24261882. → pages 7
[183] Z. Tu, L. Wang, M. N. Arbeitman, T. Chen, and F. Sun. An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics, 22(14):489–496, 2006. ISSN 13674803. doi:10.1093/bioinformatics/btl234. → pages 6
[184] G. Ugurluer, K. Chang, M. E. Gamez, A. L. Arnett, R. Jayakrishnan, R. C. Miller, and T. T. Sio. Genome-based Mutational Analysis by Next Generation Sequencing in Patients with Malignant Pleural and Peritoneal Mesothelioma. Anticancer research, 36(5):2331–8, may 2016. ISSN 1791-7530. URL http://www.ncbi.nlm.nih.gov/pubmed/27127140. → pages 51
[185] I. Ulitsky, A. Krishnamurthy, R. M. Karp, and R. Shamir. DEGAS: de novo discovery of dysregulated pathways in human diseases. PloS one, 5(10):e13367, oct 2010. ISSN 1932-6203. doi:10.1371/journal.pone.0013367. URL http://www.ncbi.nlm.nih.gov/pubmed/20976054. → pages 76, 87
[186] A. Untergasser, I. Cutcutache, T. Koressaar, J. Ye, B. C. Faircloth, M. Remm, and S. G. Rozen. Primer3–new capabilities and interfaces. Nucleic acids research, 40(15):e115, aug 2012. ISSN 1362-4962. doi:10.1093/nar/gks596. URL http://www.ncbi.nlm.nih.gov/pubmed/22730293. → pages 65
[187] E. M. Van Allen, N. Wagle, P. Stojanov, D. L. Perrin, K. Cibulskis, S. Marlow, J. Jane-Valbuena, D. C. Friedrich, G. Kryukov, S. L. Carter, A. McKenna, A. Sivachenko, M. Rosenberg, A. Kiezun, D. Voet, M. Lawrence, L. T. Lichtenstein, J. G. Gentry, F. W. Huang, J. Fostel, D. Farlow, D. Barbie, L. Gandhi, E. S. Lander, S. W. Gray, S. Joffe, P. Janne, J. Garber, L. MacConaill, N. Lindeman, B. Rollins, P. Kantoff, S. A. Fisher, S. Gabriel, G. Getz, and L. A. Garraway. Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine. Nature medicine, 20(6):682–8, jun 2014. ISSN 1546-170X. doi:10.1038/nm.3559. → pages 35
[188] E. Van Dyk, M. J. T. Reinders, and L. F. A. Wessels. A scale-space method for detecting recurrent DNA copy number changes with analytical false discovery rate control. Nucleic Acids Research, 41(9), 2013. ISSN 03051048. doi:10.1093/nar/gkt155. URL https://www.ncbi.nlm.nih.gov/pubmed/23476020. → pages 3
[189] F. Vandin, E. Upfal, and B. J. Raphael. Algorithms for detecting significantly mutated pathways in cancer. Journal ofcomputational biology : a journal of computational molecular cell biology, 18(3):507–22, Mar. 2011. ISSN1557-8666. doi:10.1089/cmb.2010.0265. → pages 6, 76
[190] F. Vandin, E. Upfal, and B. J. Raphael. De novo discovery of mutated driver pathways in cancer. Genome research, 22(2):375–85, Feb. 2012. ISSN 1549-5469. doi:10.1101/gr.120477.111. → pages 4, 76, 88
[191] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan. Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology, 6(1), 2010. ISSN 1553734X. doi:10.1371/journal.pcbi.1000641. → pages 6
[192] C. J. Vaske, S. C. Benz, J. Z. Sanborn, D. Earl, C. Szeto, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics (Oxford, England), 26(12):i237–45, June 2010. ISSN 1367-4811. doi:10.1093/bioinformatics/btq182. → pages 4
[193] R. G. W. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer cell, 17(1):98–110, Jan. 2010. ISSN 1878-3686. doi:10.1016/j.ccr.2009.12.020. → pages 35
[194] R. Visconti, R. Della Monica, and D. Grieco. Cell cycle checkpoint in cancer: a therapeutically targetable double-edged sword. Journal of experimental & clinical cancer research : CR, 35(1):153, Sept. 2016. ISSN 1756-9966. doi:10.1186/s13046-016-0433-9. → pages 86
[195] B. Vogelstein, N. Papadopoulos, V. E. Velculescu, S. Zhou, L. A. Diaz, and K. W. Kinzler. Cancer genome landscapes. Science (New York, N.Y.), 339(6127):1546–58, Mar. 2013. ISSN 1095-9203. doi:10.1126/science.1235122. URL https://www.ncbi.nlm.nih.gov/pubmed/23539594. → pages 1, 2, 10, 75
[196] V. Walter, A. B. Nobel, and F. A. Wright. DiNAMIC: A method to identify recurrent DNA copy number aberrations in tumors. Bioinformatics, 27(5):678–685, 2011. ISSN 13674803. doi:10.1093/bioinformatics/btq717. URL https://www.ncbi.nlm.nih.gov/pubmed/21183584. → pages 3
[197] K. Wang, M. Li, and H. Hakonarson. ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16):1–7, 2010. ISSN 03051048. doi:10.1093/nar/gkq603. URL https://www.ncbi.nlm.nih.gov/pubmed/20601685. → pages 3, 63
[198] K. Wang, R. Shrestha, A. W. Wyatt, A. Reddy, J. Lehar, Y. Wang, A. Lapuk, and C. C. Collins. A meta-analysis approach for characterizing pan-cancer mechanisms of drug sensitivity in cell lines. PloS one, 9(7):e103050, 2014. ISSN 1932-6203. doi:10.1371/journal.pone.0103050. URL http://www.ncbi.nlm.nih.gov/pubmed/25036042. → pages 9
[199] M. D. Wilkerson and D. N. Hayes. ConsensusClusterPlus: A class discovery tool with confidence assessments and item tracking. Bioinformatics, 26(12):1572–1573, 2010. ISSN 13674803. doi:10.1093/bioinformatics/btq170. URL https://www.ncbi.nlm.nih.gov/pubmed/20427518. → pages 67
[200] L. D. Wood, D. W. Parsons, S. Jones, J. Lin, T. Sjoblom, R. J. Leary, D. Shen, S. M. Boca, T. Barber, J. Ptak, N. Silliman, S. Szabo, Z. Dezso, V. Ustyanksky, T. Nikolskaya, Y. Nikolsky, R. Karchin, P. A. Wilson, J. S. Kaminker, Z. Zhang, R. Croshaw, J. Willis, D. Dawson, M. Shipitsin, J. K. V. Willson, S. Sukumar, K. Polyak, B. H. Park, C. L. Pethiyagoda, P. V. K. Pant, D. G. Ballinger, A. B. Sparks, J. Hartigan, D. R. Smith, E. Suh, N. Papadopoulos, P. Buckhaults, S. D. Markowitz, G. Parmigiani, K. W. Kinzler, V. E. Velculescu, and B. Vogelstein. The genomic landscapes of human breast and colorectal cancers. Science (New York, N.Y.), 318(5853):1108–1113, 2007. ISSN 1095-9203. doi:10.1126/science.1145720. URL https://www.ncbi.nlm.nih.gov/pubmed/17932254. → pages 3
[201] A. W. Wyatt, F. Mo, K. Wang, B. McConeghy, S. Brahmbhatt, L. Jong, D. M. Mitchell, R. L. Johnston, A. Haegert, E. Li, J. Liew, J. Yeung, R. Shrestha, A. V. Lapuk, A. McPherson, R. Shukin, R. H. Bell, S. Anderson, J. Bishop, A. Hurtado-Coll, H. Xiao, A. M. Chinnaiyan, R. Mehra, D. Lin, Y. Wang, L. Fazli, M. E. Gleave, S. V. Volik, and C. C. Collins. Heterogeneity in the inter-tumor transcriptome of high risk prostate cancer. Genome biology, 15(8):426, Aug. 2014. ISSN 1474-760X. doi:10.1186/s13059-014-0426-y. URL http://www.ncbi.nlm.nih.gov/pubmed/25155515. → pages 9
[202] M. Yamada, J. Tang, J. Lugo-Martinez, E. Hodzic, R. Shrestha, H. Ouyang, P. Radivojac, C. Sahinalp, F. Menczer, Y. Chang, A. Saha, H. Mamitsuka, and D. Yin. Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data. IEEE Transactions on Knowledge and Data Engineering, 30(7):1352–1365, 2018. ISSN 1041-4347. doi:10.1109/TKDE.2018.2789451. URL https://doi.org/10.1109/TKDE.2018.2789451. → pages 9
[203] X. Yao, H. Hao, Y. Li, and S. Li. Modularity-based credible prediction of disease genes and detection of disease subtypes on the phenotype-gene heterogeneous network. BMC systems biology, 5(1):79, 2011. ISSN 1752-0509. doi:10.1186/1752-0509-5-79. URL http://www.biomedcentral.com/1752-0509/5/79. → pages 7
[204] E. Yeger-Lotem, L. Riva, L. J. Su, A. D. Gitler, A. G. Cashikar, O. D. King, P. K. Auluck, M. L. Geddie, J. S. Valastyan, D. R. Karger, S. Lindquist, and E. Fraenkel. Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nature genetics, 41(3):316–323, 2009. ISSN 1061-4036. doi:10.1038/ng.337. → pages 6
[205] K. Yoshihara, A. Tajima, D. Komata, T. Yamamoto, S. Kodama, H. Fujiwara, M. Suzuki, Y. Onishi, M. Hatae, K. Sueyoshi, H. Fujiwara, Y. Kudo, I. Inoue, and K. Tanaka. Gene expression profiling of advanced-stage serous ovarian cancers distinguishes novel subclasses and implicates ZEB2 in tumor progression and prognosis. Cancer Science, 100(8):1421–1428, 2009. ISSN 13479032. doi:10.1111/j.1349-7006.2009.01204.x. → pages 24
[206] K. Yoshihara, M. Shahmoradgoli, E. Martínez, R. Vegesna, H. Kim, W. Torres-Garcia, V. Trevino, H. Shen, P. W. Laird, D. A. Levine, S. L. Carter, G. Getz, K. Stemke-Hale, G. B. Mills, and R. G. W. Verhaak. Inferring tumour purity and stromal and immune cell admixture from expression data. Nature communications, 4:2612, 2013. ISSN 2041-1723. doi:10.1038/ncomms3612. URL http://www.ncbi.nlm.nih.gov/pubmed/24113773. → pages 58, 68
[207] K. Yoshihara, Q. Wang, W. Torres-Garcia, S. Zheng, R. Vegesna, H. Kim, and R. G. W. Verhaak. The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene, 34(37):4845–4854, 2014. ISSN 0950-9232. doi:10.1038/onc.2014.406. → pages 23, 35, 36
[208] Y. Yoshikawa, M. Emi, T. Hashimoto-Tamaoki, M. Ohmuraya, A. Sato, T. Tsujimura, S. Hasegawa, T. Nakano, M. Nasu, S. Pastorino, A. Szymiczek, A. Bononi, M. Tanji, I. Pagano, G. Gaudino, A. Napolitano, C. Goparaju, H. I. Pass, H. Yang, and M. Carbone. High-density array-CGH with targeted NGS unmask multiple noncontiguous minute deletions on chromosome 3p21 in mesothelioma. Proceedings of the National Academy of Sciences of the United States of America, 113(47):13432–13437, 2016. ISSN 1091-6490. doi:10.1073/pnas.1612074113. URL http://www.ncbi.nlm.nih.gov/pubmed/27834213. → pages 54, 75
[209] A. Youn and R. Simon. Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics (Oxford, England), 27(2):175–81, Jan. 2011. ISSN 1367-4811. doi:10.1093/bioinformatics/btq630. URL https://www.ncbi.nlm.nih.gov/pubmed/21169372. → pages 2
[210] H. Yu, H. Pak, I. Hammond-Martel, M. Ghram, A. Rodrigue, S. Daou, H. Barbour, L. Corbeil, J. Hebert, E. Drobetsky, J. Y. Masson, J. M. Di Noia, and E. B. Affar. Tumor suppressor and deubiquitinase BAP1 promotes DNA double-strand break repair. Proceedings of the National Academy of Sciences, 111(1):285–290, 2014. ISSN 0027-8424. doi:10.1073/pnas.1309085110. URL http://www.ncbi.nlm.nih.gov/pubmed/24347639. → pages 60
[211] S. Zaccaria, M. El-Kebir, G. W. Klau, and B. J. Raphael. The Copy-Number Tree Mixture Deconvolution Problem and Applications to Multi-sample Bulk Sequencing Tumor Data. In S. C. Sahinalp, editor, Research in Computational Molecular Biology, pages 318–335, Cham, 2017. Springer International Publishing. ISBN 978-3-319-56970-3. doi:10.1007/978-3-319-56970-3_20. URL http://link.springer.com/10.1007/978-3-319-56970-3. → pages 78
[212] Q. Zhang, L. Ding, D. E. Larson, D. C. Koboldt, M. D. McLellan, K. Chen, X. Shi, A. Kraja, E. R. Mardis, R. K. Wilson, I. B. Borecki, and M. A. Province. CMDS: A population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics, 26(4):464–469, 2009. ISSN 13674803. doi:10.1093/bioinformatics/btp708. URL https://www.ncbi.nlm.nih.gov/pubmed/20031968. → pages 3