

COMPUTATIONAL PRIORITIZATION OF CANCER DRIVER GENES FOR PRECISION ONCOLOGY

by

RAUNAK SHRESTHA

B.Tech. Kathmandu University, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

August 2018

© RAUNAK SHRESTHA, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and

Postdoctoral Studies for acceptance, the dissertation entitled:

Computational Prioritization of Cancer Driver Genes for Precision Oncology

submitted by Raunak Shrestha in partial fulfillment of the requirements for

the degree of Doctor of Philosophy

in Bioinformatics

Examining Committee:

Dr. Colin C. Collins, Urologic Sciences

Supervisor

Dr. S. Cenk Sahinalp, Computer Science

Co-supervisor

Dr. Artem Cherkasov, Urologic Sciences

Supervisory Committee Member

Dr. David G. Huntsman, Pathology and Laboratory Medicine

University Examiner

Dr. Leonard Foster, Biochemistry and Molecular Biology

University Examiner

Additional Supervisory Committee Members:

Dr. Yuzhuo Wang, Urologic Sciences

Supervisory Committee Member

Dr. Wan Lam, Pathology

Supervisory Committee Member


Abstract

Advances in high-throughput sequencing technologies have drastically increased the efficiency with which alterations in the genome, transcriptome, proteome, and epigenome of a cancer cell can be assessed. This has increased the computational burden of analyzing these “big data”, making the translation of this knowledge into insightful and impactful patient outcomes extraordinarily challenging.

Among these alterations, only a few “driver” alterations are expected to confer a crucial growth advantage. These are greatly outnumbered by functionally inconsequential “passenger” alterations. This poses a significant challenge for the identification of driver alterations, requiring solutions to novel algorithmic problems. Although insight into driver alterations is critical to guide the selection of appropriate drug therapies for the patient, no specific tools exist to help clinicians contextualize the enormous volume of genomic information when making therapeutic decisions.

In this thesis we describe novel algorithms for the identification and prioritization of cancer driver genes. First, we describe HIT’nDRIVE, a combinatorial algorithm that measures the impact of genomic aberrations on global changes in gene expression patterns to prioritize cancer driver genes. We also demonstrate its application to large multi-omics cancer datasets to guide precision oncology. We further describe an integrative multi-omics characterization of peritoneal mesothelioma, a rare cancer of the abdomen. Here, using HIT’nDRIVE, we identified peritoneal mesothelioma with BAP1 loss as a distinct molecular subtype characterized by distinct gene expression patterns of chromatin remodeling, DNA repair pathways, and immune checkpoint receptor activation. We demonstrate that this subtype is correlated with an inflammatory tumor microenvironment and is thus a candidate for immune checkpoint blockade therapies. Finally, we describe cd-CAP, a combinatorial algorithm to identify subnetworks with conserved molecular alteration patterns across a large subset of a tumor sample cohort. Notably, we demonstrate that many of the largest highly conserved subnetworks within a tumor type consist solely of genes that have been subject to copy number gain, typically located on the same chromosomal arm and thus likely the result of a single, large-scale copy number amplification.


Lay Summary

Cancer arises as a result of deleterious aberrations in the genetic material and its products. The components of the genetic material interact with each other, forming an extremely complex web of networks. The accumulation of abnormalities in the genetic material perturbs critical networks, which may ultimately give rise to a tumor. Although many alterations accumulate in a tumor over its lifetime, only a small fraction, known as “driver” alterations, are critical for tumor growth, while the majority, the “passenger” alterations, are not essential. Identifying driver alterations in the vast milieu of passenger alterations is a challenging task, but it is critical for optimal cancer management.

In this thesis, we describe novel computational methods that use advanced mathematics and computer science techniques to address the problems mentioned above. We demonstrate how our computational tools establish a link between driver alterations and tumor viability, thus revealing novel biological insights into therapeutic strategies. This will guide the selection of appropriate anti-cancer drugs and the development of new ones. We therefore believe that this work will accelerate translation from discovery to effective cancer treatment.


Preface

In conjunction with my advisors, Dr. Colin C. Collins and Dr. S. Cenk Sahinalp, I was involved in the conceptualization and design of the research activities described in this thesis. In particular, I designed, developed, and implemented the computational algorithms described in this thesis. I performed the majority of the data analysis for the molecular characterization of malignant peritoneal mesothelioma. I performed the computational experiments and data analysis, and generated the figures, tables, and text in this thesis. Where there are exceptions, they are noted below.

Chapter 1 was written by me.

The majority of Chapters 2 and 3 was written by me. The HIT’nDRIVE algorithm was developed in collaboration with Mr. Ermin Hodzic, Dr. Gholamreza Haffari, and Dr. S. Cenk Sahinalp. I performed the majority of the data analysis and generated the tables and figures. A portion of the computational experiments was performed by Mr. Ermin Hodzic. Chapters 2 and 3 have been published in:

R. Shrestha, E. Hodzic, J. Yeung, K. Wang, T. Sauerwald, P. Dao, S. Anderson, H. Beltran, M. A. Rubin, C. C. Collins, G. Haffari, and S. C. Sahinalp. HIT’nDRIVE: Multi-driver gene prioritization based on hitting time. Research in Computational Molecular Biology: 18th Annual International Conference, RECOMB 2014, Pittsburgh, PA, USA, April 2-5, 2014, Proceedings, pages 293–306, 2014. doi:10.1007/978-3-319-05269-4_23. URL http://dx.doi.org/10.1007/978-3-319-05269-4_23

and

R. Shrestha, E. Hodzic, T. Sauerwald, P. Dao, K. Wang, J. Yeung, S. Anderson, F. Vandin, G. Haffari, C. C. Collins, and S. C. Sahinalp. HIT’nDRIVE: patient-specific multidriver gene prioritization for precision oncology. Genome Research, 27(9):1573–1588, Sep 2017. ISSN 1549-5469. doi:10.1101/gr.221218.117. URL https://www.ncbi.nlm.nih.gov/pubmed/28768687.

The HIT’nDRIVE software is available at: https://github.com/sfu-compbio/hitndrive

Chapter 4 was written by me. I performed the majority of the data analysis and generated the tables and figures. This work was performed in collaboration with Dr. Noushin Nabavi. Dr. Andrew Churg, Dr. Htoo Zarni Oo, Dr. Antonio Hurtado-Coll, Dr. Ladan Fazli, and Ms. Estelle Li generated the tissue microarray and performed pathological slide staining and slide reviews. Dr. Noushin Nabavi, Mr. Hans H. Adomat, Mr. Robert Shukin, Mr. Brian McConeghy, Ms. Anne Haegert, and Ms. Sonal Brahmbhatt performed experiments and data generation. Dr. Yen-Yi Lin, Dr. Fan Mo, Dr. Stanislav Volik, Mr. Shawn Anderson, and Mr. Robert H. Bell performed various computational experiments. This study was approved by the Institutional Review Board of the University of British Columbia and the Vancouver Coastal Health (REB Number. H1500902 and V15-00902). All samples and information were collected with written and signed informed consent from the participating patients. The pre-print version of this chapter is available at:

R. Shrestha, N. Nabavi, Y.-Y. Lin, F. Mo, S. Anderson, S. Volik, H. H. Adomat, D. Lin, H. Xue, X. Dong, R. Shukin, R. H. Bell, B. McConeghy, A. Haegert, S. Brahmbhatt, E. Li, H. Z. Oo, A. Hurtado-Coll, L. Fazli, J. Zhou, Y. McConnell, A. McCart, A. Lowy, G. B. Morin, M. Daugaard, S. C. Sahinalp, F. Hach, S. Le Bihan, M. E. Gleave, Y. Wang, A. Churg, and C. C. Collins. Integrated Multi-omics Molecular Subtyping Predicts Therapeutic Vulnerability in Malignant Peritoneal Mesothelioma. bioRxiv, 2018. doi:10.1101/243477. URL https://doi.org/10.1101/243477

Chapter 5 was written by me. This work was done in collaboration with Mr. Ermin Hodzic and Mr. Kaiyuan Zhu. I performed the data preparation, developed the algorithm, performed the data analysis, and generated the tables and figures. The pre-print version of this chapter is available at:

E. Hodzic, R. Shrestha, K. Zhu, K. Cheng, C. C. Collins, and S. C. Sahinalp. Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks. bioRxiv, 2018. doi:10.1101/369850. URL https://doi.org/10.1101/369850

The cd-CAP software is available at: https://github.com/ehodzic/cd-CAP


Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
1.1 Cancer driver genes
1.2 Computational methods for the prediction of cancer driver genes
1.2.1 Identification of recurrent somatic alterations
1.2.2 Prediction of functional impact of somatic alterations
1.2.3 Pathway and interaction-network based approaches
1.3 Contributions
1.4 Organization of the thesis

2 HIT’nDRIVE: an algorithm for cancer driver genes prioritization using hitting time
2.1 Introduction
2.2 Our Contributions
2.3 HIT’nDRIVE Algorithmic Framework
2.3.1 Reformulation of RWFL as a Weighted Multi-Set Cover (WMSC) Problem
2.4 Results
2.4.1 HIT’nDRIVE parameters
2.4.2 HIT’nDRIVE: expression outlier stringency
2.4.3 HIT’nDRIVE: random alterations and random expression outliers
2.4.4 HIT’nDRIVE: network perturbation
2.4.5 HIT’nDRIVE: underlying network
2.4.6 Modified HIT’nDRIVE: when it is not required to prioritize at least one driver gene per patient
2.4.7 HIT’nDRIVE’s ability to capture CGC genes
2.4.8 Correlation of predicted driver genes with alteration burden
2.4.9 Phenotype classification using dysregulated modules seeded with the predicted driver genes
2.4.10 CGC cancer type-specific gene enrichment
2.4.11 Phenotype classification using CGC gene seeded modules
2.5 Discussion
2.6 Methods
2.6.1 Datasets and Analysis
2.6.2 Interaction networks
2.6.3 Validation dataset
2.6.4 Derivation of expression outlier genes
2.6.5 Derivation of expression outlier gene weights
2.6.6 Statistical significance of the overlap of driver genes with that of CGC database

3 Application of HIT’nDRIVE: patient-specific multi-driver gene prioritization for precision oncology
3.1 Introduction
3.2 Our Contributions
3.3 Results
3.3.1 HIT’nDRIVE predicts frequent as well as infrequent driver genes in multi-omics cancer datasets
3.3.2 Network properties of cancer driver genes
3.3.3 Breast cancer subtype classification using driver modules
3.3.4 Subtype-specific breast cancer driver modules are associated with survival outcome
3.3.5 HIT’nDRIVE seeded driver genes accurately predict drug efficacy
3.4 Discussion
3.5 Methods
3.5.1 Datasets and analysis
3.5.2 Genomics of drug sensitivity in cancer
3.5.3 Pathway enrichment analysis
3.5.4 Association of driver modules with patients’ survival outcome

4 Integrated multi-omics molecular subtyping predicts therapeutic vulnerability in malignant peritoneal mesothelioma
4.1 Introduction
4.2 Our Contributions
4.3 Results
4.3.1 Patient Cohort description
4.3.2 Landscape of somatic mutations in PeM
4.3.3 Copy number landscape in PeM
4.3.4 Gene fusions in PeM
4.3.5 The global transcriptome and proteome profile of PeM
4.3.6 Transcriptional and post-transcriptional mechanisms regulate chromatin remodeling protein-complexes
4.3.7 BAP1del subtype is characterized by distinct expression patterns of genes involved in DNA repair pathway, and immune checkpoint receptor activation
4.4 Discussion
4.5 Methods
4.5.1 Clinical samples and pathology evaluation
4.5.2 Construction of tissue microarrays (TMAs)
4.5.3 Immunohistochemistry and Histopathology
4.5.4 Whole exome sequencing
4.5.5 Somatic variant calling
4.5.6 Copy number aberration (CNA) calls
4.5.7 Transcriptome sequencing (RNA-seq)
4.5.8 Transcriptome (RNA-seq) quantification
4.5.9 Identification of fusion transcripts and validation
4.5.10 Proteomics analysis using mass spectrometry
4.5.11 Peptide identification and protein quantification
4.5.12 Mutational signature analysis
4.5.13 Prioritization of driver genes using HIT’nDRIVE
4.5.14 Consensus clustering
4.5.15 Protein attenuation analysis
4.5.16 Pathway enrichment analysis
4.5.17 Stromal and immune score
4.5.18 Enumeration of tissue-resident immune cell types using mRNA expression profiles
4.5.19 External datasets

5 Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks
5.1 Introduction
5.2 Literature Review
5.3 Our Contributions
5.4 Algorithmic Framework of cd-CAP
5.4.1 Combinatorial Optimization Formulation
5.4.2 Algorithmic Framework for solving MCSC
5.5 Results
5.5.1 Dataset Used
5.5.2 Maximal Colored Subnetworks Across Cancer Types
5.5.3 Maximal Colorful Subnetworks Across Cancer Types
5.5.4 Multiple-Subnetwork Analysis Across Cancer Types
5.5.5 Empirical P-Value Estimates Confirm the Significance of cd-CAP Identified Networks
5.6 Discussion
5.7 Methods
5.7.1 Significance of the Identified Subnetworks
5.7.2 Pathway enrichment analysis
5.7.3 Association of sub-networks with patients’ survival outcome

6 Conclusion
6.1 Future Perspective

Bibliography


List of Tables

Table 5.1 Five subnetworks identified by cd-CAP in multi-subnetwork mode for each cancer type: respective columns below depict the subnetwork size, depth, and the number of nodes in the subnetwork with copy number amplification (AMP), expression increase (EXP-UP) or decrease (EXP-DOWN).


List of Figures

Figure 2.1 Overview of HIT’nDRIVE algorithm
Figure 2.2 HIT’nDRIVE identified driver genes with respect to varying parameter values in 100 selected BRCA samples
Figure 2.3 HIT’nDRIVE identified driver genes with respect to underlying network used in 100 selected BRCA samples
Figure 2.4 Modified HIT’nDRIVE not required to prioritize at least one driver gene per patient
Figure 2.5 Likelihood of HIT’nDRIVE to capture CGC Genes
Figure 2.6 Correlation between the number of driver genes predicted by HIT’nDRIVE with mutation rate and copy-number burden
Figure 2.7 Phenotype classification using driver-seeded modules
Figure 2.8 Phenotype Classification using CGC Genes Seeded Modules

Figure 3.1 Summary of driver genes prioritized by HIT’nDRIVE
Figure 3.2 Network properties of driver genes
Figure 3.3 BRCA subtype classification using driver modules
Figure 3.4 Drug efficacy predicted by HIT’nDRIVE seeded driver genes

Figure 4.1 Landscape of somatic mutations in PeM tumors
Figure 4.2 Landscape of copy number aberrations in PeM tumors
Figure 4.3 Gene fusions in PeM
Figure 4.4 Transcriptome and proteome profile of PeM
Figure 4.5 Immune cell infiltration in PeM tumors

Figure 5.1 Schematic overview of cd-CAP
Figure 5.2 Conserved colored subnetworks
Figure 5.3 Colorful maximal subnetworks
Figure 5.4 Multiple subnetwork analysis
Figure 5.5 Empirical p-value estimates for the maximum size subnetworks identified by cd-CAP


Glossary

AGC Automatic Gain Control

BAM Binary Alignment Map

BMR Background Mutation Rate

BRCA Breast adenocarcinoma

C-INDEX Concordance-index

CGC Cancer Gene Census

CNA Copy Number Aberration

COSMIC Catalogue of Somatic Mutations in Cancer

CRS Cytoreductive surgery

DEG Differentially expressed genes

DGIDB Drug-Gene interaction database

DNA Deoxyribonucleic acid

EQED eQTL Electrical Diagrams

EQTL Expression Quantitative Trait Loci

FL Facility Location

FDR False Discovery Rate

FFPE Formalin-Fixed Paraffin-Embedded


GBM Glioblastoma multiforme

HIPEC Hyperthermic intraperitoneal chemotherapy

HR Hazard Ratio

HT Hitting Time

IGV Integrative Genomics Viewer

IHC Immunohistochemical

ILP Integer Linear Programming

INDEL Insertion and Deletion

KNN k-nearest neighbour

LP Linear Programming

MFPT Mean First Passage Time

NIPEC Normothermic intraperitoneal chemotherapy

OV Ovarian serous cystadenocarcinoma

PCR Polymerase Chain Reaction

PM Pleural Mesothelioma

PRAD Prostate adenocarcinoma

PSM Peptide Spectral Matches

QPCR Quantitative Polymerase Chain Reaction

RECOMB Research in Computational Molecular Biology

RNA Ribonucleic Acid

RT-PCR Reverse Transcription PCR

RWFL Random Walk Facility Location Problem


RWR Random Walk with Restart

SNV Single Nucleotide Variation

SV Structural Variation

TCGA The Cancer Genome Atlas

TMA Tissue microarray

TMZ Temozolomide

TNBC Triple-Negative Breast Cancer

VCF Variant Calling Format

WMSC Weighted Multi-Set Cover


Acknowledgments

This research was supported in part by the CIHR Bioinformatics Training Program, Prostate Cancer

Foundation - British Columbia (PCF-BC) Research Awards, and Mitacs Accelerate PhD Fellowship.

My deepest gratitude goes to my PhD supervisors, Dr. Colin Collins and Dr. Cenk Sahinalp, for their endless support, encouragement, and guidance throughout my graduate studies. It was their commitment, ideas, and constructive criticism that helped shape my research and allowed me to publish my papers successfully. Under their guidance I have had several opportunities to hone my technical and scientific skills in computer science and biology. From them I have learned not only how to be a good researcher but also many lessons in life beyond the confined boundaries of a laboratory.

I would like to thank my thesis committee members, Dr. Yuzhuo Wang, Dr. Artem Cherkasov, and Dr. Wan Lam, for providing guidance whenever I needed it. Special thanks to Dr. Josh Stuart for agreeing to read and evaluate my thesis as an external examiner. I would also like to thank Dr. David G. Huntsman and Dr. Leonard Foster for agreeing to read and evaluate my thesis as university examiners.

I have been incredibly blessed to be part of the Vancouver Prostate Centre family. My most sincere gratitude goes to my colleagues Mr. Ermin Hodzic, Mr. Kaiyuan Zhu, and Dr. Noushin Nabavi, with whom I have had the opportunity to collaborate closely on my different research projects. I would like to thank Dr. Anna Lapuk and Mr. Kendric Wang, who mentored me during the initial phase of my PhD. I offer my regards to Dr. Mads Daugaard, Dr. Faraz Hach, Dr. Stanislav Volik, Dr. Stephane Le Bihan, and Dr. Alex Wyatt for their invaluable guidance.

I would like to thank my collaborators Dr. Gholamreza Haffari, Dr. Andrew Churg, Dr. Thomas Sauerwald, Dr. Phoung Dao, Dr. Fabio Vandin, and Dr. Kuoyuan Cheng. I would also like to thank my present and former colleagues Dr. Yen-Yi Lin, Dr. Fan Mo, Dr. Dong Lin, Dr. Nilgun Donmez, Dr. Ibrahim Numanagic, Mr. Shawn Anderson, Mr. Hans H. Adomat, Mr. Robert Shukin, Mr. Robert H. Bell, Mr. Brian McConeghy, Ms. Anne Haegert, Ms. Sonal Brahmbhatt, Mr. Jake Yeung, Mr. Salem Malikic, Mr. Alex Gawronski, Mr. Ehsan Haghshenas, Mr. Mike Ford, Mr. Varune Ramnarine, Mr. Hossein Sharifi-Noghabi, and Mr. Hossein Asghari. I benefited greatly from these collaborations, and hope to continue working with them.

Last, but not least, I heartily thank my family for the strong motivation that they gave me to follow

my studies. Their support was invaluable to me.


Chapter 1

Introduction

1.1 Cancer driver genes

Cancer is a major cause of death across the globe and remains a growing challenge to health-care systems. Cancer is characterized by the uncontrolled division (malignant growth) of abnormal cells (tumors) in the body. All cancers arise due to somatically acquired changes in the Deoxyribonucleic acid (DNA), Ribonucleic Acid (RNA), or protein sequences of the cancer cells.

Cancer is a complex disease caused by a combination of different genetic changes. These genetic alterations include, but are not limited to, Single Nucleotide Variation (SNV), Insertion and Deletion (INDEL), Copy Number Aberration (CNA), Structural Variation (SV), gene fusions, changes in the amino-acid sequence of a protein, DNA methylation, and changes in gene- and protein-level expression. Combinations of these genetic alterations dysregulate different oncogenic or tumor-suppressive signaling pathways, thus promoting cancer growth. Furthermore, cross-talk between different signaling pathways is inevitable but often poorly understood [72].

Cancer is an evolutionary disease. The genetic changes that occur in initiating cells (clones) undergo intense evolutionary selection during disease progression and can be widely altered during treatment. Cancers evolve by a reiterative process of clonal selection, clonal expansion, and genetic diversification within the adaptive landscapes of tissue ecosystems [67]. This evolutionary process may lead to sub-clonal divergence, resulting in genetic and molecular heterogeneity.

During tumor progression, cancer cells accumulate a multitude of genomic alterations; however, most are inconsequential “passenger” alterations that are effectively neutral. Nevertheless, a small fraction provide mission-critical “hallmark” functions and are known as “driver” alterations, which modify transcriptional programs and therefore drive and sustain tumor progression [69, 161, 195]. Driver alterations are evolutionarily advantageous for tumor development. They are causally implicated in oncogenesis, and some even trigger disease progression or resistance to therapy. Driver alterations are positively selected during the evolution of the cancer. Improving our knowledge of driver alterations, possibly through an integrative analysis of various omics data, is critical to better understand cancer mechanisms and to select appropriate therapies for specific cancer patients.

1.2 Computational methods for the prediction of cancer driver genes

Computational methods for the prediction of cancer driver genes can be broadly grouped into three strategic approaches: (a) identification of recurrent somatic alterations, (b) prediction of the functional impact of somatic alterations, and (c) pathway and interaction-network based approaches.

1.2.1 Identification of recurrent somatic alterations

In early cancer genomics studies, driver mutations were identified as alterations that appeared more frequently across the patient cohort than expected by chance. These driver mutations were thought to drive the cancer phenotype and provide a selective advantage for the clonal expansion of their lineage.

Recurrent Somatic Mutation

Several popular computational tools such as MutSigCV [96], MuSiC [46], and others [77, 159, 209]

have been developed based on this strategy. These methods aim to identify SNVs that recur more frequently

than expected under the Background Mutation Rate (BMR) in a population of tumors [68, 195, 209]. The

BMR is the probability of observing a passenger mutation at a specific location of the genome. The

main differences between the tools mentioned above lie in how they estimate the BMR and how many

mutational contexts they consider. The BMR is not constant across the genome but depends on the genomic

context. The BMR estimate greatly affects the identification of recurrent mutations. If the estimated BMR is

lower than the true value, it leads to false positives, whereas if it is higher than the true value,

some truly recurrent mutations are missed.
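As a minimal sketch of the recurrence test described above (not the model of any specific tool), the following assumes a single uniform per-base BMR, a hypothetical gene length, and a binomial model over samples. The function name and all parameters are illustrative assumptions.

```python
from math import comb


def recurrence_p_value(n_mutated, n_samples, gene_length, bmr):
    """P(at least n_mutated of n_samples carry a passenger mutation in the gene).

    Assumes (for illustration only) a uniform per-base BMR and independent samples.
    """
    # Probability that one sample has at least one passenger mutation in the gene.
    p_gene = 1.0 - (1.0 - bmr) ** gene_length
    # Upper tail of the binomial distribution.
    return sum(
        comb(n_samples, k) * p_gene**k * (1.0 - p_gene) ** (n_samples - k)
        for k in range(n_mutated, n_samples + 1)
    )


# Same observation scored under two assumed background rates (toy numbers).
p_low_bmr = recurrence_p_value(5, 100, 1500, 1e-6)
p_high_bmr = recurrence_p_value(5, 100, 1500, 1e-5)
```

Under these assumptions, the lower assumed BMR yields the smaller p-value for the same data, which illustrates how an underestimated BMR inflates the number of genes called recurrent.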

Recurrent copy-number alterations (CNAs)

The identification of recurrent CNAs in tumors presents a different set of challenges. Unlike SNVs, CNAs

often affect more than one gene. Somatic CNAs show a large variation in their position and length across

different tumors. For example, an oncogene can be amplified in a tumor because of whole-chromosome

duplication, whereas in other tumors the amplification of the same oncogene may be focal, with the

amplified locus still containing the oncogene. These issues make the identification of somatic driver CNAs more

challenging. Thus, the computational methods developed to study such problems take a non-parametric

approach. Early approaches to identify recurrent somatic driver CNAs relied on the identification of shared

regions of CNAs across the tumor cohort. The statistical significance of such overlaps was assessed by

fixing the length of the alterations but independently permuting their positions across the tumor population. More

recent approaches such as GISTIC2 [115], CMDS [212], JISTIC [147], DiNAMIC [196], and ADMIRE

[188] use more sophisticated models to assess the statistical significance of overlapping CNAs of differ-

ent lengths.

Frequency-based approaches are best suited to study driver genes that are frequently altered across

the tumor population. However, less frequently altered genes dominate and vastly outnumber frequently

altered genes [200]. Recent whole-genome studies have revealed that important genes may be recurrently

altered in only a small fraction of the tumor cohort under study, and can be subtype-specific [128, 170].

Furthermore, in the context of tumor evolution, personalized rare driver genes are likely to arise during

advanced stages and may be isolated to a small fraction of tumor cells [49, 67]. Such rare or personalized

driver alterations may be functionally important and are likely to be missed by the frequency-based

approach.

1.2.2 Prediction of functional impact of somatic alterations

Another approach for distinguishing driver alterations from passenger alterations is to predict the func-

tional impact of a mutation using additional biological information about the sequence and/or structure

of the protein encoded by the mutated gene. These methods are applied to the non-silent SNVs that result

in changes in the amino-acid sequence of the corresponding protein. Several methods have been developed

to predict the effect of SNVs. ANNOVAR [197] provides annotation of transcript variants. FunSeq

[86] includes additional annotation of non-coding elements and regulatory features. MutationAssessor

[139] combines protein domain information with an evolutionary conservation model to identify the functional

impact of somatic mutations. Furthermore, CHASM [29], TransFIC [65], and OncodriveFM [64] use

machine-learning algorithms trained on known cancer mutations to highlight potential driver mutations.

ActiveDriver [137] predicts effects that are related to protein aggregation, protein stability and alterations

of residues targeted by post-translational modification. Other popular methods to assess the effect of

SNVs on protein function include Condel [63], SIFT [157], and PolyPhen [2].

1.2.3 Pathway and interaction-network based approaches

Genes and their protein products act at different hierarchies of biochemical organization. Moreover,

genes/proteins do not act in isolation; rather, they act together with other genes in signaling, regulatory,

or metabolic pathways, collectively known as the interactome. Examination of the collection of identified

somatic alterations in the interactome can lead to a better understanding of cancer progression. However,

the complex nature of the interactome is a confounding factor for the identification of driver genes and

their corresponding signaling interaction network.

Many computational methods have been developed to assess signaling networks or pathways per-

turbed by somatic mutations in cancer. Perhaps the first computational method to consider large scale

genomic variants as driver events is CONEXIC [4]. It correlates genes with highly recurrent CNAs

with variation in gene expression profiles within a Bayesian network. CONEXIC uses a score-guided

search to identify the combinations of driver CNAs that best explain the patterns of gene-expression

modules in the tumor phenotype. Similarly, with no prior knowledge of pathways or protein interac-

tions, MOCA correlates gene mutation information with expression profile changes in other genes [112].

NetBox [32] uses the shortest-path approach to connect the somatically altered genes in an interaction

network and then identify statistically significant connected modules containing potential driver genes.

The method by Suo et al. [165] prioritizes highly mutated genes that interact with a large number of

differentially expressed genes in a gene network. MEMo [37] identifies sets of proximally located genes

from interaction networks, which are also recurrently altered and exhibit patterns of mutual exclusivity

across the patient population. MEMo first defines modules of highly connected nodes in the network

and then assesses if these network modules show mutually exclusive mutations. The RME algorithm [117]

identifies modules with exclusive patterns of mutations using an information theoretic measure to test

for the significance of the observed exclusivity. RME starts from scores that measure the exclusivity of

pairs of genes, and includes only genes mutated with relatively high frequency, limiting its effectiveness

in identifying rare driver mutations. Another approach, (Multi) Dendrix, aims to simultaneously identify

multiple driver pathways, assuming mutual exclusivity of mutated genes among patients, using either a

Markov chain Monte Carlo algorithm [190] or Integer Linear Programming (ILP) [101]. XSEQ [48] uses a

probabilistic model to compute the influence of mutated genes over expression profile changes in other genes

by considering direct gene interactions. Two other methods, PARADIGM [192] and PARADIGM-SHIFT

[125], use Bayesian networks to integrate genomic and transcriptomic data to infer pathways altered in a

patient.

Network propagation based methods

In recent years, network propagation based methods have been used extensively for the identification of

disease-associated genes. The main assumption behind network propagation methods is that genes

that belong to the same phenotype interact with each other, resulting in an amplified biological signal [41].

This group of methods aims to identify the genes that are in close proximity to the known disease genes

in the signaling interaction network. The prior knowledge or experimental measurements obtained from

the genomic, transcriptomic, proteomic, or epigenomic profile of one or more individuals are superimposed on

the network. The signal from the “source” node is then propagated to a distant “target” node through the

edges in the global interaction network. Instead of finding a single path connecting the source to the

target, network propagation methods compute the fraction of the “flow” (originating from the source

node) passing through each of the intermediate nodes/edges to the target node. The fraction of the flow

imitates the probability of using the path in the information propagation process. The network propagation

approach gives us the ability to incorporate multiple data-types (such as mutations, genomic aberrations,

gene-expression, confidence level of interactions, and functional associations of genes) into probabilistic

network models [35]. Because of its ability to predict distant interactions, network propagation

is used in many different disciplines including computer science, engineering, physics, and biology. In

biology, network propagation has been used in the context of gene function prediction, gene-module

discovery, disease gene discovery, disease subtyping, and drug target prediction. Below, I elaborate

further on network propagation based methods built to identify disease genes or cancer driver genes.

The current flow approach is one of the ways to model network propagation. The current flow approach

models the flow of current in an electrical circuit, where each edge has an associated resistance. It is

based on the well-known analogy between random walks (discussed below) and electrical networks,

where the amount of current entering a node or an edge in the network is proportional to the expected

number of random walk visits to the node or edge. eQTL Electrical Diagrams (eQED) [166] integrates

Expression Quantitative Trait Loci (eQTL) analysis with a molecular interaction network using the circuit

network model. To the best of our knowledge, NetQTL [87] is the first method to link CNAs to expression

profile changes within an interaction network and connects specific “causal” aberrant genes with potential

targets in the interaction network. They formulated a Weighted Multi-Set Cover (WMSC) problem and

provided a greedy solution to identify the set of causal genes.

Another network propagation approach is to use a random walk (also known as fluid diffusion, diffusion

kernel, or graph kernel). A random walk, as its name indicates, propagates randomly starting from

a known disease gene (i.e. a seed gene) to its neighbouring genes with equal probability or with a given

prior probability. This iterative random walk process is halted after a certain number of steps. In order

to capture the local neighbourhood of the disease gene, a variant of this process known as Random Walk

with Restart (RWR) is used as an alternative to the halting process. In RWR, a restart parameter specifies

the probability with which the random walker returns to the seed nodes at each step of propagation. In this way,

we can identify genes/proteins interacting with the disease gene, as they are the nodes most often visited

during the random-walk simulation. This approach helps to prioritize genes/proteins and interactions on

the basis of their potential involvement in a particular disease. The network propagation algorithm was first

described by Kondor et al. [92]. Network propagation algorithms have been used to analyze friendship

networks, where edges represent similarity or affinity. They are also the basis of Google’s original PageRank

algorithm [23].
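The Random Walk with Restart idea can be sketched as a simple power iteration. This is an illustrative sketch on a toy undirected network, not the implementation of any method cited here; the function name, the restart value of 0.3, and the gene labels are assumptions, and every node is assumed to have at least one neighbour.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return visiting probabilities of a walker that restarts at the seed genes.

    adj maps each node to a list of its neighbours (undirected, no isolated nodes).
    """
    nodes = sorted(adj)
    # Restart distribution: uniform over the seed genes.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` jump back to a seed ...
        nxt = {v: restart * p0[v] for v in nodes}
        # ... otherwise move to a uniformly chosen neighbour.
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        if sum(abs(nxt[v] - p[v]) for v in nodes) < tol:
            return nxt
        p = nxt
    return p


# Toy path network A - B - C - D with A as the only seed (hypothetical genes).
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, seeds={"A"})
```

On this toy path the scores decrease with distance from the seed, which is the property used to rank candidate genes by proximity to known disease genes.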

Tu et al [183] used a random walk approach on a molecular interaction network to associate causal

genes and pathways explaining a given association and applied the method to the data obtained from

yeast knockout experiments. The method by Kohler et al. [91] and PRINCE [191] use variants of the random

walk algorithm to prioritize disease-associated genes/proteins. Kohler et al. [91] demonstrated that

random walk analysis of interaction networks outperforms local network-based methods, such as shortest

path distances and direct interactions. Yeger-Lotem et al. developed a method, ResponseNet [204], which

was later expanded by Lan et al. [94]. ResponseNet uses a network algorithm to relate genetic perturbations

to the transcriptomic response in a yeast model, thereby identifying sub-networks of regulators mediating the

interactions. ResponseNet formulates a minimum-cost flow optimization problem which aims to maximize

the flow between the source and target while minimizing the cost of the connecting paths. Thus,

by setting the cost of an edge to the negative log of its probability, a high-probability connecting sub-

network is obtained. They provided a Linear Programming (LP) formulation to solve the optimization

problem. HotNet [189] was the first method to use a network propagation (fluid diffusion) approach

[136] to compute a pairwise influence measure between the genes in the (gene interaction) network and

identify sub-networks enriched with mutations. HotNet then derives a two-stage multiple hypothesis test

to reduce the False Discovery Rate (FDR) in sub-network discovery. Another method, TieDIE [130],

extends the heat diffusion strategy of HotNet by leveraging two different types of genomic inputs:

mutated genes and transcription factors. TieDIE identifies a collection of pathways and sub-networks that

associate a fixed set of driver genes with expression profile changes.

Another method, DriverNet [14], aims to correlate genomic alterations with expression profile changes

of target genes, but only among direct interaction partners. The novel feature of DriverNet is that it aims

to find the “minimum” number of potential drivers that can “cover” the targets. DriverNet provides a greedy

approximation algorithm to solve the optimization problem.
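The general idea of a greedy cover can be sketched as follows. This is a generic greedy set-cover heuristic on hypothetical data, shown only to illustrate the “cover” notion; it is not DriverNet's actual algorithm, and the function and gene names are assumptions.

```python
def greedy_driver_cover(covers, targets):
    """Greedily pick candidate drivers until no further target can be covered.

    covers maps each candidate driver to the set of targets it is connected to.
    Returns the chosen drivers (in pick order) and any targets left uncovered.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Pick the candidate covering the most still-uncovered targets
        # (ties broken alphabetically for reproducibility).
        best = max(sorted(covers), key=lambda d: len(covers[d] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break  # remaining targets are not reachable from any candidate
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical candidates g1..g3 and expression-altered targets t1..t4.
toy_covers = {"g1": {"t1", "t2", "t3"}, "g2": {"t3", "t4"}, "g3": {"t4"}}
drivers, missed = greedy_driver_cover(toy_covers, {"t1", "t2", "t3", "t4"})
```

A greedy heuristic of this kind trades optimality for speed: it does not guarantee the smallest possible set of drivers.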

Hitting Time (HT), or First Passage Time, is an alternative approach for estimating node influence

in the (gene interaction) network using network propagation. HT on a network is simply the expected

number of steps (hops) a random walk takes from a source node to first reach a target node. Since HT relies on

the global topology of the network, there are many possible paths that connect the source node to the

target node, even given the sparseness of biological networks. However, we cannot be certain about the

probability of reaching a target node given the number of steps or which path is the most probable. For

this reason, measuring the average hitting time (or Mean First Passage Time (MFPT)) is more reasonable

for pairwise influence calculations [152]. To calculate the average HT, random walk simulation can be

utilized, where the transitions between nodes may have equal probabilities or some pre-defined

probabilities.
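The simulation-based estimate mentioned above can be sketched as follows, assuming equal transition probabilities to all neighbours. The function name, the number of walks, and the toy network are illustrative assumptions.

```python
import random


def simulate_hitting_time(adj, source, target, n_walks=20000, seed=0):
    """Estimate the average hitting time from source to target by simulation.

    adj maps each node to a list of neighbours; each neighbour is chosen with
    equal probability (an assumption of this sketch). The target must be
    reachable from the source.
    """
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(n_walks):
        node = source
        while node != target:
            node = rng.choice(adj[node])
            total_steps += 1
    return total_steps / n_walks


# Toy path network A - B - C (hypothetical genes).
toy_path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
estimate = simulate_hitting_time(toy_path, "A", "C")
```

For this three-node path the exact hitting time from A to C is 4 hops, so the estimate should fall close to that value; the simulation error shrinks as the number of walks grows.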

Liben-Nowell et al. [106] were the first to make use of HT for the link-prediction problem on social

networks. Average HT has been previously used for analyzing state transition graphs in probabilistic

Boolean networks to identify gene perturbations that quickly lead to a desired state of the system [153].

Yao et al [203] estimated the closeness of a candidate gene to a disease of interest by computing the

HT of a random walk that starts at the corresponding disease phenotype and ends at the candidate. Con-

damin et al [39] developed a method for computing exact hitting times in a complex network, depending

on fractal dimension (i.e. density of nodes) and random walk dimension (i.e. source-target distance in the

network). Torchala et al. [181] extended this method using Hill’s algorithm, which makes use of transition

probabilities between the nodes. They also demonstrated that Hill’s algorithm is an efficient method to

calculate the average HT in a network. This was later implemented in C++ as RaTrav [182].

1.3 Contributions

In this thesis, we focus on computational problems involving the identification of cancer driver genes, and their

application to guide precision oncology. Our goal here is to design efficient network propagation based

computational algorithms for cancer driver gene prioritization integrating multi-omics cancer datasets.

More specifically we present the following contributions:

• We introduce HIT’nDRIVE [154, 155], a combinatorial algorithm that measures the potential im-

pact of genomic aberrations on changes in the global expression of other genes/proteins which

are in close proximity in a gene/protein-interaction network. HIT’nDRIVE then prioritizes those

aberrations with the highest impact as cancer driver genes. HIT’nDRIVE formulates the driver

prioritization problem as a “random-walk facility location” (RWFL) problem, which differs from

the standard facility location problem by its use of “hitting time”, the expected number of hops

to reach a “target” gene from a “source” gene, as a distance measure in an interaction network.

HIT’nDRIVE uses “inverse” hitting time as a measure of influence of a source gene over a target

gene to identify the subset of sequence-wise altered/source genes whose overall influence over

expression altered/target genes is the maximum possible.

• Using multi-omics data from different cancer types, we identified both known as well as rare (and

potentially novel) patient-specific driver genes. We also demonstrate that by using HIT’nDRIVE-

identified driver genes and associated “network modules” (sub-networks seeded by driver genes

whose aggregate expression profiles correlate well with the cancer phenotype) as features, it is

possible to perform accurate phenotype classification. In fact, we found a number of breast cancer

subtype-specific driver modules that are associated with patients’ survival outcome. Finally, we

demonstrate that HIT’nDRIVE-identified driver genes accurately predict drug efficacy in pan-

cancer cell lines.

• We present a first-in-field comprehensive integrative multi-omics analysis of a patient cohort of

treatment-naïve peritoneal mesothelioma (PeM) [156]. In a novel contribution, using HIT’nDRIVE,

we identified PeM with BAP1 loss to form a distinct molecular subtype characterized by distinct

gene expression patterns of chromatin remodeling, DNA repair pathways, and immune checkpoint

receptor activation. We also demonstrate that this subtype is correlated with inflammatory tumor

microenvironment and thus a candidate for immune checkpoint blockade therapies. Our findings

reveal BAP1 to be a trackable prognostic and predictive biomarker for PeM immunotherapy that

refines PeM disease classification. This is significant because almost half of PeM cases are now

candidates for these therapies. BAP1 stratification may improve drug response rates in ongoing

phase-I and II clinical trials exploring the use of immune checkpoint blockade therapies in PeM

in which BAP1 status is not considered. This integrated molecular characterization provides a

comprehensive foundation for improved management of a subset of PeM patients.

• Another novel and significant contribution is that we resolved the large discordance between

mRNA and protein expression patterns in the PeM cohort. Most of this discordance is attributed to

chromatin remodeling genes and proteins linked to multimeric protein complexes, the majority of

which are direct protein-interaction partners of BAP1. The discordance between the mRNA and

the protein expression patterns is most likely due to the ubiquitination and degradation of proteins

in these BAP1 regulated complexes to maintain functional stoichiometry.

• Lastly, we present a novel computational method, cd-CAP (combinatorial detection of Conserved

Alteration Patterns), that primarily uses an ILP formulation to identify subnetworks of an interac-

tion network, each with an alteration pattern conserved across (a large subset of) a tumor sample

cohort. cd-CAP simultaneously identifies more than one subnetwork, and each gene within each

subnetwork has labels specific to the alteration types it harbors. Notably, we demonstrate that

many of the largest highly conserved subnetworks within a tumor type solely consist of genes that

have been subject to copy number gain, typically located on the same chromosomal arm and thus

likely a result of a single, large scale copy number amplification. We have also demonstrated that

the subnetworks identified using cd-CAP are associated with patients’ survival outcome and hence

are clinically important.

In addition to our primary contributions to the driver gene identification problems mentioned above,

our other contributions to the field of Computational Biology and Cancer Genomics can be found in

[61, 109, 149, 198, 201, 202].

1.4 Organization of the thesis

The rest of the thesis is organized as follows:

• In Chapter 2, we introduce HIT’nDRIVE, a combinatorial algorithm to prioritize cancer driver

genes. Then we present our experimental results exploring the behaviour of HIT’nDRIVE.

• In Chapter 3, we present an extensive analysis of multi-omics data from multiple cancer types using

HIT’nDRIVE. Here we identify cancer driver genes in the multi-omics cancer datasets mentioned

above and explore their network properties. Then we demonstrate the application of HIT’nDRIVE

to cancer phenotype and subtype classification, and drug efficacy prediction to guide precision

oncology.

• Chapter 4 describes an integrative multi-omics characterization of a patient cohort of a rare cancer,

peritoneal mesothelioma. Here we demonstrate the application of HIT’nDRIVE, which helped us

define a novel molecular subtype of peritoneal mesothelioma. We predicted this subtype would

likely respond to immunotherapy.

• In Chapter 5, we introduce cd-CAP, a combinatorial algorithm to identify sub-networks with conserved

molecular alteration patterns across a large subset of a tumor sample cohort. Then we present

our experimental results analyzing multi-omics data from multiple cancer types using cd-CAP.

• Finally, in Chapter 6, we offer a summary and conclusion of our contributions to cancer driver

gene identification, as well as discussion of possible directions for future work.

Chapter 2

HIT’nDRIVE: an algorithm for cancer driver genes prioritization using hitting time

2.1 Introduction

Genomic and transcriptomic alterations are the major contributors to tumorigenesis and progression

of cancer. Over the past decade, high-throughput sequencing efforts have provided an unprecedented

opportunity to identify these alterations in cancer that can lead to changes in gene regulation, protein

structure, and function [161]. Genomic and transcriptomic data provide unique and complementary

information about a particular tumor, but the translation of “big” molecular data into insightful and

impactful patient outcomes is extraordinarily challenging [195]. As explained in Chapter 1, during tumor

progression, cancer cells accumulate a multitude of genomic alterations with most being inconsequential

“passenger” alterations that are effectively neutral. However, a small fraction provide mission-critical

“hallmark” functions and are known as “driver” alterations that modify transcriptional programs and

therefore drive and sustain tumor progression [69, 161, 195]. The knowledge of driver alterations is

foundational to guide selection of appropriate therapies. For this we need to better integrate different

omics data-types and distinguish critical driver events from others.

Among different strategies explained in Chapter 1, the ones based on mutual exclusivity still fo-

cus on frequent events. The others, based on “information flow” in gene/protein interaction networks,

do not aim to discover cancer drivers, but rather are designed to identify dysregulated sub-networks or

modules. In addition, the notion of influence they employ is based on stationary distribution of “informa-

tion” originating at a particular gene/protein. As a result, none of the available methods aim to identify

rare, patient-specific driver events, based on a time dependent notion of influence. Finally, none of the

available techniques aim to simultaneously consider different types of genomic alterations as potential

drivers.

2.2 Our Contributions

To address the above challenges, in this chapter, we introduce a novel combinatorial method, HIT’nDRIVE

[154, 155], which was first presented at the Research in Computational Molecular Biology (RECOMB)

conference. HIT’nDRIVE is a combinatorial algorithm that measures the potential impact of genomic

aberrations on changes in the global expression of other genes/proteins which are in close proximity

in a gene/protein-interaction network. HIT’nDRIVE then prioritizes those aberrations with the highest

impact as cancer driver genes. HIT’nDRIVE formulates the driver prioritization problem as a “random-

walk facility location” (RWFL) problem, which differs from the standard facility location problem by its

use of “hitting time”, the expected number of hops to reach a “target” gene from a “source” gene, as a

distance measure in an interaction network. HIT’nDRIVE uses “inverse” hitting time as a measure of

influence of a source gene over a target gene to identify the subset of sequence-wise altered/source genes

whose overall influence over expression altered/target genes is the maximum possible.

Since the RWFL problem is NP-hard, we estimate the multi-hitting time based on the independent hitting

times of the drivers to an expression outlier, which provides an upper bound on the multi-hitting time.

Our experiments show that this estimate works well for the human protein interaction network. More

importantly, our estimate enables us to reduce the RWFL problem to a Weighted Multi-Set Cover (WMSC)

problem, for which we give an ILP formulation.

2.3 HIT’nDRIVE Algorithmic Framework

HIT’nDRIVE links alterations at the genomic level to changes at the transcriptome level using a gene/protein

interaction network. For that, it aims to find the smallest set of altered genes that can explain most of

the observed transcriptional changes in the cohort. In other words, HIT’nDRIVE identifies the minimum

number of potential drivers which can cause a user-defined proportion of the downstream expression

effects observed. We formulate this as a Random Walk Facility Location (RWFL) problem, a

combinatorial optimization problem that we introduce here. RWFL generalizes the classical Facility

Location (FL) problem by changing the notion of distance it uses. Given a network, the FL problem defines

the distance between a potential driver gene and an outlier gene as the length of the shortest path between

them. The RWFL problem, in contrast, uses “hitting time” [39, 106], the expected length of a random

walk between the two nodes, as their distance. Under the use of hitting time, the FL problem completely

changes nature: in the classical FL formulation the goal is to associate each outlier gene in the network

with exactly one (the closest) driver gene. In the RWFL formulation, each outlier gene is associated with

multiple drivers (whose collective distance to the outlier will no longer be the shortest pairwise distance),

forming a many-to-many relation. Intuitively, hitting time measures how accessible a particular outlier

gene is from potential drivers. Thus, the RWFL problem asks to find the smallest set of sequence-altered

genes from which one can reach (a good proportion of) outliers within a user defined “multi-hitting

time” - the expected length of the shortest random walk originating from any of the sequence altered

genes, and ending at an outlier.

In order to capture the uncertainty of interactions of genes with their neighbours, HIT’nDRIVE considers a

random walk process which propagates the effect of sequence alteration in one gene to the remainder

of the genes through the network. As a result, the influence is defined to be the inverse of hitting-time,

which is the expected length (number of hops) of a random walk which starts at a given potential driver

gene, and “hits” a given target gene the first time in an interaction network. More specifically, for any

two nodes $u, v \in V$ of an undirected, connected graph $G = (V, E)$, let the random variable $\tau_{u,v}$ denote the

number of hops in a random walk starting from $u$ and visiting $v$ for the first time. Then the hitting time

$H_{u,v}$ is defined as $H_{u,v} = E[\tau_{u,v}]$ [104].
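For small networks, the hitting times to a fixed target can be computed exactly from this definition by first-step analysis: the hitting time is 0 at the target, and elsewhere it equals 1 plus the average hitting time of the neighbours. The sketch below solves the resulting linear system with exact rational arithmetic. It assumes uniform transition probabilities and a connected network; the function name is hypothetical, and this is not necessarily the matrix method used by HIT’nDRIVE.

```python
from fractions import Fraction


def hitting_times_to(adj, target):
    """Exact hitting times from every node to `target` on a connected graph.

    adj maps each node to a list of neighbours (uniform transitions assumed).
    """
    nodes = [u for u in adj if u != target]
    idx = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    # Augmented matrix of h_u - (1/deg u) * sum_{w ~ u, w != target} h_w = 1.
    a = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for u in nodes:
        i = idx[u]
        a[i][i] += 1
        a[i][n] = Fraction(1)
        for w in adj[u]:
            if w != target:
                a[i][idx[w]] -= Fraction(1, len(adj[u]))
    # Gauss-Jordan elimination.
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [x / pivot_value for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    h = {u: a[idx[u]][n] for u in nodes}
    h[target] = Fraction(0)
    return h


# Toy path network A - B - C (hypothetical genes), target C.
toy_path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
h_to_c = hitting_times_to(toy_path, "C")
```

Dense elimination of this kind scales cubically with the number of nodes, so it is only meant to make the definition concrete on toy examples.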

In order to capture synthetic-lethality-like scenarios, HIT’nDRIVE considers multiple aberrant genes

as potential drivers. For that, we define the influence value (of a set of potential driver genes on a target) as

the inverse of multi-hitting time. More specifically, let $U \subseteq V$ be a subset of nodes of $G$ and $v \in (V - U)$ be a single node. We thus define the multi(source)-hitting time $H_{U,v}$ as $H_{U,v} = E[\min_{u \in U} \tau_{u,v}]$.
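An elementary consequence of this definition, given here as a sketch and not necessarily the exact estimate used by HIT’nDRIVE, is that the multi-hitting time is bounded above by the smallest single-source hitting time, since the minimum of the hop counts is never larger than any individual hop count:

```latex
\[
H_{U,v} \;=\; E\!\left[\min_{u \in U} \tau_{u,v}\right]
\;\le\; \min_{u \in U} E\!\left[\tau_{u,v}\right]
\;=\; \min_{u \in U} H_{u,v}
\]
```

Equivalently, the influence of a set of genes on a target, $1/H_{U,v}$, is at least the largest influence of any single gene in the set.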

Now the RWFL problem for a single patient can be described as follows. Let $\mathcal{X}$ be a set of potential

driver genes and $\mathcal{Y}$ be a set of expression altered (outlier) genes. Then, for a user defined $k$, HIT’nDRIVE

aims to return $k$ potential driver genes as the solution to the following optimization problem:

\[
\underset{X \subseteq \mathcal{X},\ |X| = k}{\arg\min}\ \max_{y \in \mathcal{Y}} H_{X,y}
\]

where $H_{X,y}$ denotes the multi-hitting time from the gene set $X$ to the gene $y$.

Like the standard facility location problem, RWFL is NP-hard. In fact, even the problem of computing

the multi-hitting time between a set of nodes in a network and a particular target node is difficult. We

overcome this difficulty by introducing a good estimate on the multi-hitting time that helps us to reduce

the RWFL problem to the Weighted Multi-Set Cover (WMSC) problem, which we solve through an ILP formulation.

(Although the use of set-cover for representing the most parsimonious solution in a bioinformatics context

is not new [75], to the best of our knowledge this is the first use of the multi-set cover formulation for

maximum parsimony.) In this formulation, we use a slightly different objective: given a user defined upper

bound on the maximum multi-hitting time, we now aim to minimize the number of potential drivers

that can “cover” (a user defined proportion of) the outlier genes. For more than one patient, we minimize

the number of drivers that can “cover” (a user defined proportion of) patient-specific outliers such that

each such outlier is covered by potential drivers that are aberrant in that patient.

2.3.1 Reformulation of RWFL as a Weighted Multi-Set Cover (WMSC) Problem

For simplicity, we first describe how HIT’nDRIVE works on single patient data. Given an interaction

network with $\mathcal{X}$ denoting the set of sequence-altered genes (through SNVs or SVs) and $\mathcal{Y}$ denoting

the set of expression-altered genes, HIT’nDRIVE computes the smallest subset of $\mathcal{X}$ whose joint

“influence” over (a user defined fraction of) expression-altered genes is sufficiently high (i.e. above a user

defined threshold). The influence of a set of (sequence-altered) genes $X$ over an expression-altered gene

$g$ is defined as $1/\mathrm{MHT}(X, g)$, where $\mathrm{MHT}(X, g)$ denotes the multi-hitting time, the expected length of the

shortest random walk originating at any one of the genes in $X$ that ends at $g$. Therefore, HIT’nDRIVE

aims to solve the RWFL problem in a network where $\mathcal{X}$ are the “potential facilities” and $\mathcal{Y}$ are the

“requests”.

Since RWFL is a computationally hard problem, and cannot be solved in a reasonable amount of

time in its original formulation, we reduce the RWFL problem to the WMSC problem, for which we

give an ILP formulation. Intuitively, in this new formulation, HIT’nDRIVE associates the genomic

alterations with transcriptomic changes in the form of a bipartite graph $G_{bip}(\mathcal{X}, \mathcal{Y}, \mathcal{E})$, where $\mathcal{X}$ is the

set of aberrant genes, $\mathcal{Y}$ is the set of patient-specific expression-altered genes, and $\mathcal{E}$ is the set of edges.

If gene $x_i$ is mutated in a patient $p$, we set edges between $x_i$ and all of the expression altered genes in

the same patient $(y_j, p)$, where the edges are weighted by the inverse pairwise hitting times $w_{ij} := H^{-1}_{x_i, y_j}$

(Figure 2.1A). The WMSC problem on this representation of data asks to find the smallest subset of X

(as potential drivers) whose total influence (sum of pairwise influence values) over a user defined fraction

of expression-altered genes (for each patient) is sufficiently high.

The reduction from RWFL problem to the WMSC problem is achieved by estimating the multi-

hitting time as a function of independent hitting times of the drivers to an outlier, which provides an

upper bound on the multi-hitting time. The exact individual hitting times are calculated by a matrix

inversion method [173]. The resulting WMSC problem can then be formulated as the ILP below, which

is efficiently solvable by CPLEX (within minutes) for all data sets we considered.

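\[
\begin{aligned}
\min_{x_1, \ldots, x_{|\mathcal{X}|}} \quad & \sum_i x_i \\
\text{s.t.} \quad & \forall i, j: \; x_i = e_{ij} \\
& \forall j: \; \sum_i e_{ij} w_{ij} \ge y_j \, \gamma \, \lambda_j \sum_i w_{ij} \\
& \sum_j y_j \ge \alpha |\mathcal{Y}| \\
& \forall p: \; \underset{\beta \text{ fraction of highest } \lambda_j}{\arg}(y_j) = 1 \\
& x_i, e_{ij}, y_j \in \{0, 1\}
\end{aligned}
\]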

The above ILP formulation for the WMSC problem introduces binary variables $x_i$, $y_j$, $e_{ij}$, respectively,

for each potential driver, expression-alteration event, and edge in the bipartite graph. The objective

of the ILP is to minimize the number of drivers (i.e. the sum of $x_i$ values) subject to four constraints. The

first constraint ensures that a selected driver contributes to the coverage of each of the expression alter-

ation events it is connected to (in each patient, if multiple patients are available). The second constraint

ensures that selected (patient-specific) driver genes contribute enough to cover at least a (γ) fraction of

the sum of all incoming edge weights to each expression alteration event. This constraint corresponds to

setting an upper bound on our estimate on the inverse of multi-hitting time of the selected (patient spe-

cific) drivers on an expression alteration event. The third constraint ensures that the selected driver genes

collectively cover at least an α fraction of the set of expression alteration events. And the fourth con-

straint ensures that for each patient, the top β fraction of expression altered genes with highest weights

($\lambda_j$) are always covered.
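To make the four constraints concrete, the following is a minimal sketch that finds the smallest feasible driver set on a toy instance by exhaustive search. It is not the CPLEX-based ILP solver used in this work: the function name `solve_wmsc`, the toy genes, weights and parameter values are all hypothetical, the first constraint (x_i = e_ij) is applied implicitly by letting a chosen driver contribute all of its edges, and events with no incoming edges are treated as uncoverable (an assumption).

```python
from itertools import combinations
from math import ceil

def solve_wmsc(drivers, outliers, w, alpha, beta, gamma):
    """Smallest driver set satisfying the coverage constraints (exhaustive search).

    drivers:  list of candidate driver names (the set X).
    outliers: dict event -> (patient, lam), one entry per expression-alteration event (Y).
    w:        dict (driver, event) -> edge weight (inverse hitting time).
    Returns (chosen drivers, covered events) or None if no subset is feasible.
    """
    incoming = {j: sum(v for (i, jj), v in w.items() if jj == j) for j in outliers}
    # Fourth constraint: the top beta fraction of events (by lam) in each patient.
    required = set()
    for p in {patient for patient, _ in outliers.values()}:
        events = sorted((j for j in outliers if outliers[j][0] == p),
                        key=lambda j: outliers[j][1], reverse=True)
        required.update(events[:ceil(beta * len(events))])
    for size in range(len(drivers) + 1):          # objective: fewest drivers first
        for chosen in combinations(drivers, size):
            covered = set()
            for j, (_, lam) in outliers.items():
                got = sum(w.get((i, j), 0.0) for i in chosen)
                # Second constraint: selected weight >= gamma * lam_j * total weight.
                if incoming[j] > 0 and got >= gamma * lam * incoming[j]:
                    covered.add(j)
            # Third constraint (alpha) and fourth constraint (beta).
            if len(covered) >= alpha * len(outliers) and required <= covered:
                return set(chosen), covered
    return None

# Toy instance: two patients, three candidate drivers (all values made up).
toy_outliers = {"p1:A": ("p1", 1.0), "p1:B": ("p1", 0.5), "p2:A": ("p2", 1.0)}
toy_w = {("g1", "p1:A"): 0.5, ("g2", "p1:A"): 0.1, ("g2", "p1:B"): 0.4,
         ("g1", "p2:A"): 0.4, ("g3", "p2:A"): 0.2}
loose = solve_wmsc(["g1", "g2", "g3"], toy_outliers, toy_w, alpha=0.6, beta=0.5, gamma=0.5)
strict = solve_wmsc(["g1", "g2", "g3"], toy_outliers, toy_w, alpha=1.0, beta=0.5, gamma=0.5)
```

Raising α from 0.6 to 1.0 in this toy case forces a second driver to be selected, which mirrors the role of α as the overall coverage requirement. Exhaustive search is exponential in the number of candidates and is only meant to illustrate the constraints.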

As indicated above, our ILP formulation for WMSC problem can be generalized to multiple patients

with the objective of minimizing the total number of driver genes across all patients, subject to the

constraint that a user-defined proportion of outlier genes in each of the patients are covered by the subset

of drivers present in that patient.

In order to quantitatively assess the genes identified by HIT’nDRIVE, we extended our previously de-

veloped algorithm, OptDis [44], for de novo identification of modules of small size inside the interaction

network which contain (i.e. are seeded by) at least one predicted driver. The modules are chosen so that

their discriminative power (for phenotype classification) is the greatest among connected sub-networks

of similar size that contain the individual predicted drivers. In general, OptDis performs supervised di-

mensionality reduction on the set of connected sub-networks. It projects the high dimensional space of

all connected sub-networks to a user-specified lower dimensional space of sub-networks such that, in the

new space, samples belonging to the same class are closer and samples from different classes are

more distant from each other (i.e. minimize in-class distance and maximize out-class distance) with respect

to a normalized distance measure (typically L1). Then we use module features (average expression of

genes in the module) for phenotype classification (Figure 2.1B-C). Using such module features, we hope

that the classifier in use does not overfit on rare drivers and is able to generalize the signal coming from

rare drivers to new patients. We report the classification accuracy based on the identified driver-seeded

modules as a means of quantitative validation of our results (in the absence of ground truth). We also

look at the genes that build the chosen modules (of high classification accuracy) in an attempt to identify

cancer-related pathways.

2.4 Results

We have implemented HIT’nDRIVE in C++ and solved the ILP using IBM CPLEX version 12.5.1. We

first tested the behaviour and robustness of HIT’nDRIVE given different parameters used in the algo-

rithm. These in silico experiments were performed using multi-omics data from four major cancer types

- glioblastoma multiforme (GBM) [175], ovarian serous cystadenocarcinoma (OV) [176], breast adenocarcinoma (BRCA) [177], and prostate adenocarcinoma (PRAD) [178] - obtained from The Cancer Genome Atlas (TCGA) data portal. Here we describe the results exploring the behaviour of the HIT’nDRIVE algorithm when used for the analysis of multi-omics cancer datasets. The biologically motivated results obtained using HIT’nDRIVE are discussed extensively in Chapter 3.

2.4.1 HIT’nDRIVE parameters

HIT’nDRIVE uses three user-specified input parameters:

1. α: fraction of outliers to be covered overall (across all patients)

2. β : fraction of outliers to be covered in each patient

3. γ: fractional lower bound on the sum of the incoming edge weights from driver genes selected by

HIT’nDRIVE

HIT’nDRIVE is robust with respect to the changes in α and β but is somewhat sensitive to γ (Fig-

ure 2.2A-B), as expected. However, as γ grows, the driver genes identified by HIT’nDRIVE do not

change but simply grow in number by the addition of new driver genes, which indicates robustness of

our method with respect to γ as well.

2.4.2 HIT’nDRIVE: expression outlier stringency

The higher the stringency we apply on the expression value change in a potential outlier, the fewer outliers we will identify, which in turn will result in fewer driver genes. However, the new set of driver genes obtained is, in general, a subset of the first set of driver genes, again indicating robustness (Figure 2.2C-D).

2.4.3 HIT’nDRIVE: random alterations and random expression outliers.

We compared the HIT’nDRIVE predictions of driver genes among observed mutations with those ob-

tained through randomized mutations (Figure 2.2E) and random outliers (Figure 2.2F). There is a stark

contrast between the two sets of driver gene predictions with respect to their overlap with the Cancer

Gene Census (CGC) [59] data set - conserved through different values of the γ parameter (the overlap is

generally preserved across various settings of the remaining two parameters, namely α and β ). Driver

genes predicted from the non-randomized alteration (or non-randomized outlier) data (i) included a larger number of CGC genes (i.e. more true driver genes) than driver genes predicted from randomized alteration (or randomized outlier) data, and (ii) the number of CGC driver genes predicted from non-randomized data increased quickly with increasing γ, whereas it stayed roughly the same when randomized data were used. Note that while performing randomization, the original gene labels (sequence-wise altered genes or expression-outlier genes) were randomly replaced by new ones while preserving their recurrence frequency distributions.
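One way to read the label randomization described above is as a one-to-one relabeling of the observed genes. The sketch below illustrates that reading only; the function names, the gene universe `G0..G49` and the toy patient data are hypothetical, and the actual randomization procedure used for Figure 2.2E-F may differ in detail.

```python
import random

def randomize_labels(patient_genes, universe, seed=0):
    """Replace every observed gene label by a randomly drawn label from `universe`.

    The mapping is one-to-one, so each patient keeps the same number of genes and
    the distribution of recurrence frequencies across genes is unchanged.
    """
    rng = random.Random(seed)
    observed = sorted({g for genes in patient_genes.values() for g in genes})
    replacement = rng.sample(sorted(universe), len(observed))
    mapping = dict(zip(observed, replacement))
    return {p: {mapping[g] for g in genes} for p, genes in patient_genes.items()}

def recurrence_profile(patient_genes):
    """Sorted per-gene recurrence counts (number of patients carrying each gene)."""
    counts = {}
    for genes in patient_genes.values():
        for g in genes:
            counts[g] = counts.get(g, 0) + 1
    return sorted(counts.values())

# Toy cohort (made-up alterations) and a hypothetical background gene universe.
observed_alterations = {"p1": {"TP53", "PTEN"}, "p2": {"TP53"}, "p3": {"TP53", "ERG"}}
gene_universe = {f"G{k}" for k in range(50)}
randomized = randomize_labels(observed_alterations, gene_universe, seed=1)
```

Because the relabeling is a bijection on the observed genes, a gene altered in three patients is still represented by some label altered in three patients after randomization.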

2.4.4 HIT’nDRIVE: network perturbation

We used the STRING v10 network for our analysis. The edges of the STRING v10 network were perturbed to different extents (between 1% and 10%) while preserving the degree of the nodes in the network. HIT’nDRIVE analysis was then performed on each perturbed network, and the proportion of driver genes shared between the unperturbed network and each perturbed network was calculated (Figure 2.3F). The list of driver genes changed little relative to the unperturbed network (i.e. the overlap of driver genes was very high), even when up to 10% of the edges were perturbed. This indicates that HIT’nDRIVE is robust to network perturbations.
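A common way to perturb edges while preserving node degrees is the double-edge swap. The sketch below uses that technique as an assumption, since the text does not name the exact procedure; the function names and the toy ring network are hypothetical, and the input is assumed to contain no self-loops.

```python
import random
from collections import Counter

def perturb_edges(edges, fraction, seed=0, max_tries=10000):
    """Rewire roughly `fraction` of the edges using double-edge swaps.

    Each swap replaces (a, b), (c, d) with (a, d), (c, b): two edges change,
    but every node keeps exactly the same degree. Assumes no self-loops.
    """
    rng = random.Random(seed)
    current = {frozenset(e) for e in edges}
    target_swaps = int(fraction * len(current) / 2)  # each swap rewires two edges
    done = tries = 0
    while done < target_swaps and tries < max_tries:
        tries += 1
        e1, e2 = rng.sample(sorted(current, key=sorted), 2)
        a, b = sorted(e1)
        c, d = sorted(e2)
        if rng.random() < 0.5:                       # pick one of the two rewirings
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # Reject swaps that would create self-loops or duplicate edges.
        if len({a, b, c, d}) < 4 or new1 in current or new2 in current:
            continue
        current -= {e1, e2}
        current |= {new1, new2}
        done += 1
    return current

def degrees(edge_collection):
    """Node degree counts for an iterable of two-node edges."""
    deg = Counter()
    for edge in edge_collection:
        for node in edge:
            deg[node] += 1
    return deg

# Toy ring network of 20 nodes, perturbing about 10% of its edges.
ring = [(f"n{i}", f"n{(i + 1) % 20}") for i in range(20)]
perturbed = perturb_edges(ring, fraction=0.1)
```

Re-sorting the edge set on every iteration keeps the sketch deterministic for a given seed but would be too slow for a network of STRING's size; an indexed edge list would be the natural replacement there.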

2.4.5 HIT’nDRIVE: underlying network

We evaluated the robustness of HIT’nDRIVE on three networks, namely STRING, HPRD and REACTOME. Only 34% of the vertices are shared by all three networks, and an even smaller proportion of the edges. Not surprisingly, the more nodes the network has, the more driver genes HIT’nDRIVE predicts. This is consistently observed across various parameter settings. What is noteworthy is that the percentage overlap between the driver genes predicted on the three networks is quite robust, i.e. the percentage of driver genes shared between all three networks is preserved across various parameter settings. For example, this overlap is above 60% between REACTOME and either of the other two networks, across various values of γ. In fact, the driver genes predicted on STRING are almost a superset of those predicted on REACTOME (Figure 2.3A-E).

2.4.6 Modified HIT’nDRIVE: when it is not required to prioritize at least one driver gene per patient

In HIT’nDRIVE, at least one gene is picked per patient (i.e. when β > 0). This constraint is based on the implicit assumption that at least one causal mutation should be driving cancer. There could be exceptions to this: for example, the driver event could be something other than a genomic alteration, such as methylation or aberrant expression of a regulatory RNA or a metabolite. Such events could all be incorporated in our framework given matching data, which unfortunately is not available through TCGA. There are also important performance issues related to the value of β: (1) Setting β > 0 significantly improves the robustness of our method with respect to the α parameter. In Figure 2.4, it can be observed that the α parameter has minimal effect on the output of our method, provided β is non-zero. If β = 0 (i.e. patients do not necessarily have one driver gene), our method is less robust, as can be seen in Figure 2.4B. In Figure 2.4C, especially for small values of α, the number of patients that do not have a driver gene increases as the value of γ decreases. In the worst case, ∼40% of patients do not report a driver gene; this happens when α = 0.5 and γ = 0.02. To guarantee robustness, the γ value should be set above 0.2 and the α value should be set above 0.7, which reduces the fraction of patients with no driver genes to 5%. (2) Setting β = 0 significantly increases the running time of our method, from a couple of minutes to several days on very large datasets.

2.4.7 HIT’nDRIVE’s ability to capture CGC genes

To check whether HIT’nDRIVE is able to capture the true driver genes, we performed the following analysis. For the sake of this analysis, let us first assume that the cancer-type specific genes listed in the CGC database are the true driver genes, i.e. the ground truth. We predicted potential driver genes in patients from four major cancer types using HIT’nDRIVE (for details see Chapter 3). For every patient analyzed, we compared the input (i.e. all sequence-wise altered genes) and the output (i.e. the subset of the input sequence-wise altered genes that are predicted as potential driver genes) of HIT’nDRIVE. We compared the number of CGC true driver genes present in the input data with the number of CGC true driver genes captured by HIT’nDRIVE.

Figure 2.5A-D summarizes the results of this analysis. As can be seen, the likelihood of a sequence-wise altered CGC gene being prioritized by HIT’nDRIVE is much higher than that of a non-CGC gene. Next, for each patient, we calculated the likelihood of HIT’nDRIVE capturing CGC genes (see Section 2.6.6 for details). We found that the majority of the samples analyzed have a significant p-value (i.e. < 0.01) (Figure 2.5E). This analysis demonstrates that HIT’nDRIVE is able to capture cancer driver genes, to a large extent, in the patient samples analyzed.

2.4.8 Correlation of predicted driver genes with alteration burden.

To obtain the mutation rate, we calculated the somatic mutation frequency per Mb (considering mutations

in protein-coding genes only). We obtained copy-number burden values (i.e. percentage of somatic copy-

number genome changed) using BioDiscovery Nexus Copy Number software. Figure 2.6A summarizes

the correlation between mutation rate and copy-number burden. As reported in many recent studies,

samples in OV, PRAD and BRCA had high copy-number burden. In the case of GBM, the majority of samples

had more or less equal mutation and copy-number burden.

Figure 2.6B shows the correlation of the number of HIT’nDRIVE predicted driver genes with mutation rate. Except for a few highly mutated samples in BRCA, the number of driver genes predicted by HIT’nDRIVE was not correlated with the somatic mutation rate of the respective sample. Finally, Figure 2.6C shows the correlation of the number of HIT’nDRIVE predicted driver genes with copy-number burden. Here too, we observed that the number of HIT’nDRIVE predicted driver genes was largely independent of the somatic copy-number burden in the genome. Therefore, except for the hypermutated cases, the number of HIT’nDRIVE predicted driver genes is independent of both mutation rate and copy-number burden.

2.4.9 Phenotype classification using dysregulated modules seeded with the predicted driver genes

Evaluating computational methods for predicting cancer driver genes is challenging in the absence of the

ground truth (i.e. follow-up biological experiments). Therefore, we mainly focused on testing whether

our predictions provide insight into the cancer phenotype and improve classification accuracy on an

independent cancer dataset. To test association of the driver genes identified by HIT’nDRIVE with the

cancer phenotype, as explained in the earlier section, we used the driver gene seeded gene-modules, a

set of functionally related genes (e.g. in a signaling pathway), from the protein interaction network, as

features for classifying the cancer phenotype. Using OptDis (here referred to as HIT’nDRIVE-OptDis),

we identified small connected sub-networks that include (i.e. are seeded by) predicted driver genes in

a greedy fashion. More specifically, we prioritized sub-networks (of at most seven genes) iteratively so

that in each iteration we identified the sub-network that maximally discriminates sample phenotypes

in a gene-expression matrix, among the sub-networks that share very few genes (at most 20%) with the

sub-networks already prioritized.
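The iterative selection with an overlap cap can be illustrated with the sketch below. It is not OptDis itself: the discriminative scores are taken as given inputs, the candidate modules and scores are made up, and "overlap" is assumed to mean the fraction of a candidate's genes already present in previously prioritized sub-networks.

```python
def prioritize_modules(candidates, max_overlap=0.2, max_size=7):
    """Greedily pick candidate sub-networks by decreasing discriminative score.

    candidates: list of (genes, score) pairs; scores are assumed precomputed.
    A candidate is kept only if it has at most `max_size` genes and at most
    `max_overlap` of its genes already appear in the sub-networks chosen so far.
    """
    used = set()
    chosen = []
    for genes, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        genes = frozenset(genes)
        if len(genes) > max_size:
            continue
        if len(genes & used) / len(genes) <= max_overlap:
            chosen.append((genes, score))
            used |= genes
    return chosen

# Toy candidates (hypothetical gene names and scores).
toy_candidates = [
    ({"G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"}, 0.99),  # too large, skipped
    ({"ERG", "A", "B", "C", "D"}, 0.95),
    ({"ERG", "A", "E"}, 0.90),                                # 2/3 overlap, skipped
    ({"PTEN", "F", "G", "H", "D"}, 0.85),                     # 1/5 overlap, kept
]
selected = prioritize_modules(toy_candidates)
```

Since the scores here are fixed, a single pass over the candidates sorted by score is equivalent to repeatedly picking the best remaining admissible sub-network.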

Furthermore, we have also developed an unsupervised method for module identification (here re-

ferred to as HIT’nDRIVE-unsupervised), i.e. one that does not depend on any phenotype information.

This unsupervised method seeds each module with one HIT’nDRIVE identified driver gene, and includes the outlier genes that the driver has influence over and co-occurs with significantly across patients. For this, we performed a hypergeometric test to identify significant driver-outlier interaction (i.e. mutual presence) pairs across the patient cohort (p-value < $10^{-3}$).
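As an illustration of the co-occurrence test, the sketch below computes a one-sided hypergeometric tail probability from patient sets. The exact parameterization used in this work is not spelled out in the text, so the choice of population (all patients), successes (patients carrying the driver) and draws (patients carrying the outlier) is an assumption, as are the function names and the toy patient identifiers.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N patients, K with the driver, n with the outlier)."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def cooccurrence_pvalue(driver_patients, outlier_patients, all_patients):
    """One-sided p-value for the driver and the outlier appearing in the same patients."""
    k = len(driver_patients & outlier_patients)
    return hypergeom_sf(k, len(all_patients), len(driver_patients), len(outlier_patients))

# Toy cohort of 10 patients where the driver and the outlier always co-occur.
cohort = {f"p{i}" for i in range(1, 11)}
p_cooccur = cooccurrence_pvalue({"p1", "p2", "p3"}, {"p1", "p2", "p3"}, cohort)
```

In this toy case the p-value is 1/120, which would not pass a 10^-3 cutoff; in a small cohort even perfect co-occurrence of a rare pair may fall short of a strict threshold.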

Here we compare HIT’nDRIVE-OptDis and HIT’nDRIVE-unsupervised to another network-based driver gene prioritization method, DriverNet [14]. DriverNet itself does not aim to identify modules that we can compare against HIT’nDRIVE-OptDis or HIT’nDRIVE-unsupervised modules. Rather, DriverNet identifies driver genes in an iterative fashion, where in each iteration, DriverNet picks the driver gene which “covers” the maximum number of uncovered outliers. We use this driver and the outlier genes it covers as the “next” DriverNet module.
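The greedy covering step described above can be sketched as follows. This is an illustration of the covering idea only, not DriverNet's implementation: the function name and toy data are hypothetical, ties are broken alphabetically, and each module is assumed to contain only the outliers newly covered in that iteration.

```python
def greedy_cover_modules(covers):
    """Build modules by repeatedly picking the driver covering most uncovered outliers.

    covers: dict driver -> set of outlier events the driver is directly connected to.
    Returns a list of (driver, newly covered outliers) in the order they were picked.
    """
    uncovered = set().union(*covers.values()) if covers else set()
    remaining = dict(covers)
    modules = []
    while uncovered and remaining:
        # Alphabetical order makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda d: len(remaining[d] & uncovered))
        gain = remaining.pop(best) & uncovered
        if not gain:
            break
        modules.append((best, gain))
        uncovered -= gain
    return modules

# Toy driver-to-outlier connections (made up).
toy_covers = {"g1": {"o1", "o2", "o3"}, "g2": {"o3", "o4"}, "g3": {"o4"}}
toy_modules = greedy_cover_modules(toy_covers)
```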

We used the set of prioritized sub-networks, i.e. the driver modules, first, to perform binary sample

classification: tumor vs normal. For this, we used gene-expression data for each of the four cancer

types (GBM, OV, PRAD and BRCA) from TCGA as discovery datasets to calculate the mean gene

expression value for each sub-network/driver module, for each patient. On these sub-networks, we used

the k-nearest neighbour (KNN) classifier (with k = 1), to perform classification on both the expression

values from TCGA, and additional validation gene-expression datasets (Figure 2.7A-C). The additional

validation datasets were used in order to assess the capability of the modules identified on TCGA cohort,

in classifying other cohorts.
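The module-feature classification can be sketched as below: each sample is summarized by the mean expression of each module, and a query sample takes the label of its nearest training sample (k = 1). The L1 distance, the function names and the toy expression values are assumptions made for illustration.

```python
def module_features(expression, modules):
    """Mean expression of each module's genes for one sample (dict gene -> value)."""
    return [sum(expression[g] for g in genes) / len(genes) for genes in modules]

def predict_1nn(train, query):
    """Label of the training sample closest to `query` (L1 distance, k = 1).

    train: list of (feature_vector, label) pairs.
    """
    def l1(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))
    return min(train, key=lambda item: l1(item[0], query))[1]

# Two toy modules and three toy samples (expression values are made up).
toy_module_genes = [["ERG", "PTEN"], ["FOXA1"]]
tumor_sample = {"ERG": 8.0, "PTEN": 2.0, "FOXA1": 6.0}
normal_sample = {"ERG": 2.0, "PTEN": 6.0, "FOXA1": 3.0}
query_sample = {"ERG": 7.0, "PTEN": 3.0, "FOXA1": 5.5}

training = [(module_features(tumor_sample, toy_module_genes), "tumor"),
            (module_features(normal_sample, toy_module_genes), "normal")]
predicted = predict_1nn(training, module_features(query_sample, toy_module_genes))
```

For a validation cohort, the same module definitions would be reused and only the expression values would change, which is what allows modules found on one cohort to be tested on another.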

For every dataset analyzed, the maximum classification accuracy achieved by HIT’nDRIVE modules (either HIT’nDRIVE-unsupervised or HIT’nDRIVE-OptDis), for any number of modules considered, was higher than that achieved by DriverNet modules (Figure 2.7A). Moreover, in most datasets, HIT’nDRIVE methods achieve maximum or near-maximum accuracy using a smaller fraction of modules. All three methods achieved perfect or near-perfect classification accuracy in the TCGA-GBM, TCGA-OV and TCGA-BRCA datasets; the exception was the TCGA-PRAD dataset, where the maximum classification accuracy achieved was 90% by HIT’nDRIVE-unsupervised, 95% by HIT’nDRIVE-OptDis and 86% by DriverNet. Overall, the driver modules (identified in one cohort) were able to distinguish the tumor phenotype from normal very well in validation datasets (on other cohorts), supporting the relevance of the identified driver genes to the cancer phenotype.

2.4.10 CGC cancer type-specific gene enrichment.

Next, we looked into the list of prioritized driver genes by both HIT’nDRIVE and DriverNet and their

overlap with the known CGC genes (Figure 2.7B). DriverNet selects a much larger number of driver

genes, as compared to HIT’nDRIVE, to cover most outlier genes (across all four cancer types) due

to its model considering only direct interactions in the network. In particular, in OV and BRCA, the

number of HIT’nDRIVE identified driver genes is an order of magnitude smaller than that of DriverNet.

Although in GBM and PRAD datasets, the number of driver genes identified by DriverNet is somewhat

lower and comparable to that identified by HIT’nDRIVE (primarily because most outliers were filtered

out due to sharing no interaction edge with candidate altered genes), HIT’nDRIVE identified driver

genes cover a significantly larger number of outliers. More importantly, even though HIT’nDRIVE

identifies a smaller number of driver genes, a larger fraction of these driver genes can be found in CGC

database - in comparison to the DriverNet identified driver genes. In fact, an even larger fraction of CGC

genes specific to the relevant cancer type can be found among HIT’nDRIVE identified driver genes.

Specifically, HIT’nDRIVE predicted four glioblastoma specific CGC genes (IDH1, PDGFRA, PIK3CA

and PIK3R1) in TCGA-GBM dataset. Among them, IDH1, PDGFRA and PIK3CA were not identified by

DriverNet. Similarly, four ovarian cancer specific CGC genes (BRCA1, BRCA2, CCNE1 and MAPK1)

were predicted in TCGA-OV dataset. CCNE1 was not identified by DriverNet. Five prostate cancer

specific CGC genes (BRAF, ERG, FOXA1, PTEN and SPOP) were predicted in TCGA-PRAD dataset.

BRAF and SPOP were not identified by DriverNet. And seven breast cancer specific CGC genes (BRCA2,

CCND1, CDH1, GATA3, MAP3K1, PIK3CA and TP53) were predicted in TCGA-BRCA dataset. Among

them, CDH1 and MAP3K1 were not identified by DriverNet.

2.4.11 Phenotype classification using CGC gene seeded modules

To evaluate the difference between HIT’nDRIVE predicted driver genes and a list of known driver genes

(from CGC), we performed the following experiments. First, using HIT’nDRIVE-OptDis, we compared

the HIT’nDRIVE driver seeded module with CGC gene seeded module to classify tumor vs normal

samples in TCGA-PRAD patient cohort. Note that among the four TCGA cancer cohorts we study in

this paper, only the PRAD cohort includes a non-trivial number of patients with no known driver genes

(based on an unpublished study by PCAWG project) and thus provides a good testbed for novel driver

gene identification by HIT’nDRIVE. As can be seen, HIT’nDRIVE identified driver seeded modules

provide higher classification accuracy, potentially due to novel driver genes identified by HIT’nDRIVE.

The top HIT’nDRIVE modules associated with PRAD are seeded by (in the order of discriminative

ability) ERG, ACAN, FOXA1, ERG, PTEN and CDKN1B (Figure 2.8A). All but ACAN are CGC genes

associated with PRAD. HIT’nDRIVE successfully identified all these driver genes without the use of any

information related to known PRAD driver genes from CGC. In addition, HIT’nDRIVE identified ACAN,

a non-CGC gene as a potential driver gene of PRAD. In comparison, the modules identified for CGC

PRAD driver genes were seeded by (again in the order of discriminative ability) ERG, FOXA1, NCOR2,

BRAF, ERG and AR - missing PTEN due to potentially large overlap with other modules. Overall, the

modules seeded by HIT’nDRIVE identified driver genes provide a higher accuracy in discriminating

PRAD than CGC PRAD driver genes.

Next, we compared HIT’nDRIVE driver genes to CGC genes in breast cancer subtypes in the TCGA-BRCA patient cohort. Note that breast cancer is possibly the best studied cancer type with respect to driver genes. Thus it is not surprising that the Basal, Her2 and Luminal-B subtypes show negligible differentiation between HIT’nDRIVE predictions and CGC-based predictions (Figure 2.8B). This is due to the large overlap between HIT’nDRIVE discovered modules and CGC modules (e.g. in Basal, the top 4 HIT’nDRIVE modules almost perfectly match the top 4 CGC modules). However, HIT’nDRIVE shows some advantage in Luminal-A, where it outperformed the CGC genes from the 43rd module onward. This may be due to HIT’nDRIVE predicted driver genes (seeds) such as DMD, ROCK1, AGAP1 and SHANK2, which are not part of CGC and play an important role in cancer.

2.5 Discussion

Here, we have presented a network-based combinatorial method, HIT’nDRIVE, which models the collective effects of sequence-altered genes on expression-altered genes. HIT’nDRIVE aims to solve the “random-walk facility location” (RWFL) problem on a gene/protein interaction network, which differs from the standard facility location problem by its use of “hitting time”, the expected minimum number of hops in a random walk originating from any sequence-altered gene (i.e. a potential driver) to reach an expression-altered gene, as a distance measure. We introduced the notion of “multi-hitting time” and presented efficient and accurate methods to estimate it based on single-source hitting time in large-scale networks. HIT’nDRIVE reduces RWFL (with multi-hitting time as the distance) to a weighted multi-set cover problem, which it formulates and solves as an ILP.

As a measure of influence, hitting time (the expected length of a random walk between two nodes), or its general version, the multi-hitting time, is quite different from the diffusion-based measures or Rooted PageRank, which are based on asymptotic distributions. We argue that hitting time is a better measure for our purposes as it is (i) parameter free (the diffusion model introduces at least one additional parameter, the proportion of incoming flow “consumed” at a node in each time step), (ii) time dependent (while the diffusion model and PageRank measure the stationary behavior) and (iii) more robust with respect to small perturbations in the network [74].
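To illustrate that hitting time needs no tuning parameter, the sketch below computes exact single-target hitting times on a small unweighted graph by solving the linear system h_v = 1 + Σ_u P(v,u) h_u with h_target = 0. It uses plain Gauss-Jordan elimination rather than the matrix-inversion method of [173]; the function name, the toy path graph and the uniform transition probabilities are assumptions.

```python
from fractions import Fraction

def hitting_times(adj, target):
    """Expected number of steps for a simple random walk to first reach `target`.

    adj: dict node -> list of neighbours (undirected, connected graph).
    Returns dict node -> hitting time as a Fraction (0 for the target itself).
    """
    nodes = [v for v in adj if v != target]
    idx = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    # Augmented system (I - P) h = 1, restricted to the non-target nodes.
    rows = []
    for v in nodes:
        row = [Fraction(0)] * (n + 1)
        row[idx[v]] += 1
        for u in adj[v]:
            if u != target:
                row[idx[u]] -= Fraction(1, len(adj[v]))
        row[n] = Fraction(1)
        rows.append(row)
    # Gauss-Jordan elimination in exact rational arithmetic.
    for c in range(n):
        pivot = next(r for r in range(c, n) if rows[r][c] != 0)
        rows[c], rows[pivot] = rows[pivot], rows[c]
        pv = rows[c][c]
        rows[c] = [a / pv for a in rows[c]]
        for r in range(n):
            if r != c and rows[r][c] != 0:
                f = rows[r][c]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[c])]
    h = {v: rows[idx[v]][n] for v in nodes}
    h[target] = Fraction(0)
    return h

# Path graph a - b - c with c as the target node.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
H = hitting_times(path_graph, "c")
influence = {v: 1 / H[v] for v in H if v != "c"}  # inverse hitting time as influence
```

On the three-node path the walk from the far end needs 4 expected steps and the walk from the middle needs 3, so the closer node receives the larger inverse-hitting-time influence.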

In this chapter, we demonstrated the robustness of HIT’nDRIVE in identifying cancer driver genes in multi-omics cancer datasets using a number of different experiments, such as varying the user-defined parameters of HIT’nDRIVE, randomizing the input data, randomizing the interaction network, and using different interaction networks. We also demonstrated that HIT’nDRIVE is able to capture cancer driver genes, to a large extent, in the tumors analyzed. Furthermore, we demonstrated that it is also possible to perform accurate phenotype prediction for tumor samples by only using HIT’nDRIVE-implied driver genes and their “network modules of influence” (small sub-networks involving each driver gene where the aggregate expression profile correlates well with the cancer phenotype) as features, providing additional evidence that these genes may be driving the cancer phenotype. The network modules we identified may provide new insights into the biological mechanisms underlying tumor progression.

2.6 Methods

2.6.1 Datasets and Analysis

We used publicly available datasets of four major cancer types: glioblastoma multiforme (GBM) [175], ovarian serous cystadenocarcinoma (OV) [176], breast adenocarcinoma (BRCA) [177], and prostate adenocarcinoma (PRAD) [178] from The Cancer Genome Atlas (TCGA) project. All data were obtained from the TCGA data portal in May 2014 and were mapped to the GRCh37 genome build. Although TCGA has recently made available all data re-aligned to the newer GRCh38 genome build, to ensure compatibility, all TCGA data we have used in this study have been mapped to GRCh37.

Somatic mutation

Somatic mutation calls (level 2 data) from all available platforms/centres were merged. Only missense,

nonsense and splice-site mutations were marked as somatic-mutation alteration events.

Copy number aberrations (CNAs)

For GBM and OV, Agilent Human Genome CGH Microarray 244A (level 1) data files were used, and for PRAD and BRCA, Affymetrix Genome-Wide Human SNP Array 6.0 (level 3) data files were used to generate the copy number profiles.

These Agilent FE format sample files were loaded into BioDiscovery Nexus Copy Number software v7.0, where quality was assessed and data were visualized and analyzed. All samples were mapped to the most recent genome build (hg19, NCBI build 37) via Agilent probe identifiers and annotation (downloaded from Agilent’s website) based on the 1M SurePrint G3 Human CGH Microarray 1x1M design platform. BioDiscovery’s FASST2 segmentation algorithm, a Hidden Markov Model based approach, was used to make copy number calls. The FASST2 algorithm, unlike other common HMM methods for copy number estimation, does not aim to estimate the copy number state at each probe but uses many states to cover more possibilities, such as mosaic events. These state values are then used to make calls based on a log-ratio threshold. The significance threshold for segmentation was set at $5 \times 10^{-6}$, also requiring a minimum of 3 probes per segment and a maximum probe spacing of 1000 between adjacent probes before breaking a segment. The log-ratio thresholds for single copy gain and single copy loss were set at 0.2 and -0.23, respectively. The log-ratio thresholds for two or more copy gain and homozygous loss were set at 1.14 and -1.1, respectively. Upon loading of raw data files, signal intensities are normalized via division by mean. All samples are corrected for GC wave content using a systematic correction algorithm. Only the high-confidence copy number aberrations, i.e. high copy number gain or homozygous deletions, were marked as copy-number aberrant events. Finally, genes that harbour either a somatic-mutation aberrant event or a copy-number aberrant event were taken to be the final list of aberrant genes at the genomic level.
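The log-ratio thresholds above translate into a simple calling rule, sketched below. The four cutoffs are the ones stated in the text; the function names, the call labels and the choice of inclusive comparisons at the boundaries are assumptions, and the sketch does not reproduce the FASST2 segmentation itself.

```python
def call_segment(log_ratio):
    """Map a segment's log ratio to a copy-number call using the stated thresholds."""
    if log_ratio >= 1.14:
        return "high_gain"          # two or more copy gain
    if log_ratio >= 0.2:
        return "gain"               # single copy gain
    if log_ratio <= -1.1:
        return "homozygous_loss"
    if log_ratio <= -0.23:
        return "loss"               # single copy loss
    return "neutral"

def is_aberrant_event(log_ratio):
    """Only high-confidence calls count as copy-number aberrant events."""
    return call_segment(log_ratio) in {"high_gain", "homozygous_loss"}

# Toy log ratios (made up) and their calls.
example_calls = [call_segment(v) for v in (1.5, 0.5, 0.0, -0.5, -2.0)]
```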

Gene expression

We used microarray-based gene expression (Affymetrix HT Human Genome U133 Array Plate Set) (level 1) for the GBM and OV data sets, whereas for the BRCA and PRAD data sets, RNA-seq derived gene expression (level 3) was used. Gene expression profiles of normal and tumor phenotype were used as sample groups.

Gene fusions

Transcript fusions prediction calls for GBM, OV, BRCA and PRAD were obtained from TCGA Fusion

gene Data Portal (http://www.tumorfusions.org) [207]. The fusion partner genes were tagged for gene-

fusion alteration.

2.6.2 Interaction networks

We used STRING version 10 [168] protein-interaction network which contains high confidence func-

tional protein-protein interactions (PPI). Self-loops and interactions with missing HGNC symbols were

discarded and interaction scores were divided by 1000 to obtain percentage-like reliability score. Only

high confidence interactions with combined score of 0.9 or greater were selected. As a result we obtained

a network of 10971 nodes with 214298 interactions.
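The network filtering steps can be sketched as below, assuming the interactions have already been parsed into (gene, gene, combined score) tuples with HGNC symbols and scores on the 0-1000 scale. The function name, the input layout, the toy rows and the merging of duplicate undirected pairs are assumptions.

```python
def filter_string_edges(rows, min_score=0.9):
    """Keep high-confidence, non-self interactions with known symbols on both ends.

    rows: iterable of (gene_a, gene_b, combined_score), scores on a 0-1000 scale;
          a missing HGNC symbol is represented by None or an empty string.
    Returns dict (gene_a, gene_b) -> score in [0, 1], with each pair stored once.
    """
    kept = {}
    for a, b, score in rows:
        if not a or not b or a == b:        # missing symbol or self-loop
            continue
        s = score / 1000                    # rescale to a 0-1 reliability score
        if s >= min_score:
            key = tuple(sorted((a, b)))     # undirected: store each pair once
            kept[key] = max(s, kept.get(key, 0.0))
    return kept

# Toy rows (scores are made up for illustration).
toy_rows = [("TP53", "MDM2", 999), ("MDM2", "TP53", 999), ("TP53", "TP53", 950),
            ("AR", None, 980), ("AR", "FOXA1", 900), ("ERG", "PTEN", 899)]
toy_network = filter_string_edges(toy_rows)
```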

In the case of prostate cancer, we integrated STRING-10 protein-protein interaction network with

protein-DNA interaction network derived from ChIP-seq experiments for transcription factors highly

relevant to prostate cancer - REST, FOXA1, AR, EZH2 [150] and ERG [141] resulting in a new combined

network of 13517 nodes and 220190 interactions.

To run HIT’nDRIVE using different underlying networks, we used two additional interaction networks:

Human Protein Reference Database - Protein-Protein Interaction Database (HPRD-PPI) network

(version 9.0) [134] and REACTOME pathway database (version 2015) [55].

2.6.3 Validation dataset

For the validation of driver-modules we used the following gene-expression datasets: GBM: Murat-

2008 [122], Sun-2006 [164]; OV: Yoshihara-2009 [205], Bowen-2009 [20]; PRAD: Taylor-2010 [169],

Grasso-2012 [66], SMMU-PC [138]; BRCA: METABRIC [42] and Richardson-2006 [140].

2.6.4 Derivation of expression outlier genes

We used the generalized extreme studentized deviate (GESD) test [144] to obtain the outlier genes. Unlike

the Grubbs test and the Tietjen-Moore test, the GESD test only requires that an upper bound for the suspected

number of outliers be specified. Given the upper bound, r, the GESD test essentially performs r separate

tests: a test for one outlier, a test for two outliers, and so on up to r outliers.

2.6.5 Derivation of expression outlier gene weights

Outlier-gene weights were calculated as follows. Let i denote genes, j denote patients and $x_{ij}$ denote the gene-expression value of gene i in patient j. We first calculated the absolute value of the z-score, $z_{ij}$:

\[ z_{ij} = \frac{|x_{ij} - \mu_i|}{\sigma_i} \]

where $\mu_i$ and $\sigma_i$ denote the mean and standard deviation, respectively, of the expression value of gene i. Next, we performed Student’s t-test on the gene-expression values of the normal and tumor phenotypes and defined $\psi_i = -\log(p\text{-value}_{t\text{-test}})$. Finally, we calculated the outlier weight $\omega_{ij}$ as

\[ \omega_{ij} = \frac{\psi_i z_{ij}}{\sum_i \psi_i z_{ij}} \]
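The weight computation can be sketched as follows. The t-test p-values are taken as inputs computed elsewhere; the use of the population standard deviation and the natural logarithm are assumptions, since the text does not specify them, and the function name and toy values are hypothetical.

```python
from math import log
from statistics import mean, pstdev

def outlier_weights(expr, pvalues):
    """Per-patient outlier weights: psi_i * z_ij, normalized over genes for each patient.

    expr:    dict gene -> list of expression values (one per patient, same order).
    pvalues: dict gene -> tumor-vs-normal t-test p-value (computed elsewhere).
    Assumes every gene has non-zero variance and each patient has a non-zero total.
    """
    genes = sorted(expr)
    n_patients = len(expr[genes[0]])
    score = {}
    for g in genes:
        mu, sigma = mean(expr[g]), pstdev(expr[g])
        psi = -log(pvalues[g])
        score[g] = [psi * abs(x - mu) / sigma for x in expr[g]]
    weights = {g: [] for g in genes}
    for j in range(n_patients):
        total = sum(score[g][j] for g in genes)
        for g in genes:
            weights[g].append(score[g][j] / total)
    return weights

# Toy expression matrix: two genes, four patients (values are made up).
toy_expr = {"A": [1.0, 2.0, 3.0, 10.0], "B": [5.0, 5.5, 4.5, 5.0]}
toy_pvalues = {"A": 0.001, "B": 0.05}
toy_weights = outlier_weights(toy_expr, toy_pvalues)
```

In the toy data, gene B sits exactly at its mean in the last patient, so all of that patient's weight falls on gene A.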

2.6.6 Statistical significance of the overlap of driver genes with that of CGC database.

Suppose, for a cohort of cancer patients, we predict $n_{total}$ driver genes using HIT’nDRIVE, out of which $n_{cgc}$ driver genes are present in the CGC database (of known cancer driver genes). Let x be the total number of sequence-altered genes (i.e. all potential driver genes) and let y of these x sequence-altered genes be in CGC. This means that the probability that a randomly selected gene out of these sequence-altered genes happens to be a CGC gene is $y/x$.

The probability (p-value) that at least $n_{cgc}$ out of $n_{total}$ driver genes are identified in CGC is:

\[ p\text{-value} = \sum_{i=n_{cgc}}^{n_{total}} \binom{n_{total}}{i} \left(\frac{y}{x}\right)^{i} \left(1 - \frac{y}{x}\right)^{n_{total}-i} \]
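The cohort-level p-value is a binomial upper tail and can be evaluated directly, as in the sketch below. The function and argument names are hypothetical and simply mirror the symbols in the formula; the numbers in the example are made up.

```python
from math import comb

def binomial_tail(n_total, n_cgc, x, y):
    """P(at least n_cgc of n_total predicted drivers are CGC genes), success rate y / x.

    n_total: number of predicted driver genes.
    n_cgc:   number of predicted driver genes found in CGC.
    x:       number of sequence-altered genes (all potential drivers).
    y:       number of those x genes that are in CGC.
    """
    p = y / x
    return sum(comb(n_total, i) * p**i * (1 - p)**(n_total - i)
               for i in range(n_cgc, n_total + 1))

# Toy numbers: 2 predicted drivers, 1 of every 4 altered genes is in CGC.
p_both_cgc = binomial_tail(2, 2, 4, 1)       # both predicted drivers are in CGC
p_at_least_one = binomial_tail(2, 1, 4, 1)   # at least one predicted driver is in CGC
```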

Next we consider driver genes in each patient. We also calculated the p-value for HIT’nDRIVE to pick at least p CGC drivers out of p′ and pick at most q non-CGC drivers out of q′ as follows:

\[ p\text{-value} = \sum_{x=p'}^{p'+q'} \binom{p+q}{x} \left(\frac{p}{p+q}\right)^{x} \left(\frac{q}{p+q}\right)^{p'+q'-x} \]

Figure 2.1: Overview of HIT’nDRIVE algorithmic framework. (A) HIT’nDRIVE integrates genome and transcriptome data obtained from patients’ tumor samples. The red and blue colors represent genomic alterations and transcriptomic changes in tumor samples, respectively. The influence values derived from the protein interaction network indicate how likely a driver gene influences its downstream target genes in the network. (B) The predicted driver genes are used as seeds to discover modules of genes that discriminate between the sample phenotypes using OptDis. (C) Based on this, the driver modules are ranked and thus prioritized.

Figure 2.2: HIT’nDRIVE identified driver genes with respect to varying parameter values in 100 selected BRCA samples. (A-B) The number of driver genes identified by HIT’nDRIVE with respect to the varying values of (A) γ, and (B) α. (C) The number of driver genes identified by HIT’nDRIVE with respect to three outlier detection threshold values, across varying values of γ. (D) Proportion of HIT’nDRIVE detected driver genes obtained for an outlier threshold of 0.01 which are also detected when the outlier threshold is 0.05 and 0.1. (E-F) Driver genes predicted by HIT’nDRIVE in non-randomized data compared with the driver genes predicted using randomized data (i.e. by gene label swapping for 100 iterations): (E) altered genes and (F) outlier genes.

Figure 2.3: HIT’nDRIVE identified driver genes with respect to the underlying network used in 100 selected BRCA samples. (A) Venn diagram showing the overlap of nodes in the three different networks used: STRING v10 (only high-confidence interactions), HPRD v9.0, and REACTOME v2015. (B) Comparison between the number of nodes in the networks. (C) Comparison between the number of edges in the networks. (D) Comparison between the number of driver genes detected using different networks. (E) Proportion of common driver genes between the networks (STRING-REACTOME and HPRD-REACTOME) as compared to driver genes detected using the REACTOME network. (F) HIT’nDRIVE identified driver genes with respect to network perturbation. The edges of the STRING v10 network were perturbed to different extents (between 1-10%) while preserving the degree of the nodes in the network. The proportion of common driver genes between the unperturbed network and each of the perturbed networks was calculated.

Figure 2.4: Modified HIT’nDRIVE not required to prioritize at least one driver gene per patient. (A) Modified ILP formulation where we removed the constraint that ensured at least one driver gene is prioritized per patient. (B) HIT’nDRIVE simulation with different values of gamma (γ) parameter with the modified ILP formulation as given in A. Each line represents different values of alpha (α) parameter, which controls the fraction of total outliers to be covered. (C) We calculated the fraction of patients with no driver genes prioritized, for the same set of driver genes prioritized in B.

Figure 2.5: Likelihood of HIT’nDRIVE to capture CGC Genes. (A-D) Sequence-wise altered CGC genes prioritized by HIT’nDRIVE vs. that of non-CGC genes, for each patient sample, across four cancer types. Only CGC genes specific to a cancer type are considered here. Green: Cancer specific sequence-wise altered CGC genes prioritized by HIT’nDRIVE; Red: Cancer specific sequence-wise altered CGC genes NOT-prioritized by HIT’nDRIVE; Orange: Sequence-wise altered non-CGC genes prioritized by HIT’nDRIVE; Purple: Sequence-wise altered non-CGC genes NOT-prioritized by HIT’nDRIVE. The right panel depicts absolute numbers and the left panel depicts relative proportions. As can be seen, the likelihood of a sequence-wise altered CGC gene to be prioritized by HIT’nDRIVE is much higher than that of a non-CGC gene. (E) P-value distribution of the likelihood of HIT’nDRIVE to pick CGC genes.

Figure 2.6: Correlation of the number of driver genes predicted by HIT’nDRIVE with mutation rate and copy-number burden. (A) Correlation of mutation rate (frequency of somatic mutation per Mb) with copy-number burden (percentage of genome changed calculated using somatic copy number changes). Correlation of the number of driver genes predicted by HIT’nDRIVE with (B) mutation rate and (C) copy-number burden.

Figure 2.7: Phenotype classification using driver-seeded modules. (A) Phenotype (tumor vs normal) classification accuracy in gene-expression datasets of different cancer-types using three different methods - HIT’nDRIVE-unsupervised (left panel), HIT’nDRIVE-OptDis (middle panel) and DriverNet (right panel). (B) Comparison of HIT’nDRIVE with DriverNet.

Figure 2.8: Phenotype Classification using CGC Genes Seeded Modules. Phenotype classification accuracy of HIT’nDRIVE driver seeded modules vs Cancer Gene Census (CGC) genes seeded modules. (A) TCGA-PRAD gene-expression dataset with Tumor and Normal samples. (B) Subtype classification accuracy of HIT’nDRIVE identified driver seeded modules vs CGC BRCA driver seeded modules on TCGA-BRCA cohort with respect to four subtypes of breast cancer (Basal, Her2, Luminal-A and Luminal-B).

Chapter 3

Application of HIT’nDRIVE: patient-specific multi-driver gene prioritization for precision oncology

3.1 Introduction

To demonstrate the utility of HIT’nDRIVE, we analyzed over 2200 genomes and transcriptomes

(gene expression) of tumors from four major cancer types - glioblastoma, ovarian, breast and prostate

cancer from TCGA project. We present the driver genes obtained by HIT’nDRIVE on this dataset and

explore their functional properties. Many of the HIT’nDRIVE identified driver genes turn out to be

known drivers from the CGC database [59], demonstrating that it is possible to replicate the lengthy and

costly experimental approaches for detecting driver genes in common tumor types by HIT’nDRIVE -

in-silico, strongly supporting the biological relevance of HIT’nDRIVE’s algorithmic framework. This

observation increases our confidence in the calls made by HIT’nDRIVE in rarer tumor types for which

driver genes are mostly unknown. In fact, the initial results of the PanCancer Atlas project [12]

reveal that more than 20% of tumors do not have a single (genomically altered) driver gene from CGC.

3.2 Our Contributions

In this chapter, we used HIT’nDRIVE to identify both known as well as rare (and potentially novel)

patient-specific driver genes on large multi-omics data from different cancer types. We also demonstrate

that by using HIT’nDRIVE-identified driver genes and associated “network modules” (sub-networks

seeded by driver genes whose aggregate expression profiles correlate well with the cancer phenotype)

as features, it is possible to perform accurate phenotype classification - as additional evidence that

these genes are likely drivers of the cancer phenotype. We found a number of breast cancer subtype-

specific driver modules that are associated with patients’ survival outcome. Finally, we demonstrate that

HIT’nDRIVE-identified driver genes accurately predict drug efficacy in pan-cancer cell lines.

3.3 Results

3.3.1 HIT’nDRIVE predicts frequent as well as infrequent driver genes in multi-omics cancer datasets

We applied HIT’nDRIVE to prioritize driver genes in four major cancer types - Glioblastoma multiforme

(GBM) [175], Ovarian serous cystadenocarcinoma (OV) [176], Breast adenocarcinoma (BRCA) [177], and

Prostate adenocarcinoma (PRAD) [178] obtained from the TCGA data portal. Only samples with matched

genomic alterations (SNVs and/or CNAs and/or gene fusions) and transcriptomic changes (outlier genes

from gene-expression profile) were used in our study. We used the fusion prediction calls as reported in

the TCGA Fusion gene Data Portal [207].

In GBM, we obtained 48 unique candidate driver genes altered at varying frequencies across 258

GBM patients. EGFR (36%), TP53 (29.5%), PTEN (28%) and CHEK2 (26%) were the most fre-

quently altered driver genes in GBM followed by CDKN2A (16%), RB1 (13%), SEC61G (12%). Previ-

ous efforts in GBM genome characterization identified amplification in EGFR, PDGFRA, mutations in

CHEK2, TP53, PTEN, RB1, NF1 and deletions in CDKN2A to be associated with GBM [128, 175, 193].

HIT’nDRIVE prioritized all of the above alterations. Alterations in EGFR are characteristic of the classical

subtype, NF1 with mesenchymal subtype, PDGFRA and IDH1 with pro-neural subtype of GBM [193].

Fifteen out of 48 driver genes predicted by HIT’nDRIVE (p-value = 8 × 10⁻⁴) were present in the CGC

database [59], which contains genes for which mutations have been causally implicated in cancer (Figure 3.1A).

GSTT1 (deleted in 21 patients), a key player in drug metabolism, was neither found in CGC nor in Cata-

logue of Somatic Mutations in Cancer (COSMIC) [58] databases. Twelve GBM driver genes were found

to be actionable targets. Actionable genes were extracted from TARGET database [187], which contains

genes directly linked to a clinical action. In addition to the above list, 6 other driver genes were druggable

(Figure 3.1B). We extracted the list of druggable genes from Drug-Gene interaction database (DGIDB)

[70]. Interestingly, around 85% of the patients in the GBM cohort harbour at least one actionable driver gene

and a further 5% of patients have druggable targets (Figure 3.1C). HIT’nDRIVE also identified 12 infrequent

driver genes, which we define as genes altered in at most 2% of the cases. Among the infrequent

genes, SACS is known to be associated with neurological functions, NLRP3 is involved in apoptosis, and

TIAM2 is involved in invasion and metastasis.
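
The 2% cutoff used above to define infrequent driver genes amounts to a simple frequency filter over the per-patient driver calls. The following is a minimal sketch and not the thesis pipeline: the function name `infrequent_drivers`, the dictionary layout and the toy cohort are assumptions made purely for illustration.

```python
from collections import Counter

def infrequent_drivers(drivers_by_patient, max_fraction=0.02):
    """Return {gene: frequency} for driver genes altered in at most
    `max_fraction` of patients.

    drivers_by_patient: dict mapping a patient ID to the set of driver
    genes prioritized for that patient (hypothetical input layout).
    """
    n_patients = len(drivers_by_patient)
    counts = Counter(g for genes in drivers_by_patient.values() for g in set(genes))
    return {g: c / n_patients for g, c in counts.items() if c / n_patients <= max_fraction}

# Toy cohort of 100 patients (made-up calls): EGFR in 36, SACS in 2, NLRP3 in 3.
cohort = {f"P{i}": set() for i in range(100)}
for i in range(36):
    cohort[f"P{i}"].add("EGFR")
for i in (40, 41):
    cohort[f"P{i}"].add("SACS")
for i in (50, 51, 52):
    cohort[f"P{i}"].add("NLRP3")

rare = infrequent_drivers(cohort)  # only SACS (2%) passes the cutoff
```

In this toy cohort only the gene altered in exactly 2% of patients is retained; the gene altered in 3% is treated as a (moderately) frequent driver.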

The 526 OV patients harboured a total of 85 unique driver alterations. TP53 mutations were prevalent

in more than half (58%) of the patients in the cohort. Consistent with the previous findings, we

found OV patients to be driven by genomic copy-number changes rather than recurrent point muta-

tions [38, 129]. Recurrent somatic CNAs were observed in GSTT1 (32.3%), WWOX (28.1%), FAM49B

(15.0%), UGT2B17 (14.6%), CCNE1 (13.1%), SLC39A4 (13.1%) and MYC (12.5%). Mutations in

TP53, BRCA1/2 and loss of RB1, NF1 and CCNE1 were previously associated with OV [129, 176].

HIT’nDRIVE revealed 18 CGC driver genes (p-value = 2 × 10⁻⁵) (Figure 3.1A) among which 13 genes

were actionable targets and another 12 genes were at least druggable (Figure 3.1B). More than 75% of OV

patients harboured at least one actionable target and an additional 6% of patients have druggable targets

(Figure 3.1C). GSTT1 (altered in 170 patients in OV) is involved in estrogen and drug metabolism. It

was neither found in CGC nor in COSMIC databases. We identified 13 infrequent genes, among which

MAPK1 is known to play an important role in oncogenic pathways in cancer.

HIT’nDRIVE identified 40 driver genes across 333 PRAD patients. Copy number loss of SPECC1L

(23.7%), STEAP1B (13%), WWOX (10%) and amplification of NSD1 (16.2%), SIRPB1 (16.2%) were

the most recurrent events in PRAD patients. We also found recurrent somatic mutation in MUC4

(11%), SPOP (10.5%) and TP53 (10%). The most common alterations in PRAD genomes are fusions

of androgen-regulated promoters with ERG and other members of the ETS family of transcription factors,

mainly TMPRSS2-ERG fusions [180]. Since we relied on the gene fusion predictions obtained from

TCGA Fusion gene Data Portal [207] which analyzed only 178 (out of 333) patients, we observed ERG

gene fusion in only 5.7% of cases. The more recent TCGA publication [178] reported ERG fusions in

almost half of the patients in the cohort. Moreover, the two studies used different tools for gene fusion

detection, which also contributes to the much smaller number of ERG fusions we observed compared with

those reported previously. SPOP, TP53, FOXA1 and PTEN are the most frequently mutated genes which have

been previously associated with prostate cancer [13]. PRAD patients harboured 12 driver genes present

in CGC database (p-value = 9 × 10⁻⁴) (Figure 3.1A) out of which 8 driver genes were actionable (Figure

3.1B). Approximately a quarter of PRAD patients could benefit from actionable targeted therapy (Figure

3.1C). Moreover, an additional 14% of patients harboured druggable genes, which warrants deeper investigation

of drug repurposing opportunities. NBPF1 (mutated in 17 patients), a known tumor suppressor

gene with neural function that is also involved in cell-cycle arrest, was neither found in CGC nor

in COSMIC databases. We identified 11 infrequent genes in PRAD among which IDH1 mutant patients

were recently identified as a distinct molecular-subtype of PRAD [178], NKX3-1 is required for normal

prostate tissue development and CDKN1B was previously associated with PRAD.

In BRCA, HIT’nDRIVE identified 107 driver genes across 1090 patients. Somatic mutation of PIK3CA

(30.5%) and TP53 (30.2%) were the most recurrent events in BRCA. This was followed by somatic mu-

tation of CHD1 (11.2%), GATA3 (10.5%), MUC16 (6.9%), MAP3K1 (6.9%) and CNA amplification of

NSD1 (8.7%) and MED1 (6.9%). BRCA patients harboured 16 genes present in CGC database (p-value

= 9.3 × 10⁻³) (Figure 3.1A) among which 10 genes were actionable targets (Figure 3.1B). More than 60%

of BRCA patients could benefit from actionable targeted therapy. Furthermore, an additional 11% of

BRCA patients harboured at least one of the 19 potentially druggable genes (Figure 3.1C). ACACA (altered

in 36 patients mostly from HER2 subtype), involved in fatty-acid metabolism, was neither found

in CGC nor in COSMIC databases. We identified 46 infrequent driver genes among which BRCA2 and

GNAS have been previously linked to BRCA.

3.3.2 Network properties of cancer driver genes

Centrality of Driver Genes in the Interactome.

Cancer driver genes are known to occupy critical positions in the interactome. To check whether HIT’nDRIVE

predicted driver genes also occupy similar positions in the interaction network, we used the node degree

as a “local measure”, and node betweenness (the number of shortest paths between node pairs that pass

through the node) as a “global measure” of centrality. The driver genes predicted by HIT’nDRIVE in-

clude a number of well-known high-degree hubs - TP53, EGFR, RB1, MYC, PIK3CA, ERG, CHD1 that

are “central” in the interactome with high degree and high betweenness (Figure 3.2A). Although there

was very weak correlation between the number of edges (i.e. degree centrality) of a node and the number

of samples/patients in which it is identified as a driver, remarkably, each hub gene was typically altered in

a large fraction of patients. Because of their centrality, perturbations in hub genes are likely to dysregulate

several other genes and the associated signaling pathways. Interestingly, HIT’nDRIVE also identified

low-degree genes (IDH1, MTAP, NF1, NRG1, NSD1) that reside in the periphery of the interaction net-

work. In particular, in prostate cancer, there seems to be an inverse correlation between the degree and

how often the gene is picked as a driver. Most of these low-degree genes are altered in a small fraction of

patients, indicating that HIT’nDRIVE, unlike many other methods, does not primarily return hubs that

are altered in a large number of patients but is capable of identifying rare driver genes without trivial

topological biases.
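
The two centrality measures used in this section can be illustrated on a small undirected graph. The sketch below is only an illustration under stated assumptions: it uses Brandes' algorithm for unweighted betweenness (the thesis does not say which implementation was used), and the function name `degree_and_betweenness` and the five-node toy network are hypothetical.

```python
from collections import deque

def degree_and_betweenness(adj):
    """Degree and unnormalized betweenness for an undirected, unweighted graph.

    adj: dict mapping each node to the set of its neighbours.
    Betweenness follows Brandes' algorithm: one breadth-first search per
    source node, followed by back-propagation of pair dependencies.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    betweenness = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in non-decreasing distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:         # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    # Each unordered node pair was counted once from each endpoint.
    return degree, {v: b / 2.0 for v, b in betweenness.items()}

# Toy network: a hub with three neighbours and one peripheral node behind C.
network = {
    "HUB": {"A", "B", "C"},
    "A": {"HUB"},
    "B": {"HUB"},
    "C": {"HUB", "D"},
    "D": {"C"},
}
degree, betweenness = degree_and_betweenness(network)
```

In this toy network the hub has both the highest degree and the highest betweenness, while the peripheral node D has degree one and lies on no shortest path between other nodes, mirroring the hub versus periphery contrast discussed above.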

Influential nodes prioritized as cancer driver genes.

Next we examined the influential driver genes that are responsible for driving cancer. For this, we

computed the total outgoing influence from each altered gene (which has been chosen as a driver), defined

as the weighted sum of all influence values from the source to all outlier genes it is connected to (targets),

weighted by the corresponding outlier weights. First we investigated driver genes with high influence

values within each cancer type. We observed that on average the total influence of driver genes was

higher than that of other altered genes in all cancer types (Figure 3.2D). EGFR, PTEN, CHEK2, TP53

and CDKN2A were the most influential driver genes in GBM which together exerted 38.5% of the total

influence on the GBM patient cohort. In OV, TP53, GSTT1 and MYC together exerted 20% of the total

influence. Similarly, in PRAD cohort, SPOP, MUC4 and TP53 were the most influential genes exerting

23.7% of the total influence. PIK3CA, TP53 and CHD1 were the most influential genes exerting 23% of

the total influence on the BRCA patient cohort. Moreover, the gene influence was positively correlated

to its alteration frequency (Figure 3.2E).
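
The total outgoing influence described above is a weighted sum that is straightforward to compute once influence values and outlier weights are available. The sketch below assumes a nested-dictionary layout and uses made-up gene names and values; `total_outgoing_influence` and `influence_share` are hypothetical helper names, not functions from the HIT’nDRIVE software.

```python
def total_outgoing_influence(influence, outlier_weight):
    """Sum influence[driver][target] * outlier_weight[target] over the
    outlier genes (targets) that each driver is connected to."""
    return {
        driver: sum(value * outlier_weight[target]
                    for target, value in targets.items()
                    if target in outlier_weight)
        for driver, targets in influence.items()
    }

def influence_share(totals, genes):
    """Fraction of the cohort-wide total influence exerted by `genes`."""
    grand_total = sum(totals.values())
    return sum(totals[g] for g in genes) / grand_total

# Made-up influence values from two altered genes to their network targets.
influence = {
    "GENE_A": {"T1": 0.5, "T2": 0.25},
    "GENE_B": {"T2": 0.5, "T3": 1.0},
}
# Only T1 and T2 are expression outliers; T3 therefore contributes nothing.
outlier_weight = {"T1": 2.0, "T2": 4.0}

totals = total_outgoing_influence(influence, outlier_weight)
share_a = influence_share(totals, ["GENE_A"])
```

The share computed by `influence_share` corresponds to statements of the form "these genes together exerted a given percentage of the total influence" in the text above.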

We investigated influence of the predicted driver genes within individual patients. Many recurrently

altered driver genes had higher influence compared to other driver genes. For example, EGFR in GBM;

TP53 in OV; ERG in PRAD; TP53, PIK3CA and PTEN in BRCA.

Interestingly, among the highly influential genes there were also less-recurrent but functionally important

and actionable driver genes. For example, somatic mutations in ABCB1 made it an influential driver

gene in seven GBM patients (Figure 3.2F). ABCB1 is a membrane-bound protein present in the endothelial

cells of the blood-brain barrier. It harnesses the energy of ATP hydrolysis to drive the unidirectional

transport of exogenous and xenobiotic substances (drug compounds) from the cytoplasm to the extra-

cellular space. It is known to transport many anticancer compounds including Temozolomide (TMZ),

which is used as a first-line treatment for GBM patients. Mutations and over-expression of ABCB1 in

GBM have been associated with resistance to TMZ [107]. It was intriguing that some of these GBM

patients had undergone treatment prior to tissue collection and were initially mislabelled as untreated

patients. Treatment-induced selection pressure in the drug transporter might be a plausible reason for

high influence exerted by ABCB1.

Similarly, HIT’nDRIVE predicted BRAF as a driver gene in eight PRAD patients (6 somatic mutations

and 2 gene fusions) (Figure 3.2G). These patients harboured BRAF as a highly influential driver

gene. None of these patients harboured the BRAF V600E mutation that is prevalent in cutaneous melanomas,

thyroid cancer and many other cancer types. However, BRAF L597R can be targeted using MEK inhibitors

[21, 43]. BRAF plays important roles in growth factor signalling pathways, which affect cell division

and differentiation. These results serve as proof of concept that HIT’nDRIVE can prioritize functionally

relevant cancer driver genes.

3.3.3 Breast cancer subtype classification using driver modules.

Our next goal was to classify four major subtypes of breast cancer - Basal, HER2, Luminal-A and

Luminal-B. For that purpose, we performed binary classification for each subtype: e.g. Basal vs non-

Basal (including the normal samples). This was achieved through the use of HIT’nDRIVE-identified

driver genes from TCGA-BRCA as seed genes, with which we identified subtype-specific driver modules

from TCGA-BRCA gene-expression data (as described for tumor classification). We obtained

37, 16, 43 and 39 subtype-specific driver modules for the Basal, HER2, Luminal-A and Luminal-B

subtypes, respectively. As described above, using these subtype-specific driver modules as features, we performed

independent classification of BRCA subtypes in TCGA-BRCA, METABRIC-Cambridge and

METABRIC-Vancouver datasets [42].

The majority of Basal-like tumors are Triple-Negative Breast Cancers (TNBC), which are highly

aggressive tumors characterized by lack of expression of estrogen receptor 1 (ESR1), progesterone receptor

(PGR) and erb-b2 receptor tyrosine kinase 2 (ERBB2). Molecular mechanisms driving TNBC are

the least understood and hence no targeted therapies for TNBC exist yet [17]. Interestingly, HIT’nDRIVE

seeded driver modules were able to classify Basal-like tumors with much higher accuracy (98%) as com-

pared to other BRCA-subtypes - HER2 (94%), Luminal-A (85%) and Luminal-B (83%) (Figure 3.3A).

As expected, ESR1 and PGR were highly expressed in Luminal-A/B but not in Basal and HER2 subtypes.

Modules containing ESR1 were consistently down-regulated in Basal subtype and up-regulated in

Luminal-A/B subtype whereas module LUMB-03 was up-regulated in Luminal-B subtype. The ESR1

network neighbourhood included eleven known transcriptional targets of ESR1 (TFF1, PGR, SLC9A3R1,

GNAS, RARA, WWP1, WNT5A, TCF7L2, FKBP4, SPRY2, and RAD54B). These results were consistent

with previous findings [51]. ERBB2 was expressed only in 9 (of 16) HER2 modules and was the most

prominent hub in the large interactome of HER2 modules. All modules containing ERBB2 were up-regulated

in HER2 subtype and module expression patterns were consistent across different BRCA datasets.

PGR was present in 2 modules (BASAL-26 and HER2-12) both of which were down-regulated in Basal

subtype but up-regulated in Luminal-A/B. These results strongly suggest that HIT’nDRIVE can cap-

ture subtype-specific driver genes, and the driver-seeded modules we identified can indeed differentiate

BRCA subtypes.

3.3.4 Subtype-specific breast cancer driver modules are associated with survival outcome.

To test for association of subtype-specific driver modules with patient survival outcome, we developed

a risk-score defined as a linear combination of the normalized gene-expression values of the component

genes in the module weighted by their estimated univariate Cox proportional-hazard regression coeffi-

cients (see Methods). Based on the risk-score values, patients were stratified into low-risk (risk-score

< 33rd percentile) and high-risk (risk-score > 66th percentile) groups. Both the Cox regression coefficients of

each gene and the risk-score cutoff values for each module were estimated from the TCGA-BRCA cohort (training

dataset); these values were later applied to the METABRIC cohorts (test dataset). To assess whether the

risk-score assignment to high/low categories was valid, a log-rank test was performed for each module

in both training and test datasets.
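
The risk-score and the train-then-apply stratification described above can be sketched as follows. This is an illustration under assumptions rather than the code used in the thesis: the Cox coefficients are made-up numbers (in practice they would come from a survival-analysis package), the percentile is computed with linear interpolation because the thesis does not state a convention, and all function names are hypothetical.

```python
def risk_score(expression, beta):
    """S = sum over component genes i of beta_i * x_i for one patient."""
    return sum(beta[g] * expression[g] for g in beta)

def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def stratify(scores, low_cut, high_cut):
    """Label patients as low or high risk; intermediate patients are dropped."""
    groups = {}
    for patient, s in scores.items():
        if s < low_cut:
            groups[patient] = "low"
        elif s > high_cut:
            groups[patient] = "high"
    return groups

# Made-up Cox coefficients for a two-gene module, as if fitted on the training cohort.
beta = {"G1": 1.0, "G2": -0.5}
train = {f"T{i}": {"G1": float(i), "G2": 0.0} for i in range(1, 10)}  # scores 1..9
test = {
    "M1": {"G1": 2.0, "G2": 2.0},   # score 1.0
    "M2": {"G1": 5.0, "G2": 0.0},   # score 5.0
    "M3": {"G1": 9.0, "G2": 2.0},   # score 8.0
}

train_scores = {p: risk_score(x, beta) for p, x in train.items()}
low_cut = percentile(train_scores.values(), 33)
high_cut = percentile(train_scores.values(), 66)

# Coefficients and cutoffs learned on the training cohort are reused unchanged.
test_scores = {p: risk_score(x, beta) for p, x in test.items()}
test_groups = stratify(test_scores, low_cut, high_cut)
```

Keeping both the coefficients and the cutoffs fixed when moving to the test cohort is what makes the test-cohort log-rank comparison an independent check rather than a refit.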

We first compared driver-seeded modules against driver-gene-free modules that, according to Opt-

Dis, have the best discriminative score for the TCGA-BRCA dataset. For each module we calculated

three distinct indices: log-rank test p-value, Hazard Ratio (HR) and Concordance-index (C-INDEX). We

found driver-seeded modules to outperform driver-free modules on all three indices demonstrating that

the driver-seeded modules were better correlated with survival (Figure 3.3B). Motivated by this, we

identified the top modules for each of the BRCA subtypes which do well based on all three indices and

checked whether they can return meaningful results with respect to survival. We found 9 driver mod-

ules significantly associated with patients’ survival outcome (p-value < 0.01, hazard-ratio > 1.5 and

concordance-index > 0.5) in TCGA-BRCA cohort. These 9 modules were also significantly associated

with patient survival outcome (p-value < 0.01) in two additional cohorts (METABRIC cohorts). It is

interesting to note that two of these modules (BASAL-02 and HER2-01) were seeded by an oncogene

- nuclear receptor coactivator 3 (NCOA3) driver gene. NCOA3 driver module was the second-topmost

module (Figure 3.3C) to separate Basal from other subtypes and the top-most module to separate HER2

subtype. NCOA3 driver module was down-regulated in Basal subtype and associated with patients’ over-

all survival (Figure 3.3D-E). A fraction of breast (and ovarian) cancer patients are known to harbour

NCOA3 mutation, amplification or deletion [71]. NCOA3 alone cannot distinguish the basal subtype.

NCOA3 requires other component genes in the module (AR, XBP1, TFF1 and SPDEF) to collectively

distinguish the basal subtype which, to our knowledge, is a novel finding. However, the interactions

within the module are well known. NCOA3 is a coactivator of the steroid hormone receptors AR and ESR1,

and transcriptional target of XBP1 [71]. NCOA3 is known to stimulate many intracellular signaling

pathways that are critical for cancer proliferation and metastasis. The activity of NCOA3 is known to

be associated with reduced responsiveness to tamoxifen in patients [126]. SPDEF is associated with

regulation of AR activity [100].
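
Of the three indices used above to compare modules, the concordance index is the simplest to state in code. The sketch below implements Harrell's C-index for right-censored data under simplifying assumptions (pairs with tied survival times are skipped); the thesis does not specify which implementation was used, and the toy survival data are invented.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ordered correctly.

    A pair is comparable when the patient with the shorter follow-up time
    experienced the event. It is concordant when that patient also has the
    higher risk score; tied risk scores count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Invented data: follow-up time, event indicator (1 = death, 0 = censored), risk-score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why a concordance-index above 0.5 is used as one of the selection criteria for survival-associated modules.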

3.3.5 HIT’nDRIVE seeded driver genes accurately predict drug efficacy

Next, we obtained somatic mutation, copy number aberration and gene expression data of pan-cancer cell

lines from Genomics of Drug Sensitivity in Cancer (GDSC) project [80]. We used HIT’nDRIVE to iden-

tify driver genes of individual cancer cell lines. Following up on the premise by [80] that potential driver

genes (i.e. cancer genes, which include the CGC genes) alone could predict drug efficacy fairly well, the

predicted driver genes were used as seeds in the network (STRING v10) to identify sub-networks that

discriminate between the drug-response phenotypes (i.e. sensitive vs resistant cell lines). As available

in GDSC, 265 different drug treatments were tested on each cell line provided. We present results for

25 cancer types (the remaining 5 cancer types have only a very limited number of cell lines available,

too few for a statistically meaningful analysis, and thus have not been used).

Perhaps our most interesting result is that, for many drugs, the top HIT’nDRIVE predicted driver

module for cell lines of a specific cancer type (more specifically, OptDis modules seeded by HIT’nDRIVE

identified driver genes, prioritized with respect to drug efficacy) not only includes the drug target but

also the associated (downstream) signaling pathway. As importantly, we measured the accuracy of drug-

response phenotype classification using HIT’nDRIVE-OptDis for each drug-treatment in different cancer

types (Figure 3.4A). In most cancer types, HIT’nDRIVE-OptDis correctly predicted the response to more

than 25% of the drugs in 95% of the cell lines or more. Specifically, Stomach adenocarcinoma (STAD)

and Chronic Myelogenous Leukemia (LCML) are the cancer types with highest fraction of drugs pre-

dicted with an accuracy of 95% or more whereas Liver hepatocellular carcinoma (LIHC) and GBM are

the cancer types with the lowest fraction of drugs predicted with the same accuracy. Below we pro-

vide some of our observations on three well known/promising cancer drugs for which we obtained high

accuracy on cell lines of specific cancer types.

Gefitinib is a clinically approved (for patients with non-small cell lung cancer) protein kinase in-

hibitor which selectively inhibits EGFR. Interestingly, in BRCA, EGFR copy-number amplification or

overexpression primarily activates RAS-RAF-MAPK pathway and PI3K-AKT-mTOR pathway trigger-

ing response for cell proliferation, invasion and survival. Using HIT’nDRIVE, EGFR was found as a

driver gene of BRCA cell lines. Furthermore, EGFR seeded driver module was the second highest scor-

ing module to distinguish the drug-response phenotype increasing the classification accuracy to 98%

(Figure 3.4B,C).

Another example, Nutlin-3a is a promising pre-clinical stage compound which inhibits the interaction

between MDM2 and TP53 inducing apoptosis. MDM2 was predicted as a driver gene in OV cell lines

by HIT’nDRIVE. MDM2 seeded module was the top predictor (maximum accuracy 94%) of the drug-

response phenotype when treated with Nutlin-3a (Figure 3.4B,E). Our method predicted many other

interacting partners (both as seed or component genes in the module) of MDM2 and TP53 which are

known to play a critical role in TP53 pathway.

Finally, TMZ is a clinically approved first-line therapy for GBM. ABC transporters (including ABCB1)

help to transport TMZ from the cytoplasm of a cell to the extracellular space. TMZ methylates selective

nucleotides of DNA triggering DNA repair pathway. MGMT specifically removes the methyl groups

from the methylated nucleotides, thereby avoiding DNA strand breaks. MGMT was predicted as a component

gene in the third top-scoring module. Failure to repair DNA strand breaks triggers the DNA damage

response pathway further activating TP53 and apoptosis. Interestingly, TP53 was predicted as the seed

of the top scoring module by HIT’nDRIVE-OptDis. Furthermore, another gene in the DNA damage re-

sponse pathway, CDKN2A, seeds another top ranking module, which improves the overall classification

accuracy to 97% (Figure 3.4B,D). Note that both CDKN2A and TP53 are among the most frequently altered

genes in GBM.

3.4 Discussion

In this chapter we have demonstrated that (1) HIT’nDRIVE increases our ability to identify potential genomic

driver alterations. (2) HIT’nDRIVE prioritizes clinically actionable driver genes many of which

happen to be private drivers. This implies that it is possible to replicate the lengthy and costly ex-

perimental approaches for detecting driver genes in common tumor types by HIT’nDRIVE - in-silico,

strongly supporting the biological relevance of HIT’nDRIVE’s algorithmic framework. The fact that a

high proportion of HIT’nDRIVE prioritized drivers in well studied cancer types overlaps with known driver

genes increases our confidence in the calls made by HIT’nDRIVE in rarer tumor types for which driver

genes are mostly unknown. (3) HIT’nDRIVE prioritizes driver genes present in both the centre and

periphery of an interaction network. (4) Our analysis revealed that driver genes have higher collective

influence on the transcriptome than other altered genes. Some of these driver genes are central and nat-

urally have high influence, however there are also many non-central driver genes with high influence

over other genes in the network. (5) HIT’nDRIVE is especially suitable for identifying such non-central

driver genes or infrequent/private drivers. (6) HIT’nDRIVE can capture subtype specific driver genes

and such driver seeded modules can indeed differentiate between different subtypes of a cancer. (7)

We have demonstrated that subtype specific driver modules are also associated with patients’ survival

outcome providing additional evidence that these driver genes have clinical significance. (8) We also

demonstrated that HIT’nDRIVE seeded driver genes (more specifically, OptDis modules seeded by HIT’nDRIVE

identified driver genes, prioritized with respect to drug efficacy) not only include the drug target

but also the associated (downstream) signaling pathway. This provides us the possibility of identifying

and clinically targeting multiple genes (not necessarily sequence-wise altered but are nevertheless in the

module identified by HIT’nDRIVE) dysregulating critical oncogenic or metabolic pathways.

We also note that targeted therapeutics are being extensively used in clinical trials but the drug re-

sponse rate is very poor (only ∼5% of patients in clinical trials have good response to targeted thera-

peutics) [111, 135]. This is most likely because even if a cancer patient harbours an alteration for which

targeted therapeutics are available, we do not know if that alteration is responsible for driving the tumor

[16]. HIT’nDRIVE could potentially play a key role by prioritizing potential driver alterations from a vast

pool of passenger alterations. In our study, we have used drug efficacy data from pan-cancer cell lines

in order to demonstrate that the potential genomic drivers (more precisely driver gene seeded modules)

of the cell-lines can be used as features to predict drug-efficacy. Following a similar procedure in clinical

trials, we believe that the application of HIT’nDRIVE to predict drug efficacy would likely improve the

drug response rate.

HIT’nDRIVE predicted ABCB1 as the most influential driver gene in seven TCGA-GBM cases

that were treated with TMZ prior to tissue collection. Using GDSC dataset, we demonstrated that

HIT’nDRIVE-OptDis can predict mechanisms of drug sensitivity for TMZ and other drugs (Figure 3.4G-H).

Since ABCB1 was not mutated in any of the GBM cell lines in the analysis, it was not identified

as a driver gene of GBM cell lines. However, the top seed driver gene, TP53, is an interaction partner

of ABCB1 (in STRING v10 network). Other seed driver genes and their component genes in the module

that are direct interaction partners of ABCB1 are UBC, CAV1, WDTC1 and DNAH8. ABC transporters

(including ABCB1) help to transport TMZ from the cytoplasm of a cell to the extracellular space. On

the other hand, DNA damage caused by TMZ activates TP53 thereby dysregulating apoptotic pathways.

Thus, the presented analysis demonstrates that the downstream expression changes are, most likely, the

manifestation of the selection pressure in ABCB1 induced by TMZ treatment.

Protein-protein interaction (PPI) networks representing physical interactions now include thousands

of proteins and over a million (undirected) interactions between them. Regulatory networks on the other

hand represent gene/protein regulation occurring at multiple levels of biological systems through directed

links. Since available regulatory networks are very limited in size and scope, our study focuses on PPI

networks. However, HIT’nDRIVE can easily be applied to regulatory networks as they grow in size and

scope. In addition, the use of multi-hitting time as a distance measure between two or more driver genes

and a target gene enables HIT’nDRIVE to capture synthetic rescue like scenarios; this is ideally suited

for undirected PPI networks, but in principle can be extended to regulatory networks in the future.

HIT’nDRIVE is a driver gene prioritization tool that is flexible enough to incorporate different types

of -omics data. The principles underlying both RWFL and HIT’nDRIVE can be utilized to identify the causal

genes in other complex diseases that face problems analogous to those in cancer. Finally, we believe that applications

of the RWFL problem may extend beyond driver gene identification - to influence

analysis in social networks, disease networks and others.

3.5 Methods

3.5.1 Datasets and analysis

We used publicly available datasets of four major cancer-types GBM [175], OV [176], BRCA [177],

and PRAD [178] from TCGA project. Details can be found in Section 2.6.

3.5.2 Genomics of drug sensitivity in cancer

Somatic mutation, copy-number alteration, gene-expression, and drug screening data of cancer cell
lines were downloaded from the Genomics of Drug Sensitivity in Cancer (GDSC) [80] website,
http://www.cancerrxgene.org/downloads. The data were downloaded in August 2016.

3.5.3 Pathway enrichment analysis

The selected set of genes was tested for enrichment against gene sets of pathways present in the Molecular
Signatures Database (MSigDB) v5.0 [162]. A Fisher’s exact test based gene set enrichment analysis was
used for this purpose. A cut-off threshold of false discovery rate (FDR) ≤ 0.01 was used to obtain the
significantly enriched pathways. An R implementation of this test is available at
https://github.com/raunakms/GSEA-Fisher. The same procedure was used to assign biological function
to the gene-modules.
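The enrichment procedure described above can be illustrated with a minimal Python sketch. It is not the R implementation referenced in this section: the function names, the input layout (gene sets as Python sets), and the choice of the Benjamini-Hochberg procedure for FDR control are assumptions made for illustration only. The one-sided Fisher's exact test is computed as the upper tail of the hypergeometric distribution.

```python
from math import comb


def fisher_enrichment_pvalue(n_overlap, n_selected, n_pathway, n_universe):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of observing at least `n_overlap` pathway genes among
    `n_selected` genes drawn without replacement from `n_universe` genes,
    of which `n_pathway` belong to the pathway (hypergeometric upper tail).
    """
    upper = min(n_selected, n_pathway)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enriched_pathways(selected, pathways, universe, fdr=0.01):
    """Return (pathway, p-value, adjusted p-value) for pathways with FDR <= fdr.

    `selected` is the gene set of interest, `pathways` maps a pathway name
    to its member genes, and `universe` is the background gene set
    (all hypothetical inputs).
    """
    universe = set(universe)
    selected = set(selected) & universe
    names = sorted(pathways)
    pvals = []
    for name in names:
        members = set(pathways[name]) & universe
        overlap = len(selected & members)
        pvals.append(
            fisher_enrichment_pvalue(
                overlap, len(selected), len(members), len(universe)
            )
        )
    qvals = benjamini_hochberg(pvals)
    return [(n, p, q) for n, p, q in zip(names, pvals, qvals) if q <= fdr]
```

Restricting both the selected genes and the pathway members to a common background set is a design choice of this sketch; the size of that background directly affects the resulting p-values.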

3.5.4 Association of driver modules with patients’ survival outcome

To test for association of driver modules with patients’ survival outcome, we developed a risk-score based

on multi-gene (component genes of the module) expression. The risk-score (S) is defined as the
sum of the normalized gene-expression values of the component genes in the module, weighted by their
estimated univariate Cox proportional-hazard regression coefficients [15], as given in the equation below.

$$S = \sum_{i=1}^{k} \beta_i x_{ij}$$


Here i and j represent a gene and a patient respectively, $\beta_i$ is the Cox regression coefficient for gene
i, $x_{ij}$ is the normalized gene-expression of gene i in patient j, and k is the number of component genes
in a gene-module. The normalized gene-expression values were fitted against overall survival time with
living status as the censored event using univariate Cox proportional-hazard regression (Exact method).

Based on the risk-score values, patients were stratified into two groups: a low-risk group (patients with
S below the 33rd percentile of S) and a high-risk group (patients with S above the 66th percentile of S).
Patients falling in between (i.e. with S between the 33rd and 66th percentiles, inclusive) were excluded
from further analysis, as these patients form an intermediate-risk group and are likely to introduce noise
in the log-rank test.

Both the Cox regression coefficients of each gene and the risk-score cutoff values for each module were
estimated from the TCGA-BRCA cohort (training dataset); these values were then applied to the METABRIC
cohort (test dataset). To assess whether the risk-score assignment to high/low categories was valid, a
log-rank test was performed for each module in both training and test datasets.

Finally, to identify the driver modules that were robust enough to predict patients’
survival, we calculated the log-rank test p-value, hazard-ratio (HR) (Wald test), and concordance-index
(c-index) (Wald test).
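The risk-score computation and the percentile-based stratification described in this section can be sketched as follows. This is an illustrative sketch only: the Cox coefficients are assumed to have been estimated beforehand (for example with a survival analysis package), and the function names, the dictionary-based data layout, and the linear-interpolation percentile are assumptions rather than the implementation used in this work.

```python
def risk_scores(expression, beta):
    """Risk-score S for each patient: sum over module genes of beta_i * x_ij.

    `expression` maps patient -> {gene: normalized expression};
    `beta` maps gene -> univariate Cox coefficient (assumed precomputed).
    """
    return {
        patient: sum(beta[gene] * values[gene] for gene in beta)
        for patient, values in expression.items()
    }


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100] (assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def stratify(scores, low_cut, high_cut):
    """Assign 'low' or 'high' risk; intermediate-risk patients are dropped."""
    groups = {}
    for patient, score in scores.items():
        if score < low_cut:
            groups[patient] = "low"
        elif score > high_cut:
            groups[patient] = "high"
        # Scores between the two cutoffs (inclusive) are excluded.
    return groups


def train_cutoffs(train_scores):
    """Estimate the 33rd and 66th percentile cutoffs on the training cohort."""
    values = list(train_scores.values())
    return percentile(values, 33), percentile(values, 66)
```

In keeping with the training/test design described above, the cutoffs returned by `train_cutoffs` for the training cohort would be passed unchanged to `stratify` for the test cohort, rather than being re-estimated there.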


Figure 3.1: Summary of driver genes prioritized by HIT’nDRIVE. (A) Distribution of predicted driver genes in cancer gene databases. The CGC database contains genes for which mutations have been causally implicated in cancer. Genes curated in the CGC database represent likely drivers of cancer. COSMIC is a comprehensive database of somatic mutations that have been reported in different cancers. However, not every gene present in the COSMIC database represents a driver of cancer. (B) Distribution of driver genes in druggable gene databases. Actionable genes in cancer therapy were derived from the TARGET database. The list of druggable genes was extracted from the DGI database. (A-B) The numbers in the panel represent the number of genes in the respective categories. (C) Distribution of patient druggability. Patient druggability was assessed using information in the TARGET and DGI databases. The numbers in the panel represent the number of patients in the respective categories.


Figure 3.2: Network properties of driver genes. (A) The centrality of the predicted drivers in the STRING v10 network. The size of the circles is proportional to the alteration frequency of the driver gene. The color scale represents the total influence of the driver gene on the expression outliers. (B) Correlation between influence and centrality. Each dot represents a target node receiving a certain amount of influence from all source nodes in the network. A lowess regression line is represented in blue. (C) Correlation between incoming and outgoing influence of a node. Each dot represents a node in the network and the color scale represents its betweenness centrality. A linear regression line is represented in blue. (D) Boxplot of the total influence of driver genes predicted by HIT’nDRIVE on the expression outliers compared to that of other altered genes (genes not predicted as drivers). (E) Correlation between gene influence and its alteration frequency in the respective patient cohort. (F) Relative influence of driver genes in each patient in the GBM cohort with mutation in ABCB1. (G) Relative influence of driver genes in each patient in the PRAD cohort with mutation in BRAF. (All gene influence values have been multiplied by 10^5 before log transformation.)


Figure 3.3: BRCA subtype classification using driver modules. (A) Performance accuracy of classifying different subtypes of breast cancer using the activity-score of subtype specific driver modules as features in three distinct datasets. (B) Box plot comparing subtype specific driver-seeded modules and driver-free modules with respect to three distinct measures: log-rank test p-value, hazard-ratio (HR) and concordance-index (c-index). (C) A BRCA subtype specific driver module (BASAL-02) seeded by NCOA3 that distinguished the Basal subtype from the rest of the BRCA subtypes. (D) Activity-score of the BASAL-02 module across different BRCA subtypes. (E) Kaplan-Meier plot showing the significant association of the BASAL-02 module with patients’ clinical outcome in the three datasets considered.


Figure 3.4: Drug efficacy predicted by HIT’nDRIVE seeded driver genes. (A) Accuracy of drug-response phenotype classification for all 265 drugs used in the GDSC study across 25 cancer types (the remaining 5 cancer types for which only a very limited number of cell lines have been made available are statistically insignificant and thus have not been used). The classification accuracy for each drug on each cancer type is measured based on the collective use of at most 10 best discriminating modules, i.e. the accuracy is maximized across the range of 1 to 10 (best discriminating) modules. Note that many of the drugs were not tested on all cancer types; in fact, for the vast majority of cancer types only a handful of drugs were tested. (B) Classification accuracy of modules that distinguish the drug-response phenotypes after treatment with Gefitinib in BRCA cell-lines (top-panel), Temozolomide in GBM cell-lines (middle-panel), and Nutlin-3a in OV cell-lines (bottom-panel). Important genes identified in the modules and involved in the dysregulated signalling pathways have been highlighted. (C-E) The figures represent the dysregulated signalling pathways in the respective drug perturbation.


Chapter 4

Integrated multi-omics molecular subtyping predicts therapeutic vulnerability in malignant peritoneal mesothelioma

4.1 Introduction

Malignant mesothelioma is a rare but aggressive cancer that arises from the internal membranes lining
the pleura and the peritoneum. While the majority of mesotheliomas are pleural in origin, peritoneal
mesothelioma (PeM) accounts for approximately 10-20% of all mesothelioma cases. PeM emerges from
the mesothelial cells lining the peritoneal/abdominal cavities. The incidence rate of PeM is estimated

to be less than 0.5 per 100,000 with 400-800 cases reported annually in the United States of America

alone [172]. Occupational asbestos exposure is a significant risk factor in the development of pleural
mesothelioma (PM). However, epidemiological studies suggest that, unlike in PM, asbestos exposure plays
a far smaller role in the etiology of PeM tumors [172].

Mesothelioma is typically diagnosed in the advanced stages of the disease. A combination of
cytoreductive surgery (CRS) and hyperthermic intraperitoneal chemotherapy (HIPEC), sometimes followed by
normothermic intraperitoneal chemotherapy (NIPEC), has recently emerged as a first-line treatment for
PeM [163]. However, even with this regimen, complete cytoreduction is hard to achieve and death ensues
for most patients. Actionable molecular targets for PeM, critical for precision oncology, remain to be


defined. Immune checkpoint blockade therapy in PM has recently attracted much attention, given that 20-40%
of PM cases are reported to have an inflammatory phenotype [174]. Although clinical trials typically lump
PeM and PM together for immune checkpoint blockade [25–27, 56, 110], no study has yet provided a rationale
for why PeM should be considered for immunotherapy.

Studies investigating genetic abnormalities of PeM [5, 34, 83, 85, 99, 151, 158, 184] have revealed

recurrent copy-number losses of CDKN2A on 9p21, NF2 on 22q and BAP1 on 3p21. In addition, these

studies also reported recurrent mutations in BAP1, SETD2, and DDX3X. However, the downstream
consequences of these genomic alterations in PeM have not been investigated in great detail. Genomic
information alone is unlikely to successfully uncover candidate therapeutic targets if not analyzed in the context

of transcriptomes and proteomes.

In this study, we performed an integrated analysis of the genome, transcriptome, and proteome of 19

PeM tumors predominantly of epithelioid subtype.

4.2 Our Contributions

We present a first-in-field comprehensive integrative multi-omics analysis of a patient cohort of
treatment-naive PeM [156]. In a novel contribution, using HIT’nDRIVE, we identified PeM with BAP1 loss to form

a distinct molecular subtype characterized by distinct gene expression patterns of chromatin remodeling,

DNA repair pathways, and immune checkpoint receptor activation. We also demonstrate that this subtype

is correlated with inflammatory tumor microenvironment and thus a candidate for immune checkpoint

blockade therapies. Our findings reveal BAP1 to be a trackable prognostic and predictive biomarker

for PeM immunotherapy that refines PeM disease classification. This is significant because almost half

of PeM cases are now candidates for these therapies. BAP1 stratification may improve drug response

rates in ongoing phase-I and II clinical trials exploring the use of immune checkpoint blockade therapies

in PeM in which BAP1 status is not considered. This integrated molecular characterization provides a

comprehensive foundation for improved management of a subset of PeM patients.

Another novel and significant contribution is that we resolved the large discordance between
mRNA and protein expression patterns in the PeM cohort. Most of this discordance is attributed to chromatin
remodeling genes and proteins linked to multimeric protein complexes, the majority of which are direct
protein-interaction partners of BAP1. The discordance between the mRNA and the protein expression
patterns is most likely due to the ubiquitination and degradation of proteins in these BAP1 regulated
complexes to maintain functional stoichiometry.


4.3 Results

4.3.1 Patient Cohort description

We assembled a cohort of 19 tumors from 18 patients (here we refer to it as VPC-PeM) undergoing

CRS at Vancouver General Hospital (Vancouver, Canada), Mount Sinai Hospital (Toronto, Canada), and

Moores Cancer Centre (San Diego, California, USA). We obtained 19 fresh-frozen primary
treatment-naive PeM tumors and adjacent benign tissues or whole blood from the 18 cancer patients. For one patient,
MESO-18, two tumors from distinct sites were available. Immunohistochemical staining of tissues using
different biomarkers was evaluated by two independent pathologists. Both pathologists categorized all
19 tumors as epithelioid PeM with tumor cellularity higher than 75%. To the best of our
knowledge this is the largest cohort of PeM subjected to an integrative multi-omics analysis.

4.3.2 Landscape of somatic mutations in PeM

To investigate the heterogeneity of somatic gene mutations in VPC-PeM, we performed high-coverage

exome sequencing (Ion Proton Hi-Q) of 19 tumors and 16 matched normal samples. We achieved a mean
coverage of 180x for cancerous samples and 96x for non-cancerous samples, with 43-77% of
targeted bases having a coverage of at least 100x. We identified 346 unique non-silent mutations (313 of which

were not previously reported in COSMIC [58]) affecting 202 unique genes. We observed an average of

0.015 protein-coding non-silent mutations per Mb per tumor sample. Patient MESO-18 had the highest
mutation burden (0.04 mutations per Mb) and MESO-11 had the lowest (0.001 mutations per Mb).
The non-silent mutation burden in PeM is low compared to other adult cancers including

many abdominal cancers (Figure 4.1A), with the exception of prostate adenocarcinomas (PRAD), kid-

ney chromophobe carcinomas (KICH), and testicular germ cell tumors (TGCT). Notably, the mutation

burden in PeM was fairly similar to that in PM as well as in pancreatic adenocarcinomas (PAAD). We also
assessed the mutational processes that contribute to alterations in tumors. Analysis of base-level transitions
and transversions at mutated sites showed that C>T transitions were predominant in PeM (Figure 4.1B).
Using the software deconstructSigs [143], we found that mutational signatures 1, 5, 12, and 6 were
operative in PeM. Interestingly, signature 1 was often correlated with age at diagnosis, and signature 6 was
associated with DNA mismatch repair and mostly found in microsatellite-unstable tumors [7].

We first identified driver genes of PeM using our recently developed algorithm HIT’nDRIVE [155].

Briefly, HIT’nDRIVE measures the potential impact of genomic aberrations on changes in the global

expression of other genes/proteins which are in close proximity in a gene/protein-interaction network. It

then prioritizes those aberrations with the highest impact as cancer driver genes. HIT’nDRIVE priori-


tized 25 unique driver genes in 15 PeM tumors for which matched genome and transcriptome data were

available (Figure 4.1C). Six genes (BAP1, BZW2, ABCA7, TP53, ARID2, and FMN2) were prioritized as

drivers, each harboring single nucleotide changes.

The mutation landscape of PeM was found to be highly heterogeneous. The nuclear deubiquitinase

BAP1 was the most frequently mutated gene (5 out of 19 tumors) in PeM tumors. BAP1 is a tumor-

suppressor gene known to be involved in chromatin remodeling, DNA double-strand break repair, and

regulation of transcription of many other genes [19]. Previous studies have also reported BAP1 as the

most frequently mutated gene in both PeM [5, 85] and PM [19, 24]. The BAP1 missense mutation in

MESO-18A/E resulted in a single amino-acid (AA) change in the ubiquitin carboxyl hydrolase domain

keeping the rest of the amino acid chain intact. In MESO-06 and MESO-09, a BAP1 frameshift deletion

resulted in a premature stop codon and chain termination. In MESO-09 approximately 91% of BAP1

amino acid chains were intact, but in MESO-06 only 2% of BAP1 amino acid chains were intact. We also

observed a BAP1 germline mutation in only one case (MESO-09). In three (15%) tumors, we identified

a recurrent R396I mutation in ZNF678 - a zinc finger protein containing zinc-coordinating DNA binding

domains involved in transcriptional regulation. We compared the mutated genes in our VPC-PeM cohort
with publicly available datasets [1, 5, 24] of both PeM and PM. BAP1 was the only mutated gene
common to the three PeM cohorts. Twenty-five genes, including the tumor suppressors LATS1 and TP53
and the chromatin modifier SETD2, were common to at least two PeM cohorts. Many mutated genes
in VPC-PeM were also previously reported in PM. BAP1 and SETD2 were the two mutated genes
common to VPC-PeM and all four PM cohorts.

4.3.3 Copy number landscape in PeM

To investigate the somatic CNA profiles of PeM, we derived copy-number calls from exome sequencing

data using the software Nexus Copy Number Discovery Edition Version 8.0. The aggregate CNA profile

of PeM tumors is shown in Figure 4.2A-B. We observed a total of 1,281 CNA events across all samples.

On average, 10% of the protein-coding genome was altered per PeM tumor. MESO-14 had the
highest CNA burden (42%) whereas MESO-11 had the lowest (0.01%). Interestingly, mutation burden and
CNA burden in PeM were strongly correlated (R = 0.74).

We also compared the CNA burden in protein-coding regions of the VPC-PeM cohort with different

adult cancers from TCGA project. Similar to the mutation burden, VPC-PeM tumors were observed

at the lower end of the pan-cancer CNA burden spectrum. Only UCEC, PRAD, and PAAD tumors

had lower median CNA burden as compared to PeM tumors (Figure 4.2C). CNA status and mRNA

expression for around half of the genes were positively correlated (R ≥ 0.1) and 16% of the genes had


strong correlation (R ≥ 0.5). To identify cancer genes, we compared aberrations in protein-coding genes
with data from the CGC. Intriguingly, CNA status and mRNA expression for the majority of CGC genes
were positively correlated.

To identify recurrent focal CNAs in PeM tumors, we used the GISTIC [115] algorithm which yielded

5 regions of focal deletions (q < 0.05) including in 3p21 and 22q13 which are characteristic of malig-

nant mesotheliomas (Figure 4.2D). Furthermore, GISTIC prioritized 8 regions of focal amplification (q

< 0.05) which included genes such as IGH, VEGFD, BRD9, FOXL1, EGFR, and PDGFA (Figure 4.2D).

Copy-number status of these genes was also significantly correlated with their respective mRNA expres-

sion. Chromosome 1 was the most aberrant region in PeM and chromosomes 13 and 18 were relatively

unchanged except for MESO-14 (Figure 4.2B).

Using HIT’nDRIVE, we identified genes in chromosome 3p21, BAP1, PBRM1, and SETD2, as key

driver genes of PeM (Figure 4.1C). Chromosome 3p21 was deleted in almost half of the tumors (8

of 19) in the cohort. Here, we call tumors with 3p21 (or BAP1) loss as BAP1del and the rest of the

tumors with 3p21 (or BAP1) copy-number intact as BAP1intact. Interestingly, BAP1 mRNA transcripts

in BAP1del tumors were expressed at lower levels as compared to those in BAP1intact tumors (Wilcoxon

signed-rank test p-value = 3 × 10^-4) (Figure 4.2E). We validated this using immunohistochemical (IHC)
staining, which demonstrated a lack of BAP1 nuclear staining in the tumors with BAP1 homozygous deletion
(Figure 4.2F). Tumors with BAP1 heterozygous loss still displayed BAP1 nuclear staining. We observed
three BAP1 mutated cases among BAP1intact tumors. BAP1 mRNA transcripts in these three tumors
were expressed at high levels. As mentioned in the previous section, the mutation analysis also predicted

that despite mutation in BAP1 in these three tumors, the entire BAP1 amino-acid chain is still intact

and may be functionally active. Furthermore, we found DNA copy loss of 3p21 locus to include four

concomitantly deleted cancer genes - BAP1, SETD2, SMARCC1, and PBRM1, consistent with [208].

Copy-number status of these four genes was significantly correlated with their corresponding mRNA

expression, suggesting that the allelic loss of these genes is associated with decreased transcript levels.

These four genes are chromatin modifiers, and PBRM1 and SMARCC1 are part of SWI/SNF complex

that regulates transcription of a number of genes.

CNA status of genes associated with key cancer pathways was observed to be different between the

PeM subtypes (i.e. BAP1del and BAP1intact). We observed many genes involved in chromatin remodeling,

SWI/SNF complex and DNA repair pathway to be deleted in BAP1del tumors as compared to BAP1intact

tumors (Figure 4.1C). In contrast, we found copy-number gain of many genes in DNA repair
pathways (BRCA2, ATM, MGMT, and RAD51) and the cell cycle (MYC, CDK5, CCNB1, and CCND1) in the

BAP1intact tumors. Furthermore, PeM tumors (both BAP1del and BAP1intact) harbored CNA events in car-

cinogenic pathways such as MAPK, PI3K, MTOR, Wnt, and Hippo pathways. Interestingly, ESR1 copy


number deletion was enriched in BAP1del tumors while co-amplification of EGFR and BRAF was present
in three BAP1intact tumors. Notably, we identified copy-number loss of the tumor suppressors LATS1/2 and
copy-number gain of NF2 in one case, both of which have been previously associated with mesotheliomas
[24], in BAP1del tumors. Both LATS1/2 and NF2 are key regulators of the Hippo pathway [105].

Unsupervised consensus clustering of tumor samples based on copy-number segmentation mean

values of the 3349 most variable genes identified four tumor sub-groups (Figure 4.2G). We observed

that BAP1del and BAP1intact tumors were grouped into distinct clusters. This indicates that BAP1del

tumors have distinct copy-number profiles from those of BAP1intact tumors. We identified 692 genes

(p-value < 0.01, Kruskal-Wallis test) with significantly different copy-number segments between the
clusters. These genes were mapped to eight distinct chromosome loci: 19p, 6q, 1q, and 13q were
mostly gained in clusters 1 and 3, whereas losses of Xq, 22q, and 7p were mostly in clusters 1 and 3.

4.3.4 Gene fusions in PeM

To identify the presence of gene fusions, we analyzed RNA-seq data in 15 PeM tumors using the deFuse algorithm

[114]. Overall, 82 unique gene fusion events were identified using our filtering criteria (see Methods),

out of which we successfully validated 18 gene fusions using Sanger sequencing. We observed more

gene fusion events in BAP1del tumors as compared to that in BAP1intact tumors (Figure 4.3A-B).

Notably, BAP1, SETD2, PBRM1, and KANSL1 were prioritized as driver genes by HIT’nDRIVE on
the basis of gene fusions. Fusions in these genes were mostly found in the BAP1del subtype. MTG1-SCART1
was the most recurrent gene fusion, observed in 7 cases. MTG1 regulates the mitochondrial ribosome, which
synthesizes proteins essential for oxidative phosphorylation. SCART1 is a pseudogene predicted to act

as a co-receptor of certain T-cells. This was followed by GKAP1-KIF27 and KANSL1-ARL17B (Figure

4.3C) each of which was identified in 6 different cases. Three unique fusions were present in PBRM1, 2

in KANSL1, and 1 each in BAP1 and SETD2 all of which are involved in chromatin remodeling process

(Figure 4.1C and 4.3D-F).

4.3.5 The global transcriptome and proteome profile of PeM

To segregate transcriptional subtypes of PeM, we performed total RNA-seq (Illumina HiSeq 4000) and
quantification of 15 PeM tumor samples for which RNA was available (RNA for the remaining four tumor
samples did not pass the quality control checks). We first performed principal-component analyses and

unsupervised consensus clustering of all PeM tumors to determine transcriptomic patterns using genes

based on variance among tumor specimens. Consensus clustering revealed two distinct transcriptome

sub-groups (Figure 4.4A). We found BAP1intact and BAP1del have some distinct transcriptomic patterns;


however, a few samples showed an overlapping pattern.

We performed mass spectrometry (Fusion Orbitrap LC/MS/MS) with isobaric tagging for expressed
peptide identification and corresponding protein quantification, using Proteome Discoverer as the
processing pipeline, for 16 PeM tumors and 7 matched normal tissues (matched normal samples for the
remaining tumors were not available). We identified 8242 unique proteins in the 23 samples analyzed.
Surprisingly, BAP1 protein was not detected in our MS experiment, likely due to inherent
technical limitations with these samples and/or processing; quality control analysis of in-solution HeLa
digests also showed very low BAP1, with only a single peptide observed in occasional runs. First, we
analyzed global matched mRNA-protein expression correlation. Although 58% (4715 of 8109) of proteins
showed positive mRNA-protein correlation (Pearson correlation; R ≥ 0.1), only 22.7% (1839) of the
proteins were strongly correlated with their corresponding mRNA (R ≥ 0.5). Expression of 2.4% (194)
of proteins was strongly negatively correlated with their corresponding mRNA (R ≤ -0.5). To analyze the

proteomic pattern across PeM tumors, we performed principal-component analyses and unsupervised

consensus clustering following the same procedure as described above for the transcriptome. Unlike in

transcriptome profiles, the proteome profiles of BAP1 PeM tumor sub-types did not group into distinct

clusters (Figure 4.4B).
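The per-gene mRNA-protein correlation summary reported above (R ≥ 0.1, R ≥ 0.5, and R ≤ -0.5) can be illustrated with a short Python sketch. The data layout (one list of matched sample values per gene), the function names, and the handling of constant profiles are assumptions made for illustration; this is not the pipeline used in this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns None for constant input."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def summarize_mrna_protein_correlation(mrna, protein):
    """Count genes by mRNA-protein correlation class.

    `mrna` and `protein` map gene -> expression values over the same
    ordered set of matched samples (hypothetical layout). Strongly
    positive genes are also counted as positive, as in the text.
    """
    counts = {"total": 0, "positive": 0, "strong_positive": 0, "strong_negative": 0}
    for gene in sorted(set(mrna) & set(protein)):
        r = pearson(mrna[gene], protein[gene])
        if r is None:
            continue  # skip genes with no variance in either profile
        counts["total"] += 1
        if r >= 0.1:
            counts["positive"] += 1
        if r >= 0.5:
            counts["strong_positive"] += 1
        if r <= -0.5:
            counts["strong_negative"] += 1
    return counts
```

Dividing each count by `counts["total"]` gives percentages of the kind quoted in the paragraph above.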

To identify differentially expressed genes (DEG) between BAP1intact and BAP1del, we performed the
Wilcoxon signed-rank test using mRNA and protein expression data independently. We identified 1520
and 466 DEG (with p-value < 0.05) using mRNA and protein expression data respectively. However,
only 53 genes were common to the two sets of DEG. As expected, BAP1, PBRM1,
SMARCA4, and SMARCD3 were among the top-500 DEG. Many other important cancer-related genes were

differentially expressed such as CDK20, HIST1H4F, ERCC1, APOBEC3A, CDK11A, CSPG4, TGFB1,

IL6, LAG3, and ATM.

4.3.6 Transcriptional and post-transcriptional mechanisms regulate chromatin remodeling protein-complexes

Next, we aimed to study the extent to which changes in the copy number profile of a gene affect its
corresponding protein expression. For this, we calculated Pearson correlations between CNA and mRNA
expression and between CNA and protein expression. While the copy number profiles of genes, on average,
agree well with their corresponding mRNA expression, a number of detected proteins had poor correlation
with their respective gene’s copy number profile. Approximately 25% (1871 of 7462) of proteins were observed
to have poor correlation with their gene’s copy number; we define these here as “attenuated proteins”

(Methods, Figure 4.4C). Among the attenuated proteins, we identified important chromatin remodeling


proteins - PBRM1, SETD2, and SMARCC1. The attenuated proteins also included cancer genes such

as NF2, EGFR, APC, PIK3CA, and MAP3K4. We observed that the attenuated proteins were significantly
enriched with direct protein-protein interaction partners of UBC (hypergeometric test p-value:
10^-5), BAP1 (10^-3), and PBRM1 (10^-2) in the STRING v10 interaction network. Notably, geneset
enrichment analysis revealed that attenuated proteins are more likely to form a part of a multimeric complex

or bind to macromolecules (Figure 4.4D). These results corroborate previous findings from studies an-

alyzing breast, ovarian and colorectal cancer datasets [62]. These attenuated proteins were found to be

involved in mRNA processing, DNA repair pathway, cell cycle regulation, the immune system, and in

carbohydrate and lipid metabolism. Strikingly, we found that DEG between the PeM subtypes are
significantly associated with protein attenuation (Chi-squared test p-value: 10^-4 using mRNA expression
DEG, 10^-6 using protein expression DEG). These findings suggest that the effects of CNA are attenuated

at the protein level via post-transcriptional modification.

To identify large protein complexes that contain attenuated proteins and are variable (i.e. at
least one protein subunit of the complex is differentially expressed) between PeM subtypes, we leveraged a

manually curated set of core protein complexes from the CORUM database [145]. These included many

protein complexes involved in DNA conformation modification, DNA repair, transcriptional regulation,

post-translational modification including ubiquitination. Using our data, we observed that the majority

of the protein complexes were highly co-regulated at the protein level rather than at the mRNA level.

Notably, we identified the SWI/SNF (BAF and PBAF) and HDAC complexes, which were highly co-regulated

(Figure 4.4E-G). We found copy-number deletion in many subunits of SWI/SNF complex, mostly in

the BAP1del subtype (Figure 4.1C). About one quarter of proteins in the BAF complex and half of
proteins in PBAF were attenuated. PBRM1 was both attenuated at the protein level and differentially
expressed between PeM subtypes. SMARCB1 and SMARCA4 were also differentially expressed
between PeM subtypes in this complex (Figure 4.4H). We further identified a number of HDAC complex
components as highly co-regulated. The complex consisted of histone deacetylases (HDAC1/2), which
regulate expression of a number of genes through chromatin remodeling. About one-third of protein

subunits in the complex were attenuated at the protein level. More importantly, HDAC1, CHD4 and

ZMYM2 were differentially expressed between PeM subtypes in the protein complex, and different fam-

ily members of HDAC protein family were highly expressed in the BAP1del subtype (Figure 4.4I). This

indicates potential use of HDAC inhibitors to suppress the tumor growth in the BAP1del subtype. We

note that both SWI/SNF and HDAC complexes interact with BAP1. Expression patterns of many subunits
of these complexes were either highly correlated or highly anti-correlated with BAP1 expression (Figure

4.4E-G). Although mRNA transcripts are transcribed proportional to the changes in copy-number profile

of the gene, the corresponding proteins are often stabilized when in complex, and free proteins in excess


are usually ubiquitinated and targeted for proteasomal degradation to maintain stoichiometry [62].

4.3.7 BAP1del subtype is characterized by distinct expression patterns of genes involved in DNA repair pathway, and immune checkpoint receptor activation

To identify the pathways dysregulated by the DEG between the PeM subtypes, we performed hyper-

geometric test based geneset enrichment analysis (Methods) using the REACTOME pathway database.

Intriguingly, we observed high concordance between pathways dysregulated by the two sets (mRNA

and protein expression data) of top-500 DEG (Figure 4.5A-B). The unsupervised clustering of path-

ways revealed two distinct clusters for BAP1del and BAP1intact tumors. This indicates that the enriched

pathways, between the patient groups, are also differentially expressed. BAP1del patients demonstrated

elevated levels of RNA and protein metabolism as compared to BAP1intact patients. Many genes
involved in chromatin remodeling and DNA damage repair were differentially expressed between the groups.
Our data suggest that BAP1del tumors have repressed DNA damage response pathways. Most
importantly, protein expression data revealed that PARP1 is highly expressed in BAP1del tumors as compared
to BAP1intact tumors, suggesting PARP1 inhibition as a potential strategy for BAP1del tumors. Genes involved in

cell-cycle and apoptotic pathways were observed to be highly expressed in BAP1del patients. Further-

more, glucose and fatty-acid metabolism pathways were repressed in BAP1del as compared to BAP1intact.

More interestingly, we observed a striking difference in immune-system associated pathways between

the PeM subtypes. Whereas BAP1del patients demonstrated strong activity of cytokine signaling and the
innate immune system, the MHC-I/II antigen presentation system and the adaptive immune system were active
in BAP1intact patients.

Prompted by this finding, we next analyzed whether PeM tumors were infiltrated with leukocytes. To

assess the extent of leukocyte infiltration, we computed an expression (RNA-seq and protein) based score

using the immune-cell and stromal markers proposed by [206]. We discovered that the immune marker

gene score was strongly correlated with stromal marker gene score (Methods and Figure 4.5C-D). Using

CIBERSORT [124] software, we computationally estimated leukocyte representation in the bulk tumor
transcriptome. We observed massive infiltration of T cells in the majority of the PeM tumors (Figure
4.5E). A subset of PeM tumors had massive infiltration of B cells in addition to T cells. Interestingly,
when we grouped the PeM tumors by their BAP1 aberration status, there was a marked difference in the
proportion of infiltrated plasma cells, natural killer (NK) cells, mast cells, T cells and B cells between
the groups. Whereas the proportions of plasma cells, NK cells and B cells were lower in the BAP1del
tumors, there was more infiltration of mast cells and T cells in BAP1del tumors as compared to
BAP1intact tumors. We performed tissue microarray (TMA) IHC staining with CD3 and CD8 antibodies

58

on PeM tumors. We observed that BAP1del PeM tumors were positively stained for both CD3 and CD8

confirming infiltration of T cells in BAP1del PeM tumors (Figure 4.5F). Combined, this strongly indicates

that leukocytes from the tumor-microenvironment infiltrates the PeM tumor.

Finally, we surveyed the PeM tumors for expression of genes involved in immune checkpoint path-

ways. A number of immune checkpoint receptors were highly expressed in BAP1del tumors relative to

BAP1intact tumors. These included CD274 (PD-L1), CD80, CTLA4, LAG3, and ICOS (Figure 4.5G) for

which inhibitors are either clinically approved or are at varying stages of clinical trials. Gene expression

of these immune checkpoint receptors was highly correlated with the immune score (Figure 4.5H).

Moreover, a number of MHC genes, immuno-inhibitor genes as well as immuno-stimulator genes were

differentially expressed between BAP1del and BAP1intact tumors. Furthermore, we analyzed whether the

immune checkpoint receptors were differentially expressed in tumors with and without 3p21 loss in PM

tumors from TCGA. Unlike in PeM, we did not observe a significant difference in immune checkpoint

receptor expression between the PM tumor groups (i.e. BAP1del and BAP1intact). These findings suggest

that BAP1del PeM tumors could potentially be targeted with immune-checkpoint inhibitors while PM

tumors may be less likely to respond.

4.4 Discussion

In this study, we present a comprehensive integrative multi-omics analysis of malignant peritoneal mesotheliomas.

Even though this is a rare disease we managed to amass a cohort of 19 tumors. Prior studies of

mesotheliomas, performed using a single omic platform, have established loss of function mutation or

copy-number loss of BAP1 as a key driver event in both PeM and PM. Our novel contribution to PeM is

that we provide evidence from integrative multi-omics analyses that BAP1 copy number loss (BAP1del)

forms a distinct molecular subtype of PeM. This subtype of PeM is characterized by distinct expression

patterns of genes involved in chromatin remodeling, DNA repair pathway, and immune checkpoint activation.

Moreover, the BAP1del subtype has an inflammatory tumor microenvironment. Our results suggest

that BAP1del tumors might be prioritized for immune checkpoint blockade therapies. Thus BAP1 may

be both a prognostic and predictive biomarker for PeM enabling better disease classification and patient

treatment.

Structural alterations in PeM tumors were found to be highly heterogeneous, and occur at a lower rate

as compared to most other adult solid cancers. The majority of SNVs and CNAs were typically unique

to a patient. However, many of these alterations were non-randomly distributed to critical carcinogenic

pathways. We observed many alterations in genes involved in chromatin remodeling, SWI/SNF complex,

cell cycle and DNA repair pathway. SWI/SNF complex is an ATP-dependent chromatin remodeling

complex known to harbor aberrations in almost one-fifth of all human cancers [84]. Our results show that

SWI/SNF complex is differentially expressed between PeM subtypes which further regulates oncogenic

and tumor suppressive pathways. Notably, we also identified another chromatin remodeling complex,

the HDAC complex, which is differentially expressed between PeM subtypes. HDAC, known to be regulated

by BAP1, is a potential therapeutic target for the BAP1del PeM subtype. Recent in-vitro experiments

demonstrated BAP1 loss altered sensitivity of PM as well as uveal melanoma (UM) cells to HDAC

inhibition [95, 146].

Loss of BAP1 is known to alter chromatin architecture exposing the DNA to damage, and also im-

pairing the DNA-repair machinery [81, 210]. Similar to BRCA1/2 deficient breast and ovarian cancers,

BAP1 deficient PeM tumors most likely depend on PARP1 for survival. This rationale can be utilized to

test PARP inhibitors in BAP1del PeM subtype. The DNA repair defects thus drive genomic instability and

dysregulate tumor microenvironment [121]. DNA repair deficiency leads to the increased secretion of cy-

tokines, including interferons that promote tumor-antigen presentation, and trigger recruitment of both T

and B lymphocytes to destroy tumor cells. As a response, tumor cells evade this immune-surveillance by

increased expression of immune checkpoint receptors. The results presented here also indicate that PeM

tumors are infiltrated with immune-cells from the tumor microenvironment. Moreover, the BAP1del sub-

type displays elevated levels of immune checkpoint receptor expression which strongly suggests the use

of immune checkpoint inhibitors to treat this subtype of PeM. However, in a small subset of PM tumors

in the TCGA dataset, the loss of BAP1 did not elevate expression of immune checkpoint marker genes. This

warrants further investigation on the characteristics of these groups of PM tumors. Furthermore, recently,

BAP1 loss has been defined as a distinct molecular subtype of clear cell renal cell carcinoma (ccRCC)

and UM [33, 132, 142]. These studies showed that, similar to BAP1del PeM subtype, BAP1del tumors

from both ccRCC and UM also have dysregulated chromatin modifiers, impaired DNA repair pathway,

and immune checkpoint receptor activation. More recent studies in ccRCC [116] and melanoma [127]

demonstrated that inactivation of PBRM1 (or PBAF complex) predicts response to immune checkpoint

blocking therapies. Similarly, DNA repair defects have also been shown to be predictive of response to

immune checkpoint blocking therapies [60, 97, 98]. This strongly indicates a pan-cancer mechanism of

oncogenesis shared among tumors with BAP1 copy-number loss.

The main challenge in mesothelioma treatment is that all current efforts made towards testing new

therapy options are limited to using therapies that have been proven successful in other cancer types,

without a good knowledge of underlying molecular mechanisms of the disease. As a result of sheer

desperation, some patients have been treated even though no targeted therapy for mesothelioma has

been proven effective as yet. For example, a number of clinical trials exploring the use of immune

checkpoint inhibitors (anti-PD1/PD-L1 or anti-CTLA4) in PM and/or PeM patients who progressed under

chemotherapy and are positive for immune checkpoint markers are currently in progress. The results

of the first few clinical trials report either a very low response rate or no benefit to the patients [9, 25, 26,

110]. Notably, BAP1 copy-number or mutation status was not assessed in these studies. We believe that

response rates for immune checkpoint blockade therapies in clinical trials for PeM will improve when

patients are segregated by their BAP1 copy-number status.

4.5 Methods

4.5.1 Clinical samples and pathology evaluation

Primary untreated PeM tumors and matched benign samples were obtained from cancer patients under-

going cytoreductive surgeries following protocols approved by the Clinical Research Ethics Board of the

Vancouver General Hospital (Vancouver, BC, Canada), Mount Sinai Hospital (Toronto, ON, Canada),

and Moores Cancer Centre (San Diego, CA, USA). This study was approved by the Institutional Review

Board of the University of British Columbia and Vancouver Coastal Health (REB No. H15-00902 and

V15-00902). All patients signed a formal consent form approved by the respective institutional ethics

board. Histologic parameters and pathological scoring of tumors confirming PeM were established by

three independent pathologists. H&E and immunostained Formalin-Fixed Paraffin-Embedded (FFPE)

slides were reviewed by at least two specialized pathologists to diagnose PeM and its subtype. Hema-

toxylin and eosin (H&E) staining was used to determine the highest tumor cellularity (≥ 75%) from

sections for sequencing. The surgical resections were snap frozen and processed at respective institu-

tions. The tumors have a companion normal tissue specimen (either adjacent normal tissue or peripheral

blood previously extracted for germline DNA control). Each tumor specimen was approximately 1 cm³ in

size and weighed between 100-300 mg. Specimens were shipped overnight on dry ice that maintained an

average temperature of less than −80 °C. Upon receipt, the tissues were sectioned into 5 slices for DNA,

RNA, and protein extraction as well as construction of TMA.

4.5.2 Construction of tissue microarrays (TMAs)

FFPE tissue blocks were retrieved from the archives of the Department of Pathology, Vancouver General

Hospital (Vancouver, Canada). H&E stained slides from each block were reviewed by two pathologists

to identify tumor areas. TMAs were constructed with 1 mm diameter tissue cores from representative

tumor areas from FFPE blocks. Cores were transferred to a paraffin block using a semi-automated tissue

array instrument (Pathology Devices TMArrayer, San Diego, CA). Duplicate tissue cores were taken

from each specimen, resulting in a composite TMA block. Reactive mesothelial tissues from pleura

were also included as benign controls. Following construction, 4µm thick sections were cut for H&E

and immunohistochemical staining.

4.5.3 Immunohistochemistry and Histopathology

Freshly cut TMA sections were analyzed for immunoexpression using Ventana Discovery Ultra au-

tostainer (Ventana Medical Systems, Tucson, Arizona). In brief, tissue sections were incubated in Tris-

EDTA buffer (CC1) at 37 °C to retrieve antigenicity, followed by incubation with respective primary antibodies

at room temperature or 37 °C for 60-120 min. For primary antibodies, mouse monoclonal antibodies

against CD8 (Leica, NCL-L-CD8-4B11, 1:100), CK5/Cytokeratin 5 (Abcam, ab17130, 1:100), BAP1

(SantaCruz, clone C4, sc-28383, 1:50), rabbit monoclonal antibody against CD3 (Abcam, ab16669,

1:100), and rabbit polyclonal antibodies against CALB2/Calretinin (LifeSpan BioSciences, LS-B4220,

1:20 dilution) were used. Bound primary antibodies were incubated with Ventana Ultra HRP kit or Ven-

tana universal secondary antibody and visualized using Ventana ChromoMap or DAB Map detection kit,

respectively. All stained slides were digitalized with the SL801 autoloader and Leica SCN400 scanning

system (Leica Microsystems; Concord, Ontario, Canada) at magnification equivalent to x20. The im-

ages were subsequently stored in the SlidePath digital imaging hub (DIH; Leica Microsystems) of the

Vancouver Prostate Centre. Representative tissue cores were manually identified by two pathologists.

4.5.4 Whole exome sequencing

DNA was isolated from snap-frozen tumors with 0.2 mg/mL Proteinase K (Roche) in cell lysis solution

using Wizard Genomic DNA Purification Kit (Promega Corporation, USA). Digestion was carried out

overnight at 55 °C before incubation with RNase solution at 37 °C for 30 minutes and treatment with protein

precipitation solution followed by isopropanol precipitation of the DNA. The amount of DNA was

quantified on the NanoDrop 1000 Spectrophotometer, and an additional quality check was done by reviewing

the 260/280 ratios. Quality checks were also done on the extracted DNA by running the samples on a 0.8%

agarose/TBE gel with ethidium bromide.

For Ion AmpliSeq™ Exome Sequencing, 100 ng of DNA based on Qubit® dsDNA HS Assay (Thermo

Fisher Scientific) quantitation was used as input for Ion AmpliSeq™ Exome RDY Library Preparation.

This is a Polymerase Chain Reaction (PCR) based sequencing approach using 294,000 primer pairs (am-

plicon size range 225-275 bp), and covers >97% of Consensus CDS (CCDS; Release 12), >19,000

coding genes and >198,000 coding exons. Libraries were prepared, quantified by Quantitative Poly-

merase Chain Reaction (QPCR) and sequenced according to the manufacturer’s instructions (Thermo

Fisher Scientific). Samples were sequenced on the Ion Proton System using the Ion PI™ Hi-Q™ Sequencing

200 Kit and Ion PI™ v3 chip. Two libraries were run per chip for a projected coverage of

40M reads per sample.

4.5.5 Somatic variant calling

Torrent Server (Thermo Fisher Scientific) was used for signal processing, base calling, read alignment,

and generation of results files. Specifically, following sequencing, reads were mapped against the hu-

man reference genome hg19 using Torrent Mapping Alignment Program. The mean target coverage

ranged from 78.62 to 226.44; thus, the sequencing depth ranged from 78× to 226×. Variants were identified

by using Torrent Variant Caller plugin with the optimized parameters for AmpliSeq exome-sequencing

recommended by Thermo Fisher. The Variant Calling Format (VCF) files from all samples were combined

using GATK (3.2-2) [47] and all variants were annotated using ANNOVAR [197]. Only non-silent

exonic variants including non-synonymous SNVs, stop-codon gain SNVs, stop-codon loss SNVs, splice

site SNVs and In-Dels in coding regions were kept if they were supported by more than 10 reads and

had allele frequency higher than 10%. To obtain somatic variants, we filtered against dbSNP build 138

(non-flagged only) and the matched adjacent benign or blood samples sequenced in this study. Puta-

tive variants were manually scrutinized on the Binary Alignment Map (BAM) files through Integrative

Genomics Viewer (IGV) version 2.3.25 [179].
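
The read-support, allele-frequency and germline filters described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline used in this study: the `Variant` record, its field names, the effect labels and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass

# Non-silent exonic classes retained in the text; the label strings are illustrative.
NON_SILENT = {"nonsynonymous_SNV", "stopgain", "stoploss", "splicing", "coding_indel"}


@dataclass(frozen=True)
class Variant:
    key: str            # hypothetical identifier, e.g. "chr:pos:ref>alt"
    effect: str         # annotation class of the variant
    alt_reads: int      # number of reads supporting the variant
    allele_freq: float  # variant allele frequency in the range 0..1


def somatic_candidates(tumor, normal_keys, dbsnp_keys, min_reads=10, min_af=0.10):
    """Keep non-silent variants supported by more than `min_reads` reads with
    allele frequency above `min_af`, absent from dbSNP and the matched normal."""
    return [
        v for v in tumor
        if v.effect in NON_SILENT
        and v.alt_reads > min_reads
        and v.allele_freq > min_af
        and v.key not in dbsnp_keys
        and v.key not in normal_keys
    ]
```

Variants that survive such a filter would still require manual review of the alignments, as described above.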

4.5.6 Copy number aberration (CNA) calls

Copy number changes were assessed using Nexus Copy Number Discovery Edition Version 8.0 (BioDis-

covery, Inc., El Segundo, CA). Nexus NGS functionality (BAM ng CGH) with the FASST2 Segmentation

algorithm was used to make copy number calls (a Circular Binary Segmentation/Hidden Markov Model

approach). The significance threshold for segmentation was set at 5 × 10⁻⁶, also requiring a minimum of

3 probes per segment and a maximum probe spacing of 1000 between adjacent probes before breaking a

segment. The log ratio thresholds for single copy gain and single copy loss were set at +0.2 and −0.2,

respectively. The log ratio thresholds for gain of 2 or more copies and for a homozygous loss were set

at +0.6 and −1.0, respectively. Tumor sample BAM files were processed with corresponding normal

tissue BAM files. Reference reads per CN point (window size) was set at 8000. Probes were normalized

to median. Relative copy number profiles from exome sequencing data were determined by normalizing

tumor exome coverage to values from whole blood controls. The germline exome sequences were used to

obtain allele-specific copy number profiles and to generate segmented copy number profiles. The GISTIC

module on Nexus identifies significantly amplified or deleted regions across the genome. The amplitude

of each aberration is assigned a G-score as well as a frequency of occurrence for multiple samples. False

Discovery Rate q-values for the aberrant regions have a threshold of 0.15. For each significant region, a

“peak region” is identified, which is the part of the aberrant region with greatest amplitude and frequency

of alteration. In addition, a “wide peak” is determined using a leave-one-out algorithm to allow for er-

rors in the boundaries in a single sample. The “wide peak” boundaries are more robust for identifying

the most likely gene targets in the region. Each significantly aberrant region is also tested to determine

whether it results primarily from broad events (longer than half a chromosome arm), focal events, or

significant levels of both. The GISTIC module reports the genomic locations and calculated q-values for

the aberrant regions. It identifies the samples that exhibit each significant amplification or deletion, and

it lists genes found in each “wide peak” region.
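
As an illustration of the log-ratio thresholds stated in this section, a segment could be assigned a copy-number call as sketched below. The function name and the choice to treat the thresholds as inclusive are assumptions; the commercial software applies these thresholds internally and its exact boundary handling is not described here.

```python
def classify_log_ratio(log_ratio, gain=0.2, high_gain=0.6,
                       loss=-0.2, homozygous_loss=-1.0):
    """Map a segment's log ratio to a copy-number call.

    Default thresholds follow the values stated in the text; treating them
    as inclusive boundaries is an assumption of this sketch.
    """
    if log_ratio >= high_gain:
        return "high_gain"        # gain of 2 or more copies
    if log_ratio >= gain:
        return "gain"             # single-copy gain
    if log_ratio <= homozygous_loss:
        return "homozygous_loss"  # loss of both copies
    if log_ratio <= loss:
        return "loss"             # single-copy loss
    return "neutral"
```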

4.5.7 Transcriptome sequencing (RNA-seq)

Total RNA from 100µm sections of snap-frozen tissue was isolated using the mirVana Isolation Kit from

Ambion (AM-1560). Strand specific RNA sequencing was performed on quality controlled high RIN

value (>7) RNA samples (Bioanalyzer Agilent Technologies) before processing at the high throughput

sequencing facility core at BGI Genomics Co., Ltd. (The Children’s Hospital of Philadelphia, Penn-

sylvania, USA). In brief, 200ng of total RNA was first treated to remove the ribosomal RNA (rRNA)

and then purified using the Agencourt RNA Clean XP Kit (Beckman Coulter) prior to analysis with the

Agilent RNA 6000 Pico Chip to confirm rRNA removal. Next, the rRNA-depleted RNA was fragmented

and converted to cDNA. Subsequent steps include end repair, addition of an ‘A’ overhang at the 3’ end,

ligation of the indexing-specific adaptor, followed by purification with Agencourt Ampure XP beads.

The strand specific RNA library prepared using TruSeq (Illumina Catalogue No. RS-122-2201) was

amplified and purified with Ampure XP beads. Size and yield of the barcoded libraries were assessed

on the LabChip GX (Caliper), with an expected distribution around 260 base pairs. Concentration of

each library was measured with real-time PCR. Pools of indexed library were then prepared for cluster

generation and PE100 sequencing on Illumina HiSeq 4000.

4.5.8 Transcriptome (RNA-seq) quantification

Using the splice-aware aligner STAR (2.3.1z) [50], RNA-seq reads (~200 MB in size) were aligned onto

the human genome reference (GRCh38) and exon-exon junctions, according to the known gene model

annotation from the Ensembl release 80 (http://www.ensembl.org). Apart from protein coding genes,

non-coding RNA types and pseudogenes are further annotated and classified. Based on the alignment

and by using gene annotation (Ensembl release 80), gene expression profiles were calculated. Only reads

unique to one gene and which corresponded exactly to one gene structure, were assigned to the corre-

sponding genes by using the python tool HTSeq [11]. Normalization of read counts was conducted by R

package DESeq [10], which was designed for gene expression analysis of RNA-seq data across different

samples.

4.5.9 Identification of fusion transcripts and validation

We used the deFuse algorithm [114] to predict rearrangements in RNA sequence libraries. The deFuse

fusion transcript prediction calls were further filtered using the following criteria. A fusion gene candidate

(1) must be predicted to have arisen from genome rearrangement, rather than via a readthrough event; (2)

must be predicted in no more than two sequence libraries; (3) must map unambiguously on both sides

of the predicted breakpoints (that is, no multi-mapping reads); (4) must not map entirely to repetitive

elements; (5) must be detected in >5 reads (either split or spanning); and (6) must have at least one of the

fusion partner transcripts expressed.
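
The six filtering criteria can be expressed as a single predicate, sketched below. The `FusionCall` record and its field names are hypothetical and do not correspond to the actual deFuse output columns; the sketch only shows how the criteria combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionCall:
    name: str                  # hypothetical label, e.g. "GENE1-GENE2"
    read_through: bool         # criterion 1: predicted readthrough event
    n_libraries: int           # criterion 2: libraries in which it is predicted
    multi_mapping: bool        # criterion 3: ambiguous mapping at a breakpoint
    entirely_repetitive: bool  # criterion 4: maps entirely to repeats
    supporting_reads: int      # criterion 5: split plus spanning reads
    partners_expressed: int    # criterion 6: number of expressed partner transcripts


def passes_filters(call):
    """Return True when a fusion candidate satisfies all six criteria."""
    return (
        not call.read_through
        and call.n_libraries <= 2
        and not call.multi_mapping
        and not call.entirely_repetitive
        and call.supporting_reads > 5
        and call.partners_expressed >= 1
    )


def filter_fusions(calls):
    """Keep only the candidates that pass every criterion."""
    return [c for c in calls if passes_filters(c)]
```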

Prioritized putative gene fusions were verified by designing PCR primers around the predicted fusion

sites. Specifically, Reverse Transcription PCR (RT-PCR) was used to amplify the predicted fusion gene

junctions from the same starting RNA material (100ng) as was used for RNA-seq. Two primers (20-22

bp nucleotides) spanning the exon boundary of fused genes were designed using Primer3 (v. 0.4.0) [186].

PCR was performed in 20µl reactions using Q5 buffer (NEB), 0.2mM dNTPs, 0.4 µM each primer, 0.12

units Q5 High-Fidelity DNA Polymerase (NEB) and 2 µl of the RT reaction. The PCR reaction was

carried out with the following program: 95 °C for 30 seconds, followed by 30 cycles of 95 °C for 10 seconds,

57 °C for 20 seconds and 72 °C for 10 seconds. Resulting PCR products, ranging in size from 150 bp

to 250 bp, were purified using AMPure beads (Agencourt) and sequenced using Sanger sequencing to

verify fusion junctions.

4.5.10 Proteomics analysis using mass spectrometry

Fresh frozen samples dissected from tumor and adjacent normal were individually lysed in 50mM of

HEPES pH 8.5, 1% SDS, and the chromatin content was degraded with benzonase. The tumor lysates

were sonicated (Bioruptor Pico, Diagenode, New Jersey, USA), and disulfide bonds were reduced with

DTT and capped with iodoacetamide. Proteins were cleaned up using the SP3 method [78, 79] (Single

Pot, Solid Phase, Sample Prep), then digested overnight with trypsin in HEPES pH 8; peptide concentration

was determined by Nanodrop (Thermo) and adjusted to an equal level. A pooled internal standard control

was generated comprising equal volumes of every sample (10 µl of each of the 100 µl total digests)

and split into 3 equal aliquots. The labeling reactions were run as three TMT 10-plex panels (9+IS), then

desalted and each panel divided into 48 fractions by reverse phase HPLC at pH 10 with an Agilent 1100

LC system. The 48 fractions were concatenated into 12 superfractions per panel by pooling every 4th

fraction eluted, resulting in a total of 36 samples.

These samples were analyzed with an Orbitrap Fusion Tribrid Mass Spectrometer (Thermo Fisher

Scientific) coupled to EasyNanoLC 1000 using a data-dependent method with synchronous precursor

selection MS3 scanning for TMT tags. A short description follows; a more detailed overview is in [79].

Briefly, an in-house packed reverse phase column run with a 2 hour low pH acetonitrile gradient (5-40%

with 0.1% formic acid) was used to separate and introduce peptides into the MS. Survey scans covering

m/z 350-1500 were acquired in profile mode at a resolution of 120,000 (at m/z 200) with S-Lens RF

Level of 60%, a maximum fill time of 50 milliseconds, and Automatic Gain Control (AGC) target of

4 × 10⁵. For MS2, monoisotopic precursor selection was enabled with triggering charge state limited to 2-5,

threshold 5 × 10³ and 10 ppm dynamic exclusion for 60 seconds. Centroided MS2 scans were acquired

in the ion trap in Rapid mode after CID fragmentation with a maximum fill time of 20 milliseconds and

1 m/z quadrupole isolation window, collision energy of 30%, activation Q of 0.25, injection for

all available parallelizable time turned ON, and an AGC target value of 1 × 10⁴. For MS3, fragment ions

were isolated from a 400-1200 m/z precursor range, ion exclusion of 20 m/z low and 5 m/z high, isobaric

tag loss exclusion for TMT, with a top 10 precursor selection. Acquisition was in profile mode with the

Orbitrap after HCD fragmentation (NCE 60%) with a maximum fill time of 90 milliseconds, 50,000 m/z

resolution, 120-750 m/z scan range, an AGC target value of 1 × 10⁵, and injection for all available parallelizable time turned ON.

The total allowable cycle time was set to 4 seconds.

4.5.11 Peptide identification and protein quantification

Qualitative and quantitative proteomics analysis was done using ProteomeDiscoverer 2.1.1.21 (Thermo

Fisher Scientific). To maintain consistency with transcriptome annotation, we used Ensembl GRCh38.87

human reference proteome sequence database for proteome annotation. Sequest HT 1.3 was used for

Peptide Spectral Matches (PSM), with parameters specified as trypsin enzyme, two missed cleavages

allowed, minimum peptide length of 6, precursor mass tolerance 10 ppm, and a fragment mass toler-

ance of 0.6 Da. We allowed up to 4 variable modifications per peptide from the following categories:

acetylation at protein terminus, methionine oxidation, and TMT label at N-terminal residues and the side

chains of lysine residues. In addition, carbamidomethylation of cysteine was set as a fixed modification.

PSM results were filtered using q-value cut off of 0.05 to control for FDR determined by Percolator.

Identified peptides from both high and medium-confidence level after FDR-filtering were included in

the final stage to provide protein identification and quantification results. Reporter ions from MS3 scans

were quantified with an integration tolerance of 20ppm with the most confident centroid. Proteins were

further filtered to include only those found with minimum one peak in all samples. Proteome Discoverer

processed data was exported for further statistical analysis.

4.5.12 Mutational signature analysis

We used deconstructSigs [143], a multiple regression approach, to statistically quantify the contribution

of each mutational signature to each tumor. The mutational signatures were obtained from the COSMIC

mutational signature database [8]. Both silent and non-silent somatic mutations were used together

to obtain the mutational signatures. Only mutational signatures with a weight more than 0.06 were

considered for analysis.

4.5.13 Prioritization of driver genes using HIT’nDRIVE

Non-silent somatic mutation calls, CNA gain or loss, and gene-fusion calls were collapsed into a gene-patient

alteration matrix with binary labels. Gene-expression values were used to derive a gene-patient expression-outlier

matrix using the GESD test. The STRING ver10 [167] protein-interaction network was used to

compute pairwise influence values between the nodes in the interaction network. We integrated these

genome and transcriptome data using the HIT’nDRIVE algorithm [155]. The following parameters were used:

α=0.9, β=0.6, and γ=0.8. We used IBM-CPLEX as the ILP solver.
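
The first step, collapsing heterogeneous alteration calls into a binary gene-patient matrix, can be sketched as below. This is an illustrative sketch only: the function name, the event tuples and the patient identifiers are hypothetical, and it does not reproduce the HIT’nDRIVE implementation or the outlier and network steps.

```python
def alteration_matrix(events):
    """Collapse alteration calls into a binary gene-by-patient matrix.

    `events` is a list of (patient, gene, kind) tuples, where `kind` may be a
    mutation, a CNA gain or loss, or a fusion. Any alteration of a gene in a
    patient sets the corresponding entry to 1, regardless of its kind.
    """
    patients = sorted({patient for patient, _, _ in events})
    genes = sorted({gene for _, gene, _ in events})
    altered = {(patient, gene) for patient, gene, _ in events}
    matrix = [
        [1 if (patient, gene) in altered else 0 for patient in patients]
        for gene in genes
    ]
    return genes, patients, matrix
```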

4.5.14 Consensus clustering

We used ConsensusClusterPlus [199] R-package to perform consensus clustering. We used the following

parameters: maximum cluster number to evaluate: 10, number of subsamples: 10000, proportion of items

to sample: 0.8, proportion of features to sample: 1, cluster algorithm: hierarchical, distance: pearson.

4.5.15 Protein attenuation analysis

For every gene/protein profiled for CNA (segment mean), RNA-seq (normalized log2 expression), and

MS (normalized log2 expression), we performed the following analysis. For every gene/protein, the

Pearson correlation coefficients were calculated for CNA-mRNA expression ($R_{\mathrm{CNA:mRNA}}$) and CNA-protein

expression ($R_{\mathrm{CNA:protein}}$). The 75th percentile of the difference between the above two correlation

coefficients, i.e. $R_{\mathrm{diff}} = R_{\mathrm{CNA:mRNA}} - R_{\mathrm{CNA:protein}}$, was found to be approximately 0.45. Therefore, those

proteins with $R_{\mathrm{diff}} \geq 0.45$ were considered attenuated proteins.
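
The attenuation criterion can be sketched as follows, assuming per-gene vectors of CNA, mRNA and protein values measured over the same samples. The function names and dictionary layout are hypothetical; the default threshold of 0.45 is the 75th-percentile value reported above and would need to be recomputed for another dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def attenuated_proteins(cna, mrna, protein, threshold=0.45):
    """Return (attenuated gene set, per-gene R_diff).

    Each argument maps a gene to its per-sample values. R_diff is the
    CNA-mRNA correlation minus the CNA-protein correlation.
    """
    r_diff = {
        gene: pearson(cna[gene], mrna[gene]) - pearson(cna[gene], protein[gene])
        for gene in cna
        if gene in mrna and gene in protein
    }
    attenuated = {gene for gene, diff in r_diff.items() if diff >= threshold}
    return attenuated, r_diff
```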

4.5.16 Pathway enrichment analysis

The selected set of genes were tested for enrichment against gene sets of pathways present in Molecular

Signature Database (MSigDB) v6.0 [162]. A hypergeometric test based gene set enrichment analysis

was used for this purpose (https://github.com/raunakms/GSEAFisher). A cut-off threshold of FDR <

0.01 was used to obtain the significantly enriched pathways. Only pathways that are enriched with at

least three differentially expressed genes were considered for further analysis. To calculate the pathway

activity score, the expression dataset was transformed into standard normal distribution using ‘inverse

normal transformation’ method. This step is necessary for fair comparison between the expression-

values of different genes. For each sample, the pathway activity score is the mean expression level of the

differentially expressed genes linked to the enriched pathway.
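
A hypergeometric enrichment test of the kind described above can be sketched in a few lines. This is a generic illustration, not the code in the GSEAFisher repository; the function names and gene sets are hypothetical, and the multiple-testing correction behind the FDR cut-off is omitted.

```python
from math import comb


def hypergeom_pvalue(overlap, pathway_size, n_selected, n_background):
    """Upper-tail probability P(X >= overlap) of drawing at least `overlap`
    pathway genes when `n_selected` genes are drawn from the background."""
    upper = min(pathway_size, n_selected)
    favourable = sum(
        comb(pathway_size, i) * comb(n_background - pathway_size, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(n_background, n_selected)


def enrich(selected, pathways, background, min_overlap=3):
    """Return {pathway: (overlap, p-value)} for pathways that share at least
    `min_overlap` genes with the selected gene set."""
    background = set(background)
    selected = set(selected) & background
    results = {}
    for name, genes in pathways.items():
        genes = set(genes) & background
        overlap = len(selected & genes)
        if overlap >= min_overlap:
            p = hypergeom_pvalue(overlap, len(genes), len(selected), len(background))
            results[name] = (overlap, p)
    return results
```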

4.5.17 Stromal and immune score

We used two sets of 141 genes (one each for stromal and immune gene signatures) as described in [206].

We used ‘inverse normal transformation’ method to transform the distribution of expression data into the

standard normal distribution. The stromal and immune scores were calculated, for each sample, using

the summation of standard normal deviates of each gene in the given set.
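
The scoring step can be sketched as below. The rank-based form of the inverse normal transformation with a `(rank - 0.5) / n` offset is one common convention and is an assumption here, as are the function names and placeholder gene labels; ties are broken by sample order rather than averaged.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Rank-based inverse normal transform of one gene across samples.

    Each value is replaced by the standard normal quantile of
    (rank - 0.5) / n, where rank is its 1-based position in sorted order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    standard_normal = NormalDist()
    for rank, idx in enumerate(order, start=1):
        transformed[idx] = standard_normal.inv_cdf((rank - 0.5) / n)
    return transformed


def signature_score(expr, signature_genes):
    """Per-sample sum of standard normal deviates over the signature genes.

    `expr` maps a gene to its per-sample expression values; signature genes
    missing from `expr` are skipped.
    """
    rows = [inverse_normal_transform(expr[g]) for g in signature_genes if g in expr]
    return [sum(column) for column in zip(*rows)]
```

Replacing the sum with a mean over the differentially expressed genes of a pathway gives a score of the form used for pathway activity in Section 4.5.16.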

4.5.18 Enumeration of tissue-resident immune cell types using mRNA expression profiles

CIBERSORT algorithm [124] was applied to the RNA-seq gene-expression data to estimate the propor-

tions of 22 immune cell types (B cells naive, B cells memory, Plasma cells, T cells CD8, T cells CD4

naive, T cells CD4 memory resting, T cells CD4 memory activated, T cells follicular helper, T cells

gamma delta, T cells regulatory (Tregs), NK cells resting, NK cells activated, Monocytes, Macrophages

M0, Macrophages M1, Macrophages M2, Dendritic cells resting, Dendritic cells activated, Mast cells

resting, Mast cells activated, Eosinophils, and Neutrophils) using LM22 dataset provided by CIBER-

SORT platform. Genes not expressed in any of the PeM tumor samples were removed from the LM22

dataset. The analysis was performed using 1000 permutations. The 22 immune cell types were later

aggregated into 11 distinct groups.

4.5.19 External datasets

TCGA datasets for 16 different cancer-types used in this study were downloaded from the National

Cancer Institute-Genomic Data Commons (NCI-GDC; https://portal.gdc.cancer.gov/) in February 2017.

For somatic mutation data, non-silent variant calls that were identified by at least three out of four different

tools (MuSE, MuTect2, SomaticSniper and VarScan2) were considered. CNA segmented data

were further processed using Nexus Copy Number Discovery Edition Version 9.0 (BioDiscovery, Inc.,

El Segundo, CA) to identify aberrant regions in the genome. In case of the RNA-seq expression data,

HTSeq-FPKM-UQ normalized data were used.
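
The "at least three out of four tools" rule for somatic variant calls amounts to a simple vote across callers, sketched below. The function name and variant identifiers are hypothetical; real calls would first need to be normalized to a common representation before comparison.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_tools=3):
    """Return the variants reported by at least `min_tools` callers.

    `calls_by_tool` maps a caller name to an iterable of variant identifiers.
    Each caller contributes at most one vote per variant.
    """
    votes = Counter()
    for calls in calls_by_tool.values():
        votes.update(set(calls))
    return {variant for variant, n in votes.items() if n >= min_tools}
```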

Figure 4.1: Landscape of somatic mutations in PeM tumors. (A) Comparison of somatic mutation rate in protein-coding regions of PeM with different adult cancers obtained from TCGA. (B) Mutational signature present in PeM (top panel). Proportional contribution of different COSMIC mutational signature per tumor sample. (C) Somatic alterations identified in PeM tumors grouped by important cancer pathways. LUSC: Lung Squamous Cell Carcinoma, LUAD: Lung Adenocarcinoma, BLCA: Urothelial Bladder Carcinoma, COAD: Colorectal Carcinoma, UCEC: Uterine Corpus Endometrial Carcinoma, OV: Ovarian Cancer, KIRP: Kidney Renal Papillary Cell Carcinoma, KIRC: Kidney Renal Clear Cell Carcinoma, UCS: Uterine Carcinosarcoma, GBM: Glioblastoma Multiforme, BRCA: Breast Invasive Carcinoma, MESO-PM: Malignant Pleural Mesothelioma, MESO-PeM: Malignant Peritoneal Mesothelioma, PAAD: Pancreatic Adenocarcinoma, PRAD: Prostate Adenocarcinoma, KICH: Kidney Chromophobe, TGCT: Testicular Germ Cell Tumor.

Figure 4.2: Landscape of copy number aberrations in PeM tumors. (A) Aggregate copy-number alterations by chromosome regions in PeM tumors. Important genes with copy-number changes are highlighted. (B) Sample-wise view of copy-number alterations in PeM tumors. (C) Comparison of copy-number burden (considering protein-coding regions only) in PeM with respect to other adult cancers obtained from TCGA. (D) Highly aberrant genomic regions in PeM prioritized by GISTIC. (E) mRNA expression pattern of BAP1 across all PeM samples. The Wilcoxon signed-rank test p-value for BAP1 mRNA expression compared between the PeM subtypes is indicated in the box. (F) Detection of BAP1 nuclear protein expression in PeM tumors by immunohistochemistry (Photomicrographs magnification - 20x). (G) Unsupervised consensus clustering of tumor samples based on copy-number segmentation mean values of the 3349 most variable genes.

Figure 4.3: Gene fusions in PeM. (A-B) Circos plot showing the gene fusion events identified in PeM tumors. (A) BAP1intact subtype (B) BAP1del subtype. (C-F) Few selected gene fusion events identified in PeM tumors. The top and middle panel shows the chromosome and the transcripts involved in the gene fusion event. The bottom panel shows the RNA-seq read counts detected for the respective transcripts. (C) KANSL1-ARL17B fusion, (D) PBRM1-ADGB fusion, (E) SETD2-CHP1 fusion, and (F) PHF7-PBRM1 fusion. (G-J) The chromatogram showing the Sanger sequencing validation of the fusion-junction point.

Figure 4.4: Transcriptome and proteome profile of PeM. (A-B) Principal component analysis of PeM tumors using (A) transcriptome profiles and (B) proteome profiles. (C) Effects of CNA on transcriptome and proteome. In the scatterplot, each dot represents a gene/protein. The horizontal and vertical axes represent Pearson correlation coefficient between CNA-transcriptome and CNA-proteome respectively. Key cancer genes that undergo protein attenuation have been highlighted. (D) Geneset enrichment analysis of attenuated proteins against gene ontologies (left panel) and Reactome pathways (right panel). (E-G) CORUM core protein complexes regulated by PBRM1 and/or BAP1. The nodes represent individual protein subunits of the respective complex. The node color represents correlation of mRNA expression of the respective gene with BAP1. The border color of the node indicates whether the respective protein is attenuated or not. The edge represents interaction between the protein subunits. The edge information was extracted from the STRING v10 PPI network. The edge color (and edge thickness) represents correlation of protein expression between the respective interaction partners. (E) SWI/SNF complex B (PBAF), (F) SWI/SNF complex A (BAF), and (G) HDAC complex. (H-I) mRNA and protein expression level differences between PeM subtypes. (H) SWI/SNF complex and (I) HDAC complex. The expression levels are log2 transformed and mean normalized.


Figure 4.5: Immune cell infiltration in PeM tumors. (a-b) Pathway enrichment of the top 500 differentially expressed genes between PeM subtypes obtained using (a) mRNA expression and (b) protein expression. (c-d) Correlation between immune score and stromal score derived for each tumor sample using (c) mRNA expression and (d) protein expression. (e) Estimated relative mRNA fractions of leukocytes infiltrated in PeM tumors based on CIBERSORT analysis. (f) CD3 and CD8 immunohistochemistry showing immune cell infiltration in a BAP1del PeM tumor (photomicrograph magnification: 20x). (g) mRNA expression differences in immune checkpoint receptors between PeM subtypes. The bar plot on the right represents the negative log10 of the Wilcoxon signed-rank test p-value computed between PeM subtypes. (h) Correlation between immune score and mRNA expression of immune checkpoint receptors. The expression levels are log2 transformed and mean normalized.


Chapter 5

Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks

5.1 Introduction

Recent large-scale pan-cancer sequencing projects have revealed a multitude of somatic genomic, transcriptomic, proteomic, and epigenomic alterations across cancer types. However, a tumor is likely driven by a select few alterations that provide an evolutionary advantage to the tumor, hence called “driver” alterations [195]. Distinguishing driver alterations from functionally inconsequential random “passenger” alterations is critical for therapeutic development and cancer treatment.

It is well established that, except in a few cases, cancers are driven by multiple driver genes [12, 155]. Whereas the emergence of alterations is likely a consequence of endogenous or exogenous mutagen exposures [7], their evolutionary selection depends on the functional role of the affected genes [195] and on synergistic combinations of different alterations. For example, the TMPRSS2-ERG gene fusion is considered an early driver event in almost half of prostate cancer cases, and it often co-exists with copy-number deletions of PTEN as well as NKX3-1 to drive cancer progression [31, 90, 93]. Recently, concomitant deletion of four cancer genes (BAP1, SETD2, PBRM1, and SMARCC1) in chromosome locus 3p21 has been identified as a driver event in a fraction of clear cell renal cell carcinomas (ccRCC) [33], uveal melanomas [142], and mesotheliomas [208]. These genes are involved in chromatin remodeling, and their loss further impairs the DNA damage repair pathway in the affected tumors [142].


Co-occurring alterations might be evolutionarily selected because an alteration in one gene might enhance the deleterious effect of another [28]. Such co-selected genes are often part of a functionally interacting driver subnetwork (or pathway) whose alterations are observed together in the same tumor and define its phenotype. In fact, as demonstrated by pan-cancer and other large-scale sequencing efforts, co-occurring genomic and transcriptomic alterations in specific tumor types are commonly shared across a large fraction of patients. Thus, efficient computational methods that can identify large subsets of functionally interacting (genomic or transcriptomic) alterations that are highly conserved across specific tumor types are in high demand.

5.2 Literature Review

Recently, a number of computational methods have been developed to identify recurrent genomic (as well as transcriptomic) alteration patterns across tumor samples. Some of these methods have been designed to identify multiple gene alterations simultaneously, based on their co-occurrence or mutual exclusivity relationships in a tumor cohort, without any reference to a molecular interaction network [45, 89, 118]. Other approaches have been developed with the aim of identifying a specific subnetwork within a molecular interaction network, either through (i) a combinatorial formulation, with the goal of maximizing the total weight of the subnetwork in a molecular interaction network with node (and possibly edge) weights [53, 108], or (ii) a network diffusion process to derive specific mutated pathways [102, 189]. A direction particularly relevant to our study is motivated by [6, 88, 185, 189] and explored by Bomersbach et al. [18], who proposed an alternative formulation for finding a subnetwork of a given size k with the goal of maximizing h, the number of samples for which at least one gene of the subnetwork is in an altered state. (A similar formulation, where the goal is to maximize a weighted difference of k and h for varying size k, can be found in [76].) Although the above combinatorial problems are typically NP-hard, they become manageable through the use of state-of-the-art ILP solvers or greedy heuristics, or by the use of complex preprocessing procedures.

Complementary to the ideas proposed above, there are also several approaches to identify mutually exclusive (rather than jointly altered) sets of genes and pathways [37, 117, 190]. These approaches utilize the mutational heterogeneity prevalent in cancer genomes and are driven by the observation that mutations acting on the same pathway are often mutually exclusive across tumor samples. Although these approaches are very interesting from a methodological point of view, they are not trivially extendable to the problem of identifying co-occurring alteration patterns (involving more than two genes) conserved across many samples.


5.3 Our Contributions

In this chapter, we present a novel computational method, cd-CAP (combinatorial detection of Conserved Alteration Patterns), that primarily uses an ILP formulation to identify subnetworks of an interaction network, each with an alteration pattern conserved across (a large subset of) a tumor sample cohort. Some of the previous methods described above attempt to solve a variant of the problem, but do so by considering only a single network and using binary labeled genes, indicating only whether the gene is altered or not. Unlike these approaches, our method simultaneously identifies more than one subnetwork, and each gene within each subnetwork has labels specific to the alteration types it harbors. In fact, we allow a gene to have more than one label, each corresponding to a specific alteration type: somatic mutation, copy number alteration, or aberrant expression. From this point on, we will refer to each distinct alteration type as a specific “color” of the corresponding node in an interaction network.

The algorithmic framework of cd-CAP consists of two major steps. The first step is an exhaustive search method (a variant of the a-priori algorithm) that was originally designed for association rule mining [3]. This step computes the set of all “candidate” subnetworks (each with a distinct color assignment) of size at most k shared among at least t samples (both k and t are user-defined parameters). cd-CAP provides the user with the additional options that (i) at least two distinct colors should be present in the coloring of a subnetwork, or (ii) each sample network can include up to a fraction δ of nodes whose color assignment differs from that of the “template”. cd-CAP also gives the user the option to stop at this point and report (a) the largest colored subnetwork that appears in at least t samples (we report on some results obtained with this option), or (b) the colored subnetwork of size k that is shared by the largest number of samples. Alternatively, the second step solves the maximum conserved subnetwork cover problem, which asks to cover the maximum number of nodes in all samples with at most l colored subnetworks (l is user defined), obtained in the first step, via ILP.

We have applied cd-CAP, with each of the possible options above, i.e., (i), (ii), (a), and (b), to TCGA breast cancer (BRCA), colorectal adenocarcinoma (COAD), and glioblastoma (GBM) datasets, which collectively include over 1000 tumor samples. cd-CAP identified several connected subnetworks of interest, each exhibiting a specific gene alteration pattern across a large subset of samples.

In particular, cd-CAP results with option (i) demonstrated that many of the largest highly conserved subnetworks within a tumor type consist solely of genes that have been subject to copy number gain, typically located on the same chromosomal arm and thus likely the result of a single, large-scale amplification. One of these subnetworks, which cd-CAP observed in about one third of the COAD samples [170], includes 9 genes in chromosomal arm 20q, which corresponds to a known amplification recurrent in colorectal tumors. Another copy-number gain subnetwork that cd-CAP observed in breast cancer samples corresponds to a recurrent large-scale amplification in chromosome 1 [42]. It is interesting to note that cd-CAP was able to re-discover these events without specific training.

Several additional subnetworks identified by option (i) consist solely of genes that are aberrantly expressed. Further analysis with options (ii) and (b) of cd-CAP revealed subnetworks that capture signaling pathways and processes critical for oncogenesis in a large fraction of tumors. We have also demonstrated that the subnetworks identified through all three options of cd-CAP are associated with patients’ survival outcome and hence are clinically important.

In order to assess the statistical significance of subnetworks discovered by cd-CAP with option (a), we introduce for the first time a model in which likely inter-dependent events, in particular amplification or deletion of all genes in a single chromosome arm, are considered as a single event. Conventional models of gene amplification either consider each gene amplification independently [36] (this is the model we implicitly assume in our combinatorial optimization formulations, giving a lower bound on the true p-value), or assume that each amplification can involve more than one gene (forming a subsequent sequence of genes) but with the added assumption that the original gene structure is not altered and the duplications occur in some orthogonal “dimension” [54, 148, 211]. Both models rely on assumptions that do not hold in reality, but inferring the evolutionary history of a genome with arbitrary duplications (that convert one string to another, longer string, by copying arbitrary substrings to arbitrary destinations) is NP-hard and even hard to approximate [40, 123]. By considering all copy number gain or loss events in the same chromosomal arm as a single event, we are, for the first time, able to compute an estimate that provides an empirical upper bound on the statistical significance (p-value) of the subnetworks discovered. (Note that this is not a true upper bound, since a duplication event may involve both arms of a chromosome, but that would be very rare.) Through this upper bound, together with the lower bound above, we can sandwich the true p-value and thus the significance of our discovery.

5.4 Algorithmic Framework of cd-CAP

5.4.1 Combinatorial Optimization Formulation

Consider an undirected and node-labeled graph $G = (V, E)$, representing the human gene or protein interaction network, with $n$ nodes, where the nodes $v_j \in V$ represent genes and the edges $e = (v_h, v_j) \in E$ represent interactions among the genes/proteins. Let us assume that we have $m$ copies of the original network $G$, where each copy represents an individual sample $P_i$ in a cohort. In each network $G_i = (V, E, C_i)$ corresponding to sample $P_i$, each node $v_{i,j}$ (as a copy of $v_j$) is colored with one or more possible colors to form the set $C_{i,j}$ (i.e. $C_i$ maps $v_{i,j}$ to a possibly empty subset of colors $C_{i,j}$). Each color represents a distinct type of alteration harbored by a gene/protein, in particular somatic mutation (single nucleotide alteration or short indel), copy number gain, copy number loss, or significant alteration in expression (which can be trivially expanded to include genic structural alteration - micro-inversion or duplication, gene fusion, alternative splicing, methylation alteration, non-coding sequence alteration) observed in the gene and the protein product. Without loss of generality, $C_{i,j} = \emptyset$ implies that none of the possible alteration events are observed at $v_{i,j}$, and two nodes $v_{i,j}, v_{i',j}$ corresponding to each other in two distinct samples have at least one matching color if $C_{i,j} \cap C_{i',j} \neq \emptyset$.

The main goal of cd-CAP is to identify conserved patterns of (i.e. identically colored) connected subnetworks across a subset of sample networks $G_i$. Consider a connected subnetwork $T = (V_T, E_T)$ of the original interaction network $G$, where each node $v_j \in V_T$ is assigned exactly one color $c_j$. Such a colored subnetwork is said to be shared by a collection of sample networks $G_i$ ($i \in I$) if each node of the subnetwork harbors the same color in every sample network, i.e. $c_j \in \bigcap_{i \in I} C_{i,j}$ for each $v_j \in V_T$. A colored node in a sample network is said to be covered by a subnetwork if the subnetwork is shared by the node’s sample network (Fig. 5.1). Intuitively, a colored subnetwork represents a conserved pattern or a network motif.
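
The definitions of "shared" and "covered" above can be illustrated with a minimal Python sketch. The data layout (a dictionary from sample to gene to set of colors), the function names, and the toy cohort are all hypothetical choices made for illustration; connectivity of the pattern in the interaction network is assumed to have been checked elsewhere.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: sample id -> gene -> set of colors (alteration types).
Cohort = Dict[str, Dict[str, Set[str]]]


def sharing_samples(pattern: Dict[str, str], cohort: Cohort) -> Set[str]:
    """Samples in which every gene of the pattern carries its assigned color."""
    return {
        sample
        for sample, colors in cohort.items()
        if all(color in colors.get(gene, set()) for gene, color in pattern.items())
    }


def covered_nodes(pattern: Dict[str, str], cohort: Cohort) -> Set[Tuple[str, str, str]]:
    """(sample, gene, color) triples covered by the colored pattern."""
    return {
        (sample, gene, color)
        for sample in sharing_samples(pattern, cohort)
        for gene, color in pattern.items()
    }


# Toy cohort with made-up alterations (not taken from any dataset).
toy_cohort: Cohort = {
    "P1": {"EGFR": {"gain", "over"}, "MET": {"over"}},
    "P2": {"EGFR": {"gain"}, "MET": {"over"}, "MYC": {"gain"}},
    "P3": {"EGFR": {"mut"}, "MET": {"over"}},
}
toy_pattern = {"EGFR": "gain", "MET": "over"}

shared = sharing_samples(toy_pattern, toy_cohort)
covered = covered_nodes(toy_pattern, toy_cohort)
```

In this toy cohort the pattern is shared by P1 and P2 only, because EGFR carries a different color in P3.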

cd-CAP combinatorially formulates the problem of identifying conserved patterns of subnetworks as the Maximum Conserved colored Subnetwork Identification problem (MCSI). Here the goal is to find the largest connected subnetwork $S$ of the interaction network $G$ that occurs in exactly $t$ (a user-specified number) samples $\mathcal{P}$, such that each node in $S$ has the same color in each sample $P_i \in \mathcal{P}$. Note that this formulation is orthogonal to that used in [18] and [76], where the goal is to maximize the number of samples that share a fixed-size subnetwork. The advantage of formulating the problem as MCSI is that it naturally admits a generalization of the a-priori algorithm. We also note that our formulation considers distinct types of mutations (as colors) in the conserved alteration patterns, another key improvement over the formulations used in [18, 76].

cd-CAP also supports simultaneous identification of multiple conserved subnetworks that are altered in a large number of samples. In one potential formulation of the problem, one may aim to cover all nodes $v_{i,j}$ in all $m$ input sample networks $G_i$ with the smallest number of subnetworks $T = (V_T, E_T) \in \mathcal{T}$, each shared by at least one sample network. We refer to this combinatorial optimization problem as the Minimum Subgraph Cover Problem for (Node) Colored Interaction Networks (MSC-NCI).

One advantage of the MSC-NCI problem is that it is parameter-free. However, in a realistic multi-omics cancer dataset, the number of genes far exceeds the number of samples represented. Under such conditions, the solution to the MSC-NCI problem will primarily include subnetworks that are large connected components shared by only one sample network. To account for this situation, we introduce the following parameters/constraints akin to those for the MCSI formulation: (1) we require that the nodes in each subnetwork have the same color shared by at least $t$ samples (in the remainder of the discussion, $t$ is referred to as the depth of a subnetwork); and (2) we require that each subnetwork returned contains at most $k$ nodes. Note that this variant of the problem is infeasible for certain cohorts (consider a particular node which has a unique color for a particular sample; clearly requirement (1) cannot be satisfied if $t > 1$). Even if there is a feasible solution, the requirement that each subnetwork in $\mathcal{T}$ is of size at most $k$ makes the problem NP-hard (the reduction is from the problem of determining whether $G$ can be exactly partitioned into connected subnetworks, each with $k$ nodes [52]). As a result, (3) we introduce one additional parameter, $l$, the maximum number of subnetworks (each of size at most $k$, and color-conserved in at least $t$ samples), with the objective of covering the maximum number of nodes across all samples. We call the problem of identifying at most $l$ subnetworks of size at most $k$, whose colors are conserved across at least $t$ samples, so as to maximize the total number of nodes in all these samples covered by these subnetworks, the Maximum Conserved Subnetwork Coverage problem (MCSC).

5.4.2 Algorithmic Framework for solving MCSC

We formulate the MCSC problem (as well as the MSC-NCI problem) as an ILP. A straightforward application of available ILP solvers can only handle relatively small instances of the MSC-NCI problem. This is because the number of variables and the number of constraints for the MSC-NCI ILP formulation are $O(n^2 m^2)$ and $O(n^2 m^3)$ respectively, both very large for a typical problem instance. Fortunately, in all instances of interest, only a limited number of genes are colored in comparison to the total number of nodes $nm$. This enables us to apply an exhaustive search method that is designed for association rule mining [3] to build a list of all candidate subnetworks exactly and efficiently (e.g. in comparison to the ILP or heuristic solutions in [18, 76]) and then solve the MCSC problem on the set of candidate subnetworks. (Note that our exhaustive search method is an extension of the a-priori algorithm, with the difference that we require the candidate subnetworks to maintain connectivity as they grow.)

Generating Conserved Subnetworks

We generate the complete list of candidate subnetworks with minimum depth $t$ by the use of the “anti-monotone property” [103]: if any subnetwork $S$ has depth $< t$, then the depth of all of its supergraphs $S' \supset S$ must be $< t$. This makes it possible to grow the set $\mathcal{S}$ of valid subnetworks comprehensively but without repetition (described as the “optimal order of enumeration” in [113]) through the following breadth-first network growth strategy.

1. For every colored node $v_{i,j}$ and each of its colors $c_\ell$, we create a candidate subnetwork of size 1 containing the node with color $c_\ell$. All samples in which the node is colored $c_\ell$ naturally share this trivial subnetwork.

2. We inductively consider all candidate subnetworks of size $s$ with the goal of growing them to subnetworks of size $s+1$ as follows. For a given subnetwork $T$ of size $s$, consider each neighboring node $u$. For each possible color $c'_\ell$ of $u$, we create a new candidate subnetwork of size $s+1$ by extending $T$ with $u$, colored $c'_\ell$. We maintain this subnetwork for the next inductive step only if the number of samples sharing this new subnetwork is at least $t$; otherwise, we discard it.

During the extension of $T$ above, if the new node $u$ does not reduce the number of samples sharing it, $T$ becomes redundant and is not considered in the ILP formulation.
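
The breadth-first growth strategy above can be sketched in a few lines of Python. This is an illustrative sketch, not the cd-CAP implementation: the function name `grow_candidates`, the dictionary-based inputs, and the toy network are hypothetical, duplicates are avoided by keying candidates on a frozenset rather than by the enumeration order of [113], and the redundancy pruning described in the last sentence is omitted.

```python
from typing import Dict, FrozenSet, Set, Tuple

ColoredNode = Tuple[str, str]        # (gene, color)
Pattern = FrozenSet[ColoredNode]     # a colored subnetwork
Cohort = Dict[str, Dict[str, Set[str]]]


def grow_candidates(adj: Dict[str, Set[str]], cohort: Cohort,
                    t: int, k: int) -> Dict[Pattern, Set[str]]:
    """Connected colored subnetworks with <= k nodes shared by >= t samples."""
    # Step 1: size-1 candidates, one per (gene, color), with supporting samples.
    support: Dict[Pattern, Set[str]] = {}
    for sample, colors in cohort.items():
        for gene, gene_colors in colors.items():
            for color in gene_colors:
                support.setdefault(frozenset([(gene, color)]), set()).add(sample)
    level = {p: s for p, s in support.items() if len(s) >= t}
    result = dict(level)

    # Step 2: extend every size-s candidate by one neighboring colored node.
    for _ in range(k - 1):
        next_level: Dict[Pattern, Set[str]] = {}
        for pattern, samples in level.items():
            genes = {g for g, _ in pattern}
            neighbors = set().union(*(adj.get(g, set()) for g in genes)) - genes
            for u in neighbors:
                by_color: Dict[str, Set[str]] = {}
                for s in samples:
                    for color in cohort[s].get(u, ()):
                        by_color.setdefault(color, set()).add(s)
                for color, shared in by_color.items():
                    # Anti-monotone pruning: keep only extensions of depth >= t.
                    if len(shared) >= t:
                        next_level[pattern | {(u, color)}] = shared
        result.update(next_level)
        level = next_level
    return result


# Toy path network A - B - C with made-up alterations.
toy_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
toy_cohort = {
    "P1": {"A": {"gain"}, "B": {"gain"}, "C": {"over"}},
    "P2": {"A": {"gain"}, "B": {"gain"}, "C": {"over"}},
    "P3": {"A": {"gain"}, "C": {"over"}},
}
candidates = grow_candidates(toy_adj, toy_cohort, t=2, k=3)
```

Because only neighbors of the current node set are tried, every candidate stays connected; for example, the pair A and C is never generated on its own since the two genes are not adjacent.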

Solving MCSC

Given the universe $\mathcal{U} = \{ v_{i,j} \mid C_{i,j} \neq \emptyset;\ i = 1, \dots, m;\ j = 1, \dots, n \}$, containing all the colored nodes in all the sample networks, and the collection of all subnetworks

$$\mathcal{S} = \{ T_i \mid T_i \text{ is shared by at least } t \text{ samples and contains at most } k \text{ nodes} \},$$

our goal is to identify up to $l$ subnetworks from the collection $\mathcal{S}$ whose union contains the maximum possible number of elements of the universe $\mathcal{U}$.

After the list of all candidate subnetworks $\mathcal{S}$ is constructed (as described in the previous subsection), we represent the MCSC problem with the following ILP and solve it using IBM-CPLEX or Gurobi. A binary variable $C[i,j]$ indicates whether colored node $v_{i,j}$ is covered by at least one chosen subnetwork, and a binary variable $X[i]$ indicates whether colored candidate subnetwork $T_i$ is one of those chosen. Let $\mathcal{S}_{i,j}$ denote the set of all subnetworks in $\mathcal{S}$ that contain node $v_{i,j}$ properly colored.

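$$
\begin{aligned}
\text{Maximize} \quad & \sum_{v_{i,j} \in \mathcal{U}} C[i,j] \\
\text{s.t.} \quad & \sum_{T_p \in \mathcal{S}_{i,j}} X[p] \geq C[i,j] \quad (\forall v_{i,j} \in \mathcal{U}) \\
& \sum_{T_i \in \mathcal{S}} X[i] \leq l
\end{aligned}
$$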
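
To make the objective concrete, the following Python sketch solves a tiny MCSC instance by brute force instead of with an ILP solver. It is only meant to show what the ILP optimizes: enumeration over all subsets of at most $l$ candidates is exponential and would not scale to real cohorts. The function name and the toy candidates (each mapped to the set of colored nodes it covers) are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Node = Tuple[str, str]  # (sample, gene); colors omitted in this toy example


def solve_mcsc_bruteforce(candidates: Dict[str, Set[Node]],
                          l: int) -> Tuple[Tuple[str, ...], Set[Node]]:
    """Pick at most l candidates whose union covers the most colored nodes."""
    best_names: Tuple[str, ...] = ()
    best_cover: Set[Node] = set()
    names = sorted(candidates)  # fixed order keeps tie-breaking deterministic
    for r in range(1, l + 1):
        for combo in combinations(names, r):
            cover = set().union(*(candidates[name] for name in combo))
            if len(cover) > len(best_cover):
                best_names, best_cover = combo, cover
    return best_names, best_cover


# Toy candidates: subnetwork name -> colored nodes it covers (made-up data).
toy_candidates = {
    "T1": {("P1", "A"), ("P1", "B"), ("P2", "A")},
    "T2": {("P2", "A"), ("P2", "C")},
    "T3": {("P2", "C"), ("P3", "A"), ("P3", "B"), ("P3", "C")},
}
chosen_one, cover_one = solve_mcsc_bruteforce(toy_candidates, l=1)
chosen_two, cover_two = solve_mcsc_bruteforce(toy_candidates, l=2)
```

The chosen set plays the role of the variables $X[i]$ set to 1, and the returned cover corresponds to the nodes with $C[i,j] = 1$.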


Special Types of Conserved Subnetworks

In addition to the exactly-conserved colored subnetworks obtained through the general MCSC formulation, we also consider two important variants.

1. Colorful Conserved Subnetworks. A colorful subnetwork $T$ is one that has at least two distinct colors represented in the coloring of its nodes, i.e. $c_\ell, c_h \in \bigcup_{v_j \in V_T} \{c_j\}$ with $c_\ell \neq c_h$. In some of the datasets that we analyzed, certain colors were dominant in the input to such an extent that all subnetworks identified by our method had all nodes colored the same. By restricting focus to colorful subnetworks, it is possible, e.g., to capture conserved patterns of potential driver alterations and their impact on their vicinity in the interaction network, in the form of expression alterations. In order to identify the maximal colorful conserved subnetwork of a given depth $t$ in the tumor samples, we only need to keep track of the colorful subnetworks in each iteration, since any colorful network must contain a connected colorful subnetwork.

2. Subnetworks Conserved within error rate $\delta$. In order to reduce the sensitivity of our method to noise (or lack of precision in generating the data) in the input when detecting conserved patterns, we extend our formulation to allow some “errors” in identifying conserved subnetworks. We define $\delta$, the error rate of a colored subnetwork $T$, as the maximum allowable fraction of nodes of $T$ without an assigned color in any sample $P_i$ that shares $T$. To tolerate an error rate of $\delta$, we extend our algorithm to generate candidate subnetworks $\mathcal{S}$ for the MCSC problem by performing a post-processing step in which the list of samples sharing subnetwork $T$ is increased by including all samples that share $T$ with an error rate of $\delta$. (Note that our notion of error is restricted to nodes that do not have a color, i.e. an observed alteration, in each specific sample.)
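
Both variants can be expressed as small predicates over a colored pattern. The Python sketch below is one possible reading of the definitions above, with hypothetical function names and toy data: a node counts as an error only when it has no color at all in the sample, while a node that is colored but lacks the pattern's color is treated as a mismatch and disqualifies the sample. That second rule is an assumption based on the closing note of item 2.

```python
from typing import Dict, Set

Cohort = Dict[str, Dict[str, Set[str]]]  # sample -> gene -> colors


def is_colorful(pattern: Dict[str, str]) -> bool:
    """True if at least two distinct colors appear among the pattern's nodes."""
    return len(set(pattern.values())) >= 2


def samples_within_error(pattern: Dict[str, str], cohort: Cohort,
                         delta: float) -> Set[str]:
    """Samples sharing the pattern when up to a fraction delta of its nodes
    may be uncolored (assumed reading of the error-rate definition)."""
    shared = set()
    for sample, colors in cohort.items():
        missing = 0
        compatible = True
        for gene, color in pattern.items():
            observed = colors.get(gene, set())
            if color in observed:
                continue
            if observed:          # colored, but not with the pattern's color
                compatible = False
                break
            missing += 1          # uncolored node counts as one error
        if compatible and missing / len(pattern) <= delta:
            shared.add(sample)
    return shared


# Toy pattern on five genes and made-up samples.
toy_pattern = {"A": "gain", "B": "gain", "C": "gain", "D": "gain", "E": "over"}
exact = {g: {c} for g, c in toy_pattern.items()}
toy_cohort = {
    "P1": exact,
    "P2": {g: cs for g, cs in exact.items() if g != "A"},            # A uncolored
    "P3": {g: cs for g, cs in exact.items() if g not in ("A", "B")},  # two uncolored
    "P4": {**exact, "A": {"loss"}},                                   # A mismatched
}
strict = samples_within_error(toy_pattern, toy_cohort, delta=0.0)
relaxed = samples_within_error(toy_pattern, toy_cohort, delta=0.2)
```

With $\delta = 0.2$ and five nodes, one uncolored node is tolerated, so P2 joins P1, while P3 and P4 remain excluded.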

5.5 Results

5.5.1 Dataset Used

We obtained somatic mutation, copy number aberration, and RNA-seq based gene expression data for three distinct cancer types, glioblastoma multiforme (GBM) [175], breast adenocarcinoma (BRCA) [177], and colon adenocarcinoma (COAD) [170], from The Cancer Genome Atlas (TCGA) datasets. In addition, we distinguish four commonly observed molecular subtypes (i.e. Luminal A, Luminal B, Triple-negative/basal-like, and HER2-enriched) within the BRCA cohort. For each sample, we obtained the list of genes which harbor somatic mutations, copy number aberrations, or are expression outliers, as described below.


Somatic Mutations. All non-silent variant calls that were identified by at least one tool among MuSE, MuTect2, SomaticSniper, and VarScan2 were considered.

Copy Number Aberrations. CNA segmented data from NCI-GDC were further processed using Nexus Copy Number Discovery Edition Version 9.0 (BioDiscovery, Inc., El Segundo, CA) to identify aberrant regions in the genome. We restricted our analysis to the most confident CNA calls, selecting only those genes with high copy gain or homozygous copy loss.

Expression outliers. We applied the GESD test [144] to HTSeq-FPKM-UQ normalized RNA-seq expression data. In particular, we used the GESD test to compare the transcriptome profile of each tumor sample (one at a time) with those of a number of available normal samples. For each gene, if the tumor sample was identified as the most extremely deviated sample (using significance level α = 0.1), the corresponding gene was marked as an expression outlier for that tumor sample. This procedure was repeated for every tumor sample. Finally, by comparing the tumor expression profile of these outlier genes to the normal samples, their up- or down-regulation patterns were determined.

5.5.2 Maximal Colored Subnetworks Across Cancer Types

We used cd-CAP to solve the maximum conserved colored subnetwork identification problem exactly in (each one of the four) protein-interaction network(s) on each cancer type, for every feasible value of network depth. As can be easily observed, the depth and the size of the identified subnetwork are inversely related. We say that a given value of the network depth is feasible if (i) the depth is at least 10% of the cohort size, (ii) the maximum network size for that depth is at least 3, and (iii) the number of “candidate” subnetworks is at most 2 million per iteration when running cd-CAP for that depth.

The number of maximal solutions of cd-CAP as a function of network depth for each cancer type (COAD, GBM, BRCA Luminal A, and BRCA Luminal B) is shown in Figure 5.2A-D for the STRING v10 PPI network with high confidence edges. In general, for a fixed network size, the number of distinct networks of that size decreases as the network depth increases. One can observe the “valleys” in the colored plots in Figure 5.2A-D, which correspond to the largest depth that can be obtained for a given subnetwork size. Throughout the remainder of this chapter, we focus on the colored subnetworks of each given size for which the network depth is the maximum possible, which correspond to the valleys in the plots. If, for a given subnetwork size and the corresponding maximal depth, cd-CAP returns more than one subnetwork, we discard those solutions.
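
The selection rule in the preceding paragraph (for each subnetwork size keep the deepest subnetwork, and drop the size altogether when the maximum depth is attained by more than one subnetwork) can be sketched as follows. The function name and the toy candidates are hypothetical; the input is assumed to map each colored subnetwork to the set of samples sharing it.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pattern = FrozenSet[Tuple[str, str]]  # set of (gene, color) pairs


def unique_deepest_by_size(
    candidates: Dict[Pattern, Set[str]]
) -> Dict[int, Tuple[Pattern, int]]:
    """For each size, return the single deepest pattern; skip sizes with ties."""
    by_size: Dict[int, List[Tuple[int, Pattern]]] = {}
    for pattern, samples in candidates.items():
        by_size.setdefault(len(pattern), []).append((len(samples), pattern))

    selected: Dict[int, Tuple[Pattern, int]] = {}
    for size, items in by_size.items():
        top_depth = max(depth for depth, _ in items)
        winners = [pattern for depth, pattern in items if depth == top_depth]
        if len(winners) == 1:   # discard sizes where the maximum is not unique
            selected[size] = (winners[0], top_depth)
    return selected


# Toy candidates with made-up support sets.
toy_candidates = {
    frozenset({("A", "gain")}): {"P1", "P2", "P3"},
    frozenset({("C", "over")}): {"P1", "P2", "P3"},
    frozenset({("A", "gain"), ("B", "gain")}): {"P1", "P2"},
    frozenset({("B", "gain"), ("C", "over")}): {"P1"},
}
selected = unique_deepest_by_size(toy_candidates)
```

Here size 1 is discarded because two subnetworks tie at depth 3, while size 2 yields a unique subnetwork of depth 2.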

Most of the subnetworks identified for each of the four cancer types, especially those with large depth, consisted only of expression outlier genes (typically all upregulated or all downregulated) (Figure 5.2A-D). As the network depth decreases, maximal subnetworks that consist only of copy number variants emerge. One of the most prominent copy-number gain subnetworks of the COAD dataset has depth 163 out of 463 patients in the cohort. This network forms the core of the larger maximal subnetworks cd-CAP identifies for lower depth values; it corresponds to a copy number gain of the chromosomal arm 20q, a known copy number aberration pattern highly specific to colorectal adenocarcinoma tumors [170].

Another subnetwork cd-CAP identified in 15% of the 422 BRCA Luminal A samples corresponds to a copy number gain on chromosome 1, which is again a known aberration associated with breast cancer [42]. With increasing depth, the maximal subnetworks cd-CAP identifies in the Luminal A cohort start to consist solely of expression outlier genes. In particular, cd-CAP identified a subnetwork of eight underexpressed genes with network depth 90 (Fig. 5.2E), including the genes EGFR, PRKCA, SPRY2, and NRG2, which are known to be involved in EGFR/ERBB2/ERBB4 signaling pathways (Fig. 5.2F). EGFR is an important driver gene involved in the progression of breast tumors to advanced forms [171], and its altered expression is observed in a number of breast cancer cases [42]. The subnetwork also included MET, another well-known oncogene [119], and is enriched for members of the Ras signaling pathway, which is also known for its role in oncogenesis and in mediating cancer phenotypes such as over-proliferation [57].

In order to test for the association between the subnetworks identified by cd-CAP and patient survival outcomes, we used a risk score defined as a linear combination of the normalized gene expression values of the genes in the subnetwork, weighted by their estimated univariate Cox proportional-hazard regression coefficients (see Methods section for details). Based on the risk-score values, the patients covered by the subnetwork were stratified into two risk groups. The Luminal A subnetwork was the most significant among all subnetworks identified in this dataset (Fig. 5.2G). The patients in the high-risk group have poor overall survival, suggesting the clinical importance of the subnetwork identified by cd-CAP.
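
The risk score described above is a weighted sum of expression values, which the following Python sketch illustrates. The Cox coefficients are taken as given inputs (fitting them is outside the scope of the sketch), all numbers are made up, and splitting patients at the median score is an assumption made here for illustration; the text above does not state which cut-off was used.

```python
from statistics import median
from typing import Dict


def risk_scores(expression: Dict[str, Dict[str, float]],
                coefficients: Dict[str, float]) -> Dict[str, float]:
    """Risk score per patient: sum over genes of coefficient * expression."""
    return {
        patient: sum(coefficients[gene] * values[gene] for gene in coefficients)
        for patient, values in expression.items()
    }


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Split patients into two groups at the median score (assumed cut-off)."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Hypothetical univariate Cox coefficients and normalized expression values.
toy_coefficients = {"EGFR": -0.5, "MET": 0.25}
toy_expression = {
    "P1": {"EGFR": 2.0, "MET": 4.0},
    "P2": {"EGFR": -2.0, "MET": 0.0},
    "P3": {"EGFR": 4.0, "MET": 0.0},
    "P4": {"EGFR": 0.0, "MET": 8.0},
}
scores = risk_scores(toy_expression, toy_coefficients)
groups = stratify(scores)
```

The two resulting groups would then be compared with a survival analysis such as a log-rank test, which is not shown here.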

As another example, we identified a colored subnetwork with copy number gain genes that covered 163 patients in the COAD dataset (Fig. 5.2H). The genes in this subnetwork belong to the same chromosome locus, 20q13, suggesting that they may comprise a single region of chromosomal amplification. Intriguingly, all the members also form a linear pathway-like structure on the PPI level. Among them is a group of functionally related genes consisting of transcription factors and their regulators (genes CEBPB, NCOA’s, UBE2’s), which are known to be involved in the intracellular receptor signaling pathway (Fig. 5.2I). CEBPB and UBE2’s are also involved in the regulation of the cell cycle [82]. At the other end of the linear subnetwork, we found MMP9 and SDC4, established mediators of cancer invasion and apoptosis [30, 82]. We also confirmed that this set of genes is highly predictive of the patients’ survival outcome (Fig. 5.2J). These results support the functional importance and clinical relevance of the subnetwork we identified.


5.5.3 Maximal Colorful Subnetworks Across Cancer Types

We next used cd-CAP to solve the maximum conserved colored subnetwork identification problem with at least two distinct colors (see Section 5.4.2 for details), in each of the four protein-interaction network(s) and on each cancer type. Again, cd-CAP was run with every feasible value (as defined above) of network depth. The number of maximal solutions of cd-CAP as a function of network depth for each cancer type (COAD, GBM, BRCA Luminal A, and BRCA Luminal B) is shown in Figure 5.3A-D for the STRING v10 PPI network with high confidence edges. Note that we distinguish here the maximal subnetworks with one or two sequence-level alterations (i.e. somatic mutations and copy number alterations), which are of potential interest since their neighboring expression-level alterations are possibly caused by these sequence-level alterations (Figure 5.3E provides an example), from all the other cases. Similarly, we only focus on the maximal colorful subnetworks of every possible size for which the network depth is the maximum possible, and discard the solutions when cd-CAP returns more than one colorful subnetwork for a feasible value of network depth.

One colorful COAD subnetwork of note is composed of overexpressed genes with an additional copy number gain gene and covers 108 patients (Fig. 5.3E). This subnetwork is mainly enriched for genes involved in ribosome biogenesis (Fig. 5.3G). Cancer has long been known to have an increased demand for ribosome biogenesis [120], and increased ribosome generation has been reported to contribute to cancer development [131]. The biological relevance of this subnetwork is also supported by survival analysis, which shows a strong differentiation between the high-risk and low-risk groups (Fig. 5.3F).

Another colorful subnetwork we observed in 58 BRCA Luminal A samples consists of four copy number gained genes, an overexpressed gene, and two underexpressed genes, including EGFR (Fig. 5.3H). All copy-number gained genes and the overexpressed gene are located in chromosome 1q, whose gain is commonly reported in breast cancer [42]. The subnetwork involves an interesting combination of the down-regulation of the cancer gene EGFR and the amplification of a group of genes involved in T-cell receptor signaling (PTPRC, CD247, and ARPC5; see Figure 5.3I). Thus we may surmise that the covered population of patients potentially has a relatively low cancer proliferation index with a higher anti-tumor immune response, which can be highly relevant indicators with regard to clinical outcome. Indeed, this subnetwork is significantly associated with patients’ survival (Fig. 5.3J).

5.5.4 Multiple-Subnetwork Analysis Across Cancer Types

We next sought to detect up to 5 subnetworks per cancer type that collectively cover the maximum possible number of colored nodes by solving the MCSC problem on the STRING v10.5 network (with experimentally validated edges). The subnetwork extension error rate was set to 20%, and we restricted the search space to subnetworks which do not consist only of expression outlier nodes, in order to obtain what we believe to be more biologically interesting results. Parameter t was chosen for each dataset in a way that made it possible to construct all candidate subnetworks of maximum possible size while keeping the total number of candidate subnetworks below 2 × 10^6, making the problem solvable in a reasonable amount of time. We set t to 69 (15% of the patients), 62 (10% of the patients), and 110 (10% of the patients) for the COAD, GBM, and BRCA datasets, respectively. Table 5.1 shows the size, per sample depth, and the coloring of the nodes in the resulting subnetworks.

We note that the subnetworks identified in the GBM dataset had the lowest depth (10-15% of the

samples). The COAD and BRCA datasets, on the other hand, have much larger depths (30-48% and 15-32% of the samples, respectively). The smaller subnetworks of the GBM dataset consist solely of copy number gain genes on chromosome 7q, a known amplification in GBM [22]. The two large subnetworks each contain a single gene with copy number gain (SEC61G and EGFR, respectively) accompanied by several overexpressed genes. The BRCA dataset exhibits a similar pattern: each of the four large subnetworks contains a single copy number gain gene from chromosome 8q (NSMCE2 in one and MYC in the remaining three subnetworks). Subnetworks detected in the COAD dataset were much more colorful and recurrently

conserved in a larger fraction of samples than those in the other datasets. All genes with copy number

gain are located in chromosome 20q.

We identified a subnetwork with 15 nodes (11 genes with copy number gain, 1 overexpressed and

3 underexpressed genes) in 149 COAD patients (Fig. 5.4A). All 11 copy number gain genes belong to chromosome 20q. IL6R, PLCG1, PTPN1, and HCK are involved in cytokine/interferon signaling to activate immune cells to counter proliferating tumor cells [160] (Fig. 5.4B). UBE2I, AURKA, and MAPRE1

are involved in cell cycle processes. This subnetwork was found to be associated with patients’ survival

outcome (Fig. 5.4C).

We identified another subnetwork with 15 nodes (14 overexpressed genes and 1 copy number gain gene) in 313 breast cancer patients (Fig. 5.4D). Genes in this subnetwork are involved in cell cycle processes (Fig. 5.4E). In particular, the cell cycle checkpoint processes were dysregulated, which is known to drive tumor initiation [194]. The subnetwork was found to be associated with patients’ survival outcome (Fig. 5.4F), demonstrating its clinical relevance.

5.5.5 Empirical P-Value Estimates Confirm the Significance of cd-CAP Identified Networks

To evaluate the significance of cd-CAP’s findings, we performed the permutation test described in Section 5.7.1 1000 times on each cancer type for each setting of subnetwork constraints. Figure 5.5 shows

the distribution of the empirical p-value estimates. (The lower bound results look similar to what is

presented in the figure and thus are omitted.) In the permutation tests all cd-CAP identified subnetworks

(without additional constraints) of size 2-5 were composed solely of expression-altered genes; in contrast, there are several larger CNV-rich subnetworks observed in the TCGA COAD dataset and others, further confirming the significance of our findings. Colorful subnetworks presented in Figure 5.3 are even less likely to occur at random (we therefore omit empirical p-value estimates for the networks in Figure 5.3).

5.6 Discussion

In this study, we introduce a novel combinatorial framework and an associated tool named cd-CAP

which can identify (one or more) subnetworks of an interaction network where genes exhibit conserved

alteration patterns across many tumor samples. Compared with state-of-the-art methods (e.g., [6,

76]), cd-CAP differentiates alteration types associated with each gene (rather than relying on binary

information of a gene being altered or not), and simultaneously detects multiple alteration type conserved

subnetworks.

cd-CAP provides the user with two major options. (a) It computes the largest colored subnetwork that

appears in at least t samples. This option exhibits a significant speed advantage over available ILP-based

approaches; its a-priori based algorithmic formulation allows flexible integration of special constraints

(on maximal subnetworks) – not only simplifying complicated ILP constraints, but also further reducing

the number of candidate subnetworks in iteration steps (a good example for this is the “colorful con-

served subnetworks” as introduced in Section 5.4.2). However, the identified subnetworks are required

to be conserved, i.e., each node only admits one alteration type among the samples sharing it (although

we have relaxed constraints that allow each sample to have a few nodes without any alterations, i.e. col-

ors). In the future, we may extend the definition of a network to include nodes with color mismatches

(for example, according to the definition in [6] or [185]) with a modification to cd-CAP’s candidate sub-

network generation algorithm. (b) It solves the maximum conserved subnetwork cover (MCSC) problem

to cover the maximum number of nodes in all samples with at most l colored subnetworks (l is user-defined) via ILP. In the future, we aim to refine the MCSC formulation with a reduced number of parameters

and hope to develop exact or approximate solutions.
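To make the covering objective of option (b) concrete, the following is a minimal sketch of a greedy heuristic for choosing at most l candidate subnetworks. It is not the ILP used by cd-CAP: the greedy rule is a swapped-in approximation, and the candidate representation (a gene set paired with the samples in which it is conserved), the function names, and the toy data are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical representation of a candidate subnetwork:
# (genes in the subnetwork, samples in which it is conserved).
Candidate = Tuple[FrozenSet[str], FrozenSet[str]]


def covered_pairs(candidate: Candidate) -> Set[Tuple[str, str]]:
    """Return the (sample, gene) pairs that a candidate subnetwork covers."""
    genes, samples = candidate
    return {(s, g) for s in samples for g in genes}


def greedy_cover(candidates: Dict[str, Candidate], max_subnetworks: int) -> List[str]:
    """Pick at most `max_subnetworks` candidates by largest marginal coverage.

    This is a greedy approximation of a maximum-coverage objective,
    not an exact ILP solution.
    """
    chosen: List[str] = []
    covered: Set[Tuple[str, str]] = set()
    for _ in range(max_subnetworks):
        best_name, best_gain = None, 0
        for name in sorted(candidates):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = len(covered_pairs(candidates[name]) - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining candidate adds coverage
            break
        chosen.append(best_name)
        covered |= covered_pairs(candidates[best_name])
    return chosen


# Toy candidates (illustrative only).
toy_candidates = {
    "T1": (frozenset({"MYC", "AURKA"}), frozenset({"P1", "P2", "P3"})),
    "T2": (frozenset({"MYC"}), frozenset({"P1", "P2", "P3", "P4"})),
    "T3": (frozenset({"EGFR", "SEC61G"}), frozenset({"P4", "P5"})),
}
selected = greedy_cover(toy_candidates, max_subnetworks=2)
```

In the toy input, T2 overlaps heavily with T1, so the second pick goes to T3, which covers new (sample, gene) pairs.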

Subnetworks identified by cd-CAP in COAD, GBM and BRCA datasets from TCGA are typically

enriched with genes harboring gene-expression alterations or copy-number gain. Notably, we observed

that genes in subnetworks with copy-number amplification are universally located in the same chromosomal locus. Many of these genes have known interactions and are functionally similar, demonstrating the ability of cd-CAP to capture functionally active subnetworks conserved across a large number of tumor samples. These subnetworks seem to overlap with pathways critical for oncogenesis. In the datasets

analyzed, we observed cell cycle, apoptosis, RNA processing, and immune system processes that are

known to be dysregulated in a large fraction of tumors. cd-CAP also captured subnetworks relevant to

EGFR/ERBB2 signaling pathways, which have distinct expression patterns in specific subtypes of breast

cancer [42, 133]. Survival analysis of cd-CAP identified subnetworks also confirmed their substantial

clinical relevance.

5.7 Methods

5.7.1 Significance of the Identified Subnetworks

Under the assumption that each gene is altered independently, it is possible to apply the conventional

permutation test [18, 89, 190] to assess the statistical significance of the subnetworks identified by cd-CAP as follows. Let $C_i = \{(v_{i,j}, c) : c \in C_{i,j} \neq \emptyset,\ v_{i,j} \in V\}$ be a binary relation representing the existing colors on each node of sample network $G_i$. A permuted copy of the interaction network $G'_i = (V, E, C'_i)$ is generated (under the null hypothesis) by randomly shuffling the range of $C_i$, such that each node $v_{i,j}$ takes a new set of colors $C'_{i,j}$ with the total number of colors $\sum_j |C_{i,j}|$ in $G_i$ preserved. (In other words, $\sum_j |C_{i,j}| = \sum_j |C'_{i,j}|$, and a simple implementation assigns $|C'_{i,j}|$ by randomly shuffling $(|C_{i,j}| : j = 1, 2, \cdots, n)$.) An entire set of permuted sample networks consists of each randomly generated $G'_i$, and this permutation test is repeated sufficiently many (by default 1000) times. For a particular size-$k$ subnetwork $T = (V_T, E_T)$ identified by cd-CAP (on $t$ samples) we define $P_1$ as the fraction of these permutation tests where any subnetwork of size at least $k$ appears in $t$ or more samples.
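The permutation scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: each sample is a mapping from gene to its set of colors, the per-sample shuffle moves whole color sets between genes (which preserves the total color count), and `statistic` is a hypothetical stand-in for re-running the subnetwork search at depth t on a permuted cohort and returning the largest subnetwork size found.

```python
import random
from typing import Callable, Dict, FrozenSet, List

SampleColors = Dict[str, FrozenSet[str]]  # gene -> colors observed in one sample


def permute_sample(colors: SampleColors, rng: random.Random) -> SampleColors:
    """Shuffle the color sets across the genes of one sample.

    The multiset of color sets is unchanged, so the total number of
    colors in the sample is preserved.
    """
    genes = sorted(colors)
    color_sets = [colors[g] for g in genes]
    rng.shuffle(color_sets)
    return dict(zip(genes, color_sets))


def empirical_p1(
    cohort: List[SampleColors],
    statistic: Callable[[List[SampleColors]], int],
    k: int,
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of permuted cohorts in which `statistic` reaches size k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        permuted = [permute_sample(sample, rng) for sample in cohort]
        if statistic(permuted) >= k:  # hypothetical subnetwork search
            hits += 1
    return hits / n_perm


# Toy sample (illustrative only).
toy_sample = {
    "A": frozenset({"AMP"}),
    "B": frozenset(),
    "C": frozenset({"EXP-UP", "AMP"}),
}
shuffled = permute_sample(toy_sample, random.Random(1))
```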

In fact, $P_1$ presents a lower bound on the p-value for $T$ since it ignores the inter-dependency of node colors (gene alteration events). In particular, whole-chromosome or chromosome-arm-level copy number amplifications/deletions are commonly observed in various cancer types. To address this issue, we apply the following procedure to calculate $P_2$ as an empirical upper bound for the p-value of $T$, under the assumption that copy number alterations take place in whole chromosome arms. First we identify all genes $v_j \in V$ on the same chromosome arm, $chr(v_j)$, and construct a set of supernodes $U^{chr}_i = \{chr(v_{i,j}) : \exists c, (v_{i,j}, c) \in C_i\}$ from the genes on the same chromosome arm for each sample $P_i$. Let $N_E = |\{(v_{i,j}, E) \in C_i\}|$ denote the number of nodes with color $E$ (corresponding to either a copy number

gain or loss) in sample $P_i$. Then, each supernode is assigned the color $E$ independently with probability $N_E / |C_i|$, which guarantees that the expected count of $E$ in $P_i$ is preserved. Finally we randomly assign the remaining colors to those nodes without a color assignment thus far, to obtain a new randomly permuted interaction network $G''_i = (V, E, C''_i)$ towards an empirical p-value (upper bound) estimate. We again repeat this process sufficiently many (by default 1000) times to generate distinct permuted datasets and derive $P_2$ by counting the fraction of these datasets where any subnetwork of size at least $k$ appears in $t$ or more samples. The true statistical significance is expected to be in the range $[P_1, P_2]$ provided that chromosome arms form the largest units of alteration.
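The chromosome-arm permutation can be sketched for a single sample as below. This is only one possible reading of the procedure, with several assumptions that the text does not fix: a single copy number color is handled at a time, every gene on a selected arm receives that color, and leftover expression colors are dealt to uncolored genes in random order (surplus colors are dropped if no uncolored genes remain). The function and variable names are hypothetical.

```python
import random
from typing import Dict, FrozenSet

SampleColors = Dict[str, FrozenSet[str]]  # gene -> colors observed in one sample


def permute_by_arm(
    colors: SampleColors,
    arm_of: Dict[str, str],
    cnv_color: str,
    rng: random.Random,
) -> SampleColors:
    """Permute one sample, assigning the copy number color per chromosome arm."""
    n_total = sum(len(c) for c in colors.values())  # |C_i|
    if n_total == 0:
        return dict(colors)
    n_e = sum(1 for c in colors.values() if cnv_color in c)  # N_E
    p = n_e / n_total

    # Supernodes: arms that carry at least one colored gene in this sample.
    supernodes = sorted({arm_of[g] for g, c in colors.items() if c})
    gained = {arm for arm in supernodes if rng.random() < p}

    # Assumption: all genes on a selected arm receive the copy number color.
    permuted = {
        g: (frozenset({cnv_color}) if arm_of[g] in gained else frozenset())
        for g in colors
    }

    # Deal the remaining (non copy number) colors to still-uncolored genes.
    leftovers = [c - {cnv_color} for c in colors.values() if c - {cnv_color}]
    free = sorted(g for g, c in permuted.items() if not c)
    rng.shuffle(free)
    for gene, color_set in zip(free, leftovers):
        permuted[gene] = color_set
    return permuted


# Toy sample (illustrative only).
toy_colors = {
    "g1": frozenset({"AMP"}),
    "g2": frozenset(),
    "g3": frozenset({"EXP-UP"}),
    "g4": frozenset({"AMP"}),
}
toy_arms = {"g1": "20q", "g2": "20q", "g3": "7p", "g4": "8q"}
permuted = permute_by_arm(toy_colors, toy_arms, "AMP", random.Random(0))
```

By construction, genes on the same arm either all carry the copy number color or none do, which is the dependency that the independent shuffle ignores.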

5.7.2 Pathway enrichment analysis

The set of genes in the subnetwork was tested for enrichment against gene sets of pathways present in the Molecular Signatures Database (MSigDB) v6.0 [162]. A hypergeometric test based gene set enrichment

analysis [162] was used for this purpose. A cut-off threshold of false discovery rate (FDR) ≤ 0.01 was

used to obtain the significantly enriched pathways.
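As an illustration of the enrichment test, the sketch below computes the upper-tail hypergeometric p-value for the overlap between a subnetwork and a pathway gene set, followed by a Benjamini-Hochberg adjustment. The choice of Benjamini-Hochberg is an assumption, since the text states only an FDR threshold, and the function names are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(population: int, pathway: int, drawn: int, overlap: int) -> float:
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `population` genes, of which `pathway` belong to the gene set."""
    top = min(pathway, drawn)
    favourable = sum(
        comb(pathway, i) * comb(population - pathway, drawn - i)
        for i in range(overlap, top + 1)
    )
    return favourable / comb(population, drawn)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy usage: a 15-gene subnetwork sharing 5 genes with a 100-gene pathway,
# against a background of 20000 genes (illustrative numbers only).
p_value = hypergeom_upper_tail(population=20000, pathway=100, drawn=15, overlap=5)
q_values = benjamini_hochberg([0.01, 0.04, 0.03])
```

A pathway would then be reported when its adjusted value is at most 0.01.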

5.7.3 Association of sub-networks with patients’ survival outcome

In order to assess the association of identified subnetworks with patients’ survival outcome, we used

a risk-score based on the (weighted) aggregate expression of the genes in the subnetwork. The risk-score ($S$) of a patient is defined as the sum of the normalized gene-expression values in the subnetwork, each weighted by the estimated univariate Cox proportional-hazard regression coefficient [15], i.e., $S = \sum_{i=1}^{k} \beta_i x_{ij}$. Here $i$ and $j$ represent a gene and a patient respectively, $\beta_i$ is the coefficient of Cox regression for gene $i$, $x_{ij}$ is the normalized gene-expression of gene $i$ in patient $j$, and $k$ is the number of genes in the

subnetwork. The normalized gene-expression values were fitted against overall survival time with living

status as the censored event using univariate Cox proportional-hazard regression (exact method). Based

on the risk-score values, patients were stratified into two groups: low-risk group (patients with S < mean

of S), and high-risk group (patients with S ≥ mean of S). Note that only those patients that are covered

by the subnetwork are considered for the analysis above.
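The risk-score calculation and the mean-based stratification can be sketched as follows. The Cox coefficients are assumed to have been estimated beforehand (the regression fit itself is not shown), and the gene names, values, and function names are illustrative only.

```python
from typing import Dict


def risk_scores(
    expression: Dict[str, Dict[str, float]],
    beta: Dict[str, float],
) -> Dict[str, float]:
    """Compute S = sum_i beta_i * x_ij for every patient j.

    `expression` maps patient -> {gene: normalized expression};
    `beta` maps gene -> univariate Cox coefficient (assumed precomputed).
    """
    return {
        patient: sum(beta[gene] * values[gene] for gene in beta)
        for patient, values in expression.items()
    }


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Assign 'high' when S >= mean(S) and 'low' otherwise."""
    mean_score = sum(scores.values()) / len(scores)
    return {p: ("high" if s >= mean_score else "low") for p, s in scores.items()}


# Toy data (illustrative only).
toy_beta = {"MYC": 0.5, "EGFR": -0.2}
toy_expression = {
    "P1": {"MYC": 2.0, "EGFR": 1.0},
    "P2": {"MYC": -1.0, "EGFR": 0.5},
    "P3": {"MYC": 0.0, "EGFR": 0.0},
}
toy_scores = risk_scores(toy_expression, toy_beta)
toy_groups = stratify(toy_scores)
```

The two groups would then be compared with a Kaplan-Meier analysis, restricted to the patients covered by the subnetwork.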

Figure 5.1: Schematic overview of cd-CAP. Multi-omics alteration profiles of a cohort of tumor samples are identified using appropriate bioinformatics tools. The alteration information is combined with gene-level information in the form of a sample-gene alteration matrix. Each alteration type is assigned a distinct color. Using a (signaling) interaction network, cd-CAP identifies subnetworks with conserved alteration patterns.

Table 5.1: Five subnetworks identified by cd-CAP in multi-subnetwork mode for each cancer type: respective columns below depict the subnetwork size, depth, and the number of nodes in the subnetwork with copy number amplification (AMP), expression increase (EXP-UP) or decrease (EXP-DOWN).

Cancer  Network#  Size  Depth  AMP  EXP-UP  EXP-DOWN
COAD    1         6     206    1    5       0
COAD    2         11    152    6    5       0
COAD    3         12    137    7    3       2
COAD    4         15    149    11   1       3
COAD    5         15    223    2    10      3
GBM     1         4     72     4    0       0
GBM     2         4     69     4    0       0
GBM     3         9     67     9    0       0
GBM     4         16    70     1    15      0
GBM     5         36    96     1    32      3
BRCA    1         8     164    7    0       1
BRCA    2         10    332    1    9       0
BRCA    3         11    360    1    10      0
BRCA    4         15    313    1    14      0
BRCA    5         15    335    1    14      0

Figure 5.2: Conserved colored subnetworks. (A-D) Number of maximal solutions and the size of the conserved colored subnetwork obtained using the MCSI formulation, as a function of network depth t, in each of four cancer types analyzed, on STRING v10 (with high confidence nodes) PPI network. The horizontal axis denotes the depth (number of patients) of the network. For the blue plot, the vertical axis denotes the maximum possible network size (in terms of the number of nodes) and thus it is strictly non-increasing by definition. For the plots with different colors, the vertical axis denotes the number of distinct networks with network size equal to that indicated by the blue plot. (E-G) One of the 11 maximal colored subnetworks identified in BRCA Luminal A dataset. (E) The colored subnetwork (with 8 nodes) topology. (F) Pathways dysregulated by alterations harboured by the genes in the subnetwork; these genes are involved in EGFR, ERBB2, and FGFR signaling pathways. (G) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome. (H-J) One of the 10 maximal colored subnetworks identified in COAD dataset. (H) The colored subnetwork (with 9 nodes) topology. (I) Pathways dysregulated by the alterations harboured by the genes in the subnetwork; these genes are involved in signal transduction and apoptotic process. (J) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (73 High Risk vs 83 Low Risk patients).

Figure 5.3: Colorful maximal subnetworks. (A-D) Number of maximal solutions and the size of the conserved colorful subnetwork obtained using the MCSI formulation, as a function of network depth t, in each of four cancer types analyzed on the STRING v10 (high confidence edges) PPI network. The horizontal axis denotes the depth (number of patients) of the network. For the blue plot, the vertical axis denotes the maximum possible network size (in terms of the number of nodes) and thus it is strictly non-increasing by definition. For the plots with different colors, the vertical axis denotes the number of distinct networks with network size equal to that indicated by the blue plot. (E-G) One of the maximal colorful subnetworks identified in the COAD dataset with depth 108 (patients). (E) The colored subnetwork (with 9 nodes) topology, obtained from STRING v10 (with experimentally validated edges) PPI network. (F) Pathways dysregulated by alterations harboured by the genes in the subnetwork. (G) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (59 High Risk vs 47 Low Risk patients). (H-J) One of the maximal colorful subnetworks identified in the Luminal A dataset with no color restrictions, with depth of 58 (patients). (H) The colored subnetwork (with 8 nodes) topology, obtained in the REACTOME PPI network. (I) Pathways dysregulated by the alterations harboured by the genes in the subnetwork. (J) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (30 High Risk vs 30 Low Risk patients).

Figure 5.4: Multiple subnetwork analysis. Two largest among the 15 subnetworks identified across the COAD, GBM and BRCA data sets (5 per each) through the MCSC formulation of cd-CAP on STRING v10.5 (with experimentally validated edges) PPI network. The number in parentheses next to each node represents the univariate Cox proportional-hazard regression coefficient estimated for that gene, used as its weight in the risk-score calculation to stratify the patients into two distinct risk groups (see Section 5.7.3 for details). (A-C) The largest of the 5 COAD subnetworks with a network depth of 149 (patients). (A) The subnetwork topology (with 15 nodes). (B) Pathways dysregulated by alterations harboured by the genes in the subnetwork. (C) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (69 High Risk vs 78 Low Risk patients). (D-F) The largest of the 5 BRCA subnetworks with a network depth of 313 (patients). (D) The subnetwork topology (with 15 nodes). (E) Pathways dysregulated by the alterations harboured by the genes in the subnetwork. (F) Kaplan-Meier plot showing the significant association of the subnetwork with patients’ clinical outcome (33 High Risk vs 278 Low Risk patients).

Figure 5.5: Empirical p-value estimates for the maximum size subnetworks identified by cd-CAP. Compared with the subnetworks observed in real mutation profiles, those identified by cd-CAP in permutation tests (with identical t values) were much smaller, implying a p-value of < 0.001 for each of the colored subnetworks presented in Figure 5.2.

Chapter 6

Conclusion

In recent years, there has been an unprecedented increase in the multi-dimensional high-throughput data

profiling (especially genome, transcriptome, proteome, and epigenome) of cancer patients. This has

revealed extensive mutational heterogeneity observed in the cancer (sub)types, yielding a long-tailed

distribution of mutated genes across the patients, implying the existence of many rare/private driver

genes. Thus, there is a great need for computational methods to mine these massive datasets and prioritize

clinically actionable driver events to aid treatment modalities using precision oncology.

The primary goal of this thesis was to develop novel computational algorithms to identify and priori-

tize cancer driver genes and provide insight into the heterogeneous biology to guide precision oncology.

We introduced HIT’nDRIVE, a combinatorial algorithm to prioritize cancer driver genes. HIT’nDRIVE

models the information flow connecting the genomic aberrations to the changes in global expression pat-

tern in the transcriptome. HIT’nDRIVE measures the potential impact of genomic aberrations on changes

in the global expression of other genes/proteins which are in close proximity in a gene/protein-interaction

network. HIT’nDRIVE then prioritizes those aberrations with the highest impact as cancer driver genes.

We formulated the driver prioritization problem as a “random-walk facility location” (RWFL) problem,

which differs from the standard facility location problem by its use of “hitting time”, the expected number

of hops to reach a “target” gene from a “source” gene, as a distance measure in an interaction network.

HIT’nDRIVE uses “inverse” hitting time as a measure of influence of a source gene over a target gene

to identify the subset of sequencewise altered/source genes whose overall influence over expression al-

tered/target genes is maximum possible.
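To illustrate the hitting-time distance used above, the following minimal sketch computes the expected number of hops for a simple random walk to first reach a target node, by iterating the recurrence h(target) = 0 and h(u) = 1 + mean of h over the neighbours of u. It assumes a connected, undirected, unweighted network given as an adjacency list; it illustrates the distance measure only and is not the HIT’nDRIVE optimization. The function name and toy graph are hypothetical.

```python
from typing import Dict, List


def hitting_times(
    adjacency: Dict[str, List[str]],
    target: str,
    tol: float = 1e-12,
    max_sweeps: int = 100000,
) -> Dict[str, float]:
    """Expected hops for a simple random walk from each node to `target`.

    Iterates the recurrence in place (Gauss-Seidel style) until the
    largest update falls below `tol`.
    """
    h = {node: 0.0 for node in adjacency}
    for _ in range(max_sweeps):
        delta = 0.0
        for node, neighbours in adjacency.items():
            if node == target:
                continue  # the hitting time of the target to itself is 0
            updated = 1.0 + sum(h[w] for w in neighbours) / len(neighbours)
            delta = max(delta, abs(updated - h[node]))
            h[node] = updated
        if delta < tol:
            break
    return h


# Toy path network A - B - C with C as the target (illustrative only).
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
toy_hits = hitting_times(toy_network, target="C")
```

On this path, the walk from B reaches C directly half of the time and otherwise detours through A, so B is three expected hops from C and A is four.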

We further demonstrated that HIT’nDRIVE accurately predicts patient-specific cancer driver

genes. We also demonstrate that by using HIT’nDRIVE-identified driver genes and associated “network

modules” (sub-networks seeded by driver genes whose aggregate expression profiles correlate well with

the cancer phenotype) as features, it is possible to perform accurate phenotype classification. We also

demonstrate that these driver modules are associated with patients’ survival outcome and accurately predict drug efficacy in pan-cancer cell lines. Altogether, HIT’nDRIVE may help clinicians contextualize massive multi-omics data in therapeutic decision making, making widespread implementation of precision oncology possible.

In chapter 4, we described the first-in-field integrative multi-omics characterization of a cohort of malignant peritoneal mesothelioma (PeM). To our knowledge, this is the largest cohort of this rare tumor to be subjected to an integrative multi-omics analysis. We presented the integrated genome, transcriptome, and proteome landscape. BAP1 loss of function is known to be a key driver event of PeM. However, the downstream molecular and clinical significance of BAP1 loss has not been investigated in the context of PeM, and we show that it is predictive for immunotherapy. We found that BAP1 loss forms a distinct molecular

subtype characterized by dysregulated gene expression patterns of chromatin remodeling, DNA repair

pathways, and immune checkpoint receptor activation. We further demonstrated that this subtype is

correlated with an inflammatory tumor microenvironment and thus a candidate for immune checkpoint

blockade therapies. Thus, BAP1 is a biomarker for PeM immunotherapy in 50% of cases we studied.

This is of critical importance because PeM is a rare and understudied cancer for which chemotherapy

and targeted therapies have proven ineffective. BAP1 stratification may improve drug response rates in

ongoing phase-I and II clinical trials exploring the use of immune checkpoint blockade therapies in PeM, in which BAP1 status is not currently taken into account.

Further, we resolved the discordance between mRNA and protein expression patterns in this cohort, and this may apply to other studies incorporating mass spectrometry. The discordance between mRNA and protein levels was found to be due to multimeric protein complexes of chromatin remodeling genes, the majority of which are direct protein-interaction partners of BAP1. The discordance between the

mRNA and the protein expression patterns is most likely due to the ubiquitination and degradation of

proteins in these BAP1 regulated complexes to maintain functional stoichiometry.

Finally, in chapter 5, we introduced cd-CAP, a combinatorial algorithm to identify sub-networks

with conserved molecular alteration patterns across a large subset of a tumor sample cohort. cd-CAP

simultaneously identifies more than one subnetwork, and each gene within each subnetwork has labels

specific to the alteration types it harbors. Notably, we demonstrate that many of the largest highly

conserved subnetworks within a tumor type solely consist of genes that have been subject to copy number

gain, typically located on the same chromosomal arm and thus likely a result of a single, large scale

amplification. We have also demonstrated that the subnetworks identified using cd-CAP are associated

with patients’ survival outcome and hence are clinically important.

6.1 Future Perspective

Continuous development and validation of novel algorithms for identification and prioritization of cancer

driver genes, especially rare driver genes, will be essential given the exponential growth in the number of sequenced tumors. Many studies over the past decade have focused on driver genomic aberrations in the protein-coding regions of genes. Driver genes harbouring aberrations in the non-coding regions are emerging. With

the rise of multi-omics data profiled for a given tumor, efficient computational algorithms to integrate

meaningful information from these multi-omics data together with curated knowledge of signaling pathways/networks will be necessary. Inclusion of epigenome (DNA methylation) and 3D genome interaction (Hi-C) data together with genome, transcriptome, and proteome data would be necessary to understand cancer initiation and progression. Furthermore, I believe that as regulatory interaction networks covering the non-coding genome become available in the near future, they will trigger the next wave of algorithms combining the different types of data mentioned above.

Inference of sub-clonal population structure and identification of sub-clonal driver genes is another avenue that is necessary for the correct identification of driver genes. However, the shallow sequencing depth of the available tumor whole-genome sequences has remained a bottleneck in correctly estimating the sub-clonal population structure of tumor samples. Thus, I believe that as the cost of high-throughput sequencing shrinks further and ultra-high-coverage genomes become more common, efficient computational algorithms will be able to correctly identify sub-clonal driver genes, providing further insight into tumor evolution-guided clinically actionable targets.

In the recent past, single-cell sequencing technology (single-cell DNA-seq, RNA-seq, and methylation profiling) has surged as a promising technique to study molecular changes at single-cell resolution. I believe that advances in the development of computational tools to analyze single-cell sequencing data for resolving intra-tumor heterogeneity, spatial heterogeneity, and reconstructing sub-clonal population structure in tumors will provide new insights in oncology research.

On the other hand, deep neural networks (also known as deep learning) have been recognized as an efficient approach for learning the functional relationships between different types of related data. Although in its infancy, one deep neural network based method, IBM Watson Oncology (https://www.ibm.com/watson/), is being tested for its utility in cancer therapeutics across research centres around the world. Similarly, Google’s DeepMind Health (https://deepmind.com) is being tested to mine patients’ medical reports to predict appropriate treatment in the UK. I believe that combining the algorithmic approaches described in this thesis with deep neural network approaches would help such computational tools become more powerful and robust, which is critical for precision oncology.

Bibliography

[1] AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer discovery, 7(8):818–831, 2017. ISSN 2159-8290. doi:10.1158/2159-8290.CD-17-0151. URL http://www.ncbi.nlm.nih.gov/pubmed/28572459. → pages 53

[2] I. Adzhubei, S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov, and S. R. Sunyaev. A method and server for predicting damaging missense mutations. Nature methods, 7(4):248–249, 2010. ISSN 1548-7091. doi:10.1038/nmeth0410-248. URL https://www.ncbi.nlm.nih.gov/pubmed/20354512. → pages 3

[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. ISBN 1-55860-153-8. → pages 77, 80

[4] U. D. Akavia, O. Litvin, J. Kim, F. Sanchez-Garcia, D. Kotliar, et al. An integrated approach to uncover drivers of cancer. Cell, 143(6):1005–17, Dec. 2010. ISSN 1097-4172. doi:10.1016/j.cell.2010.11.013. → pages 4

[5] H. Alakus, S. E. Yost, B. Woo, R. French, G. Y. Lin, K. Jepsen, K. A. Frazer, A. M. Lowy, and O. Harismendy. BAP1 mutation is a frequent somatic event in peritoneal malignant mesothelioma. Journal of translational medicine, 13(1):122, 2015. ISSN 1479-5876. doi:10.1186/s12967-015-0485-1. URL https://www.ncbi.nlm.nih.gov/pubmed/25889843. → pages 51, 53

[6] N. Alcaraz, T. Friedrich, T. Kotzing, A. Krohmer, J. Muller, J. Pauling, and J. Baumbach. Efficient key pathway mining: combining networks and OMICS data. Integrative biology : quantitative biosciences from nano to macro, 4(7):756–64, jul 2012. ISSN 1757-9708. doi:10.1039/c2ib00133k. URL http://www.ncbi.nlm.nih.gov/pubmed/22353882. → pages 76, 87

[7] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, S. a. J. R. Aparicio, S. Behjati, A. V. Biankin, G. R. Bignell, N. Bolli, A. Borg, A.-L. Børresen-Dale, S. Boyault, B. Burkhardt, A. P. Butler, C. Caldas, H. R. Davies, C. Desmedt, R. Eils, J. E. Eyfjord, J. a. Foekens, M. Greaves, F. Hosoda, B. Hutter, T. Ilicic, S. Imbeaud, M. Imielinski, M. Imielinsk, N. Jager, D. T. W. Jones, D. Jones, S. Knappskog, M. Kool, S. R. Lakhani, C. Lopez-Otin, S. Martin, N. C. Munshi, H. Nakamura, P. a. Northcott, M. Pajic, E. Papaemmanuil, A. Paradiso, J. V. Pearson, X. S. Puente, K. Raine, M. Ramakrishna, A. L. Richardson, J. Richter, P. Rosenstiel, M. Schlesner, T. N. Schumacher, P. N. Span, J. W. Teague, Y. Totoki, A. N. J. Tutt, R. Valdes-Mas, M. M. van Buuren, L. van ’t Veer, A. Vincent-Salomon, N. Waddell, L. R. Yates, Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium, ICGC MMML-Seq Consortium, ICGC PedBrain, J. Zucman-Rossi, P. A. Futreal, U. McDermott, P. Lichter, M. Meyerson, S. M. Grimmond, R. Siebert, E. Campo, T. Shibata, S. M. Pfister, P. J. Campbell, and M. R. Stratton. Signatures of mutational processes in human cancer. Nature, 500(7463):415–21, aug 2013. ISSN 1476-4687. doi:10.1038/nature12477. URL https://www.ncbi.nlm.nih.gov/pubmed/23945592. → pages 52, 75

[8] L. B. Alexandrov, P. H. Jones, D. C. Wedge, J. E. Sale, P. J. Campbell, S. Nik-Zainal, and M. R. Stratton. Clock-like mutational processes in human somatic cells. Nature Genetics, 47(12):1402–1407, 2015. ISSN 15461718. doi:10.1038/ng.3441. URL https://www.ncbi.nlm.nih.gov/pubmed/26551669. → pages 67

[9] E. W. Alley, J. Lopez, A. Santoro, A. Morosky, S. Saraf, B. Piperdi, and E. van Brummelen. Clinical safety and activity of pembrolizumab in patients with malignant pleural mesothelioma (KEYNOTE-028): preliminary results from a non-randomised, open-label, phase 1b trial. The Lancet Oncology, 18(5):623–630, 2017. ISSN 14745488. doi:10.1016/S1470-2045(17)30169-9. URL https://www.ncbi.nlm.nih.gov/pubmed/28291584. → pages 61

[10] S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome biology, 11(10):R106, 2010. ISSN 1474-760X. doi:10.1186/gb-2010-11-10-r106. URL http://www.ncbi.nlm.nih.gov/pubmed/209796212. → pages 65

[11] S. Anders, P. T. Pyl, and W. Huber. HTSeq-A Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2):166–169, 2015. ISSN 14602059. doi:10.1093/bioinformatics/btu638. URL https://www.ncbi.nlm.nih.gov/pubmed/25260700. → pages 65

[12] M. H. Bailey, C. Tokheim, E. Porta-Pardo, S. Sengupta, D. Bertrand, A. Weerasinghe, A. Colaprico, M. C. Wendl, J. Kim, B. Reardon, P. K.-S. Ng, K. J. Jeong, S. Cao, Z. Wang, J. Gao, Q. Gao, F. Wang, E. M. Liu, L. Mularoni, C. Rubio-Perez, N. Nagarajan, I. Cortes-Ciriano, D. C. Zhou, W.-W. Liang, J. M. Hess, V. D. Yellapantula, D. Tamborero, A. Gonzalez-Perez, C. Suphavilai, J. Y. Ko, E. Khurana, P. J. Park, E. M. Van Allen, H. Liang, MC3 Working Group, Cancer Genome Atlas Research Network, M. S. Lawrence, A. Godzik, N. Lopez-Bigas, J. Stuart, D. Wheeler, G. Getz, K. Chen, A. J. Lazar, G. B. Mills, R. Karchin, and L. Ding. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell, 173(2):371–385.e18, apr 2018. ISSN 1097-4172. doi:10.1016/j.cell.2018.02.060. URL http://www.ncbi.nlm.nih.gov/pubmed/29625053. → pages 34, 75

[13] C. E. Barbieri, S. C. Baca, M. S. Lawrence, F. Demichelis, M. Blattner, J.-P. Theurillat, T. a. White, P. Stojanov, E. Van Allen, N. Stransky, E. Nickerson, S.-S. Chae, G. Boysen, D. Auclair, R. C. Onofrio, K. Park, N. Kitabayashi, T. Y. MacDonald, K. Sheikh, T. Vuong, C. Guiducci, K. Cibulskis, A. Sivachenko, S. L. Carter, G. Saksena, D. Voet, W. M. Hussain, A. H. Ramos, W. Winckler, M. C. Redman, K. Ardlie, A. K. Tewari, J. M. Mosquera, N. Rupp, P. J. Wild, H. Moch, C. Morrissey, P. S. Nelson, P. W. Kantoff, S. B. Gabriel, T. R. Golub, M. Meyerson, E. S. Lander, G. Getz, M. a. Rubin, and L. a. Garraway. Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nature genetics, 44(6):685–9, jun 2012. ISSN 1546-1718. doi:10.1038/ng.2279. URL https://www.ncbi.nlm.nih.gov/pubmed/22610119. → pages 36

[14] A. Bashashati, G. Haffari, J. Ding, G. Ha, K. Lui, et al. DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome biology, 13(12):R124, Dec. 2012. ISSN 1465-6914. doi:10.1186/gb-2012-13-12-r124. → pages 6, 19

[15] D. G. Beer, S. L. R. Kardia, C.-C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. G. Taylor, M. D. Iannettoni, M. B. Orringer, and S. Hanash. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature medicine, 8(8):816–24, aug 2002. ISSN 1078-8956. doi:10.1038/nm733. → pages 44, 89

[16] H. Beltran, K. Eng, J. M. Mosquera, A. Sigaras, A. Romanel, H. Rennert, M. Kossai, C. Pauli, B. Faltas, J. Fontugne, K. Park, J. Banfelder, D. Prandi, N. Madhukar, T. Zhang, J. Padilla, N. Greco, T. J. McNary, E. Herrscher, D. Wilkes, T. Y. MacDonald, H. Xue, V. Vacic, A.-K. Emde, D. Oschwald, A. Y. Tan, Z. Chen, C. Collins, M. E. Gleave, Y. Wang, D. Chakravarty, M. Schiffman, R. Kim, F. Campagne, B. D. Robinson, D. M. Nanus, S. T. Tagawa, J. Z. Xiang, A. Smogorzewska, F. Demichelis, D. S. Rickman, A. Sboner, O. Elemento, and M. a. Rubin. Whole-Exome Sequencing of Metastatic Cancer and Biomarkers of Treatment Response. JAMA Oncology, 10021, 2015. ISSN 2374-2437. doi:10.1001/jamaoncol.2015.1313. URL https://www.ncbi.nlm.nih.gov/pubmed/26181256. → pages 43

[17] G. Bianchini, J. M. Balko, I. A. Mayer, M. E. Sanders, and L. Gianni. Triple-negative breast cancer: challenges and opportunities of a heterogeneous disease. Nature reviews. Clinical oncology, may 2016. ISSN 1759-4782. doi:10.1038/nrclinonc.2016.66. URL http://www.ncbi.nlm.nih.gov/pubmed/27184417. → pages 39

[18] A. Bomersbach, M. Chiarandini, and F. Vandin. An Efficient Branch and Cut Algorithm to Find Frequently Mutated Subnetworks in Cancer. In M. Frith and C. N. Storm Pedersen, editors, Algorithms in Bioinformatics, pages 27–39, Cham, 2016. Springer International Publishing. ISBN 978-3-319-43681-4. doi:10.1007/978-3-319-43681-4_3. URL http://link.springer.com/10.1007/978-3-642-33122-0. → pages 76, 79, 80, 88

[19] M. Bott, M. Brevet, B. S. Taylor, S. Shimizu, T. Ito, L. Wang, J. Creaney, R. A. Lake, M. F. Zakowski, B. Reva, C. Sander, R. Delsite, S. Powell, Q. Zhou, R. Shen, A. Olshen, V. Rusch, and M. Ladanyi. The nuclear deubiquitinase BAP1 is commonly inactivated by somatic mutations and 3p21.1 losses in malignant pleural mesothelioma. Nature genetics, 43(7):668–672, 2011. ISSN 1061-4036. doi:10.1038/ng.855. URL https://www.ncbi.nlm.nih.gov/pubmed/21642991. → pages 53

[20] N. J. Bowen, L. D. Walker, L. V. Matyunina, S. Logani, K. A. Totten, B. B. Benigno, and J. F. McDonald. Gene expression profiling supports the hypothesis that human ovarian surface epithelia are multipotent and capable of serving as ovarian cancer initiating cells. BMC medical genomics, 2:71, 2009. ISSN 1755-8794. doi:10.1186/1755-8794-2-71. → pages 24

[21] S. E. Bowyer, A. D. Rao, M. Lyle, S. Sandhu, G. V. Long, G. A. McArthur, J. M. Raleigh, R. J. Hicks, and M. Millward. Activity of trametinib in K601E and L597Q BRAF mutation-positive metastatic melanoma. Melanoma research, 24(5):504–8, 2014. ISSN 1473-5636. doi:10.1097/CMR.0000000000000099. → pages 38

[22] C. W. Brennan, R. G. W. Verhaak, A. McKenna, B. Campos, H. Noushmehr, S. R. Salama, S. Zheng, D. Chakravarty, J. Z. Sanborn, S. H. Berman, R. Beroukhim, B. Bernard, C.-J. Wu, G. Genovese, I. Shmulevich, J. Barnholtz-Sloan, L. Zou, R. Vegesna, S. A. Shukla, G. Ciriello, W. K. Yung, W. Zhang, C. Sougnez, T. Mikkelsen, K. Aldape, D. D. Bigner, E. G. Van Meir, M. Prados, A. Sloan, K. L. Black, J. Eschbacher, G. Finocchiaro, W. Friedman, D. W. Andrews, A. Guha, M. Iacocca, B. P. O’Neill, G. Foltz, J. Myers, D. J. Weisenberger, R. Penny, R. Kucherlapati, C. M. Perou, D. N. Hayes, R. Gibbs, M. Marra, G. B. Mills, E. Lander, P. Spellman, R. Wilson, C. Sander, J. Weinstein, M. Meyerson, S. Gabriel, P. W. Laird, D. Haussler, G. Getz, L. Chin, and TCGA Research Network. The somatic genomic landscape of glioblastoma. Cell, 155(2):462–77, oct 2013. ISSN 1097-4172. doi:10.1016/j.cell.2013.09.034. → pages 86

[23] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, apr 1998. ISSN 01697552. doi:10.1016/S0169-7552(98)00110-X. URL http://dx.doi.org/10.1016/S0169-7552(98)00110-X. → pages 6

[24] R. Bueno, E. W. Stawiski, L. D. Goldstein, S. Durinck, A. De Rienzo, Z. Modrusan, F. Gnad, T. T. Nguyen, B. S. Jaiswal, L. R. Chirieac, D. Sciaranghella, N. Dao, C. E. Gustafson, K. J. Munir, J. A. Hackney, A. Chaudhuri, R. Gupta, J. Guillory, K. Toy, C. Ha, Y.-J. Chen, J. Stinson, S. Chaudhuri, N. Zhang, T. D. Wu, D. J. Sugarbaker, F. J. de Sauvage, W. G. Richards, and S. Seshagiri. Comprehensive genomic analysis of malignant pleural mesothelioma identifies recurrent mutations, gene fusions and splicing alterations. Nature Genetics, 48(October 2015):1–13, 2016. ISSN 1061-4036. doi:10.1038/ng.3520. URL http://www.ncbi.nlm.nih.gov/pubmed/26928227. → pages 53, 55

[25] L. Calabro, A. Morra, E. Fonsatti, O. Cutaia, G. Amato, D. Giannarelli, A. M. Di Giacomo, R. Danielli, M. Altomonte, L. Mutti, and M. Maio. Tremelimumab for patients with chemotherapy-resistant advanced malignant mesothelioma: An open-label, single-arm, phase 2 trial. The Lancet Oncology, 14(11):1104–1111, 2013. ISSN 14702045. doi:10.1016/S1470-2045(13)70381-4. URL https://www.ncbi.nlm.nih.gov/pubmed/24035405. → pages 51, 61

[26] L. Calabro, A. Morra, E. Fonsatti, O. Cutaia, C. Fazio, D. Annesi, M. Lenoci, G. Amato, R. Danielli, M. Altomonte, D. Giannarelli, A. M. Di Giacomo, and M. Maio. Efficacy and safety of an intensified schedule of tremelimumab for chemotherapy-resistant malignant mesothelioma: An open-label, single-arm, phase 2 study. The Lancet Respiratory Medicine, 3(4):301–309, 2015. ISSN 22132619. doi:10.1016/S2213-2600(15)00092-2. URL https://www.ncbi.nlm.nih.gov/pubmed/25819643. → pages 61

[27] L. Calabro, A. Morra, D. Giannarelli, G. Amato, A. D’Incecco, A. Covre, A. Lewis, M. C. Rebelatto, R. Danielli, M. Altomonte, A. M. Di Giacomo, and M. Maio. Tremelimumab combined with durvalumab in patients with mesothelioma (NIBIT-MESO-1): an open-label, non-randomised, phase 2 study. The Lancet. Respiratory medicine, 2600(18):1–10, may 2018. ISSN 2213-2619. doi:10.1016/S2213-2600(18)30151-6. URL http://www.ncbi.nlm.nih.gov/pubmed/29773326. → pages 51

[28] P. J. Campbell. Cliques and Schisms of Cancer Genes. Cancer Cell, 32(2):129–130, 2017. ISSN 18783686. doi:10.1016/j.ccell.2017.07.009. URL http://dx.doi.org/10.1016/j.ccell.2017.07.009. → pages 76

[29] H. Carter, S. Chen, L. Isik, S. Tyekucheva, V. E. Velculescu, K. W. Kinzler, B. Vogelstein, and R. Karchin. Cancer-specific high-throughput annotation of somatic mutations: Computational prediction of driver missense mutations. Cancer Research, 69(16):6660–6667, 2009. ISSN 00085472. doi:10.1158/0008-5472.CAN-09-1133. URL https://www.ncbi.nlm.nih.gov/pubmed/19654296. → pages 3

[30] L. Carvallo, R. Munoz, F. Bustos, N. Escobedo, H. Carrasco, G. Olivares, and J. Larraín. Non-canonical Wnt signaling induces ubiquitination and degradation of Syndecan4. The Journal of biological chemistry, 285(38):29546–55, sep 2010. ISSN 1083-351X. doi:10.1074/jbc.M110.155812. URL http://www.ncbi.nlm.nih.gov/pubmed/20639201. → pages 84

[31] B. S. Carver, J. Tran, A. Gopalan, Z. Chen, S. Shaikh, A. Carracedo, A. Alimonti, C. Nardella, S. Varmeh, P. T. Scardino, C. Cordon-Cardo, W. Gerald, and P. P. Pandolfi. Aberrant ERG expression cooperates with loss of PTEN to promote cancer progression in the prostate. Nature Genetics, 41(5):619–624, 2009. ISSN 1061-4036. doi:10.1038/ng.370. URL https://www.ncbi.nlm.nih.gov/pubmed/19396168. → pages 75

[32] E. Cerami, E. Demir, N. Schultz, B. S. Taylor, and C. Sander. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE, 5(2), 2010. ISSN 19326203. doi:10.1371/journal.pone.0008918. → pages 4

[33] F. Chen, Y. Zhang, Y. Senbabaoglu, G. Ciriello, L. Yang, E. Reznik, B. Shuch, G. Micevic, G. De Velasco, E. Shinbrot, M. S. Noble, Y. Lu, K. R. Covington, L. Xi, J. A. Drummond, D. Muzny, H. Kang, J. Lee, P. Tamboli, V. Reuter, C. S. Shelley, B. A. Kaipparettu, D. P. Bottaro, A. K. Godwin, R. A. Gibbs, G. Getz, R. Kucherlapati, P. J. Park, C. Sander, E. P. Henske, J. H. Zhou, D. J. Kwiatkowski, T. H. Ho, T. K. Choueiri, J. J. Hsieh, R. Akbani, G. B. Mills, A. A. Hakimi, D. A. Wheeler, and C. J. Creighton. Multilevel Genomics-Based Taxonomy of Renal Cell Carcinoma. Cell Reports, 14(10):2476–2489, 2016. ISSN 22111247. doi:10.1016/j.celrep.2016.02.024. URL https://www.ncbi.nlm.nih.gov/pubmed/26947078. → pages 60, 75

[34] P. Chirac, D. Maillet, F. Lepretre, S. Isaac, O. Glehen, M. Figeac, L. Villeneuve, J. Peron, F. Gibson, F. Galateau-Salle, F. N. Gilly, and M. Brevet. Genomic copy number alterations in 33 malignant peritoneal mesothelioma analyzed by comparative genomic hybridization array. Human Pathology, 55:72–82, 2016. ISSN 15328392. doi:10.1016/j.humpath.2016.04.015. URL https://www.ncbi.nlm.nih.gov/pubmed/27184482. → pages 51

[35] D.-Y. Cho, Y.-A. Kim, and T. M. Przytycka. Chapter 5: Network Biology Approach to Complex Diseases. PLoS Computational Biology, 8(12):e1002820, Dec. 2012. ISSN 1553-7358. doi:10.1371/journal.pcbi.1002820. URL http://dx.plos.org/10.1371/journal.pcbi.1002820. → pages 5

[36] S. A. Chowdhury, S. E. Shackney, K. Heselmeyer-Haddad, T. Ried, A. A. Schaffer, and R. Schwartz. Algorithms to Model Single Gene, Single Chromosome, and Whole Genome Copy Number Changes Jointly in Tumor Phylogenetics. PLoS Computational Biology, 10(7), 2014. ISSN 15537358. doi:10.1371/journal.pcbi.1003740. → pages 78

[37] G. Ciriello, E. Cerami, C. Sander, and N. Schultz. Mutual exclusivity analysis identifies oncogenic network modules. Genome research, 22(2):398–406, Feb. 2012. ISSN 1549-5469. doi:10.1101/gr.125567.111. → pages 4, 76

[38] G. Ciriello, M. L. Miller, B. A. Aksoy, Y. Senbabaoglu, N. Schultz, and C. Sander. Emerging landscape of oncogenic signatures across human cancers. Nature genetics, 45(10):1127–1133, sep 2013. ISSN 1546-1718. doi:10.1038/ng.2762. → pages 36

[39] S. Condamin, O. Benichou, V. Tejedor, R. Voituriez, and J. Klafter. First-passage times in complex scale-invariant media. Nature, 450(7166):77–80, 2007. ISSN 0028-0836. doi:10.1038/nature06201. → pages 7, 12

[40] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin. Communication complexity of document exchange. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, pages 197–206, Philadelphia, 2000. Society for Industrial and Applied Mathematics. ISBN 0-89871-453-2. URL http://portal.acm.org/citation.cfm?id=338219.338252. → pages 78

[41] L. Cowen, T. Ideker, B. J. Raphael, and R. Sharan. Network propagation: a universal amplifier of genetic associations. Nature reviews. Genetics, 18(9):551–562, 2017. ISSN 1471-0064. doi:10.1038/nrg.2017.38. URL http://www.ncbi.nlm.nih.gov/pubmed/28607512. → pages 5

[42] C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Graf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, A. Langerød, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Børresen-Dale, J. D. Brenton, S. Tavare, C. Caldas, and S. Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–52, jun 2012. ISSN 1476-4687. doi:10.1038/nature10983. URL http://www.ncbi.nlm.nih.gov/pubmed/22522925. → pages 24, 39, 78, 84, 85, 88

[43] K. B. Dahlman, J. Xia, K. Hutchinson, C. Ng, D. Hucks, P. Jia, M. Atefi, Z. Su, S. Branch, P. L. Lyle, D. J. Hicks, V. Bozon, J. A. Glaspy, N. Rosen, D. B. Solit, J. L. Netterville, C. L. Vnencak-Jones, J. A. Sosman, A. Ribas, Z. Zhao, and W. Pao. BRAF L597 mutations in melanoma are associated with sensitivity to MEK inhibitors. Cancer Discovery, 2(9):791–797, 2012. ISSN 21598274. doi:10.1158/2159-8290.CD-12-0097. → pages 38

[44] P. Dao, K. Wang, C. Collins, M. Ester, A. Lapuk, and S. C. Sahinalp. Optimally discriminative subnetwork markers predict response to chemotherapy. Bioinformatics, 27(13), Jul 2011. → pages 14

[45] P. Dao, Y.-A. Kim, D. Wojtowicz, S. Madan, R. Sharan, and T. M. Przytycka. BeWith: A Between-Within method to discover relationships between cancer modules via integrated analysis of mutual exclusivity, co-occurrence and functional interactions. PLoS computational biology, 13(10):e1005695, oct 2017. ISSN 1553-7358. doi:10.1371/journal.pcbi.1005695. → pages 76

[46] N. D. Dees, Q. Zhang, C. Kandoth, M. C. Wendl, W. Schierding, D. C. Koboldt, T. B. Mooney, M. B. Callaway, D. Dooling, E. R. Mardis, R. K. Wilson, and L. Ding. MuSiC: Identifying mutational significance in cancer genomes. Genome Research, 22(8):1589–1598, 2012. ISSN 10889051. doi:10.1101/gr.134635.111. URL https://www.ncbi.nlm.nih.gov/pubmed/22759861. → pages 2

[47] M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M. Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, and M. J. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5):491–498, 2011. ISSN 1061-4036. doi:10.1038/ng.806. URL https://www.ncbi.nlm.nih.gov/pubmed/21478889. → pages 63

[48] J. Ding, M. K. McConechy, H. M. Horlings, G. Ha, F. Chun Chan, T. Funnell, S. C. Mullaly, J. Reimand, A. Bashashati, G. D. Bader, D. Huntsman, S. Aparicio, A. Condon, and S. P. Shah. Systematic analysis of somatic mutations impacting gene expression in 12 tumour types. Nature Communications, 6(1):8554, dec 2015. ISSN 2041-1723. doi:10.1038/ncomms9554. URL http://www.ncbi.nlm.nih.gov/pubmed/26436532. → pages 4

[49] L. Ding, T. J. Ley, D. E. Larson, C. A. Miller, D. C. Koboldt, et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–10, Jan. 2012. ISSN 1476-4687. doi:10.1038/nature10738. URL https://www.ncbi.nlm.nih.gov/pubmed/22237025. → pages 3

[50] A. Dobin, C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. ISSN 13674803. doi:10.1093/bioinformatics/bts635. URL https://www.ncbi.nlm.nih.gov/pubmed/23104886. → pages 64

[51] B. Dutta, L. Pusztai, Y. Qi, F. Andre, V. Lazar, G. Bianchini, N. Ueno, R. Agarwal, B. Wang, C. Y. Shiang, G. N. Hortobagyi, G. B. Mills, W. F. Symmans, and G. Balazsi. A network-based, integrative study to identify core biological pathways that drive breast cancer clinical subtypes. British journal of cancer, 106(6):1107–16, mar 2012. ISSN 1532-1827. doi:10.1038/bjc.2011.584. URL http://www.ncbi.nlm.nih.gov/pubmed/22343619. → pages 39

[52] M. Dyer and A. Frieze. On the complexity of partitioning graphs into connected subgraphs. Discrete Applied Mathematics, 10(2):139–153, 1985. ISSN 0166-218X. doi:10.1016/0166-218X(85)90008-3. URL http://www.sciencedirect.com/science/article/pii/0166218X85900083. → pages 80

[53] M. El-Kebir and G. W. Klau. Solving the Maximum-Weight Connected Subgraph Problem to Optimality. arXiv, pages 1–32, sep 2014. URL http://arxiv.org/abs/1409.5308. → pages 76

[54] M. El-Kebir, B. J. Raphael, R. Shamir, R. Sharan, S. Zaccaria, M. Zehavi, and R. Zeira. Copy-Number Evolution Problems: Complexity and Algorithms. In M. Frith and C. N. Storm Pedersen, editors, Algorithms in Bioinformatics, pages 137–149, Cham, 2016. Springer International Publishing. ISBN 978-3-319-43681-4. doi:10.1007/978-3-319-43681-4_11. URL http://link.springer.com/10.1007/978-3-642-33122-0. → pages 78

[55] A. Fabregat, K. Sidiropoulos, P. Garapati, M. Gillespie, K. Hausmann, R. Haw, B. Jassal, S. Jupe, F. Korninger, S. McKay, L. Matthews, B. May, M. Milacic, K. Rothfels, V. Shamovsky, M. Webber, J. Weiser, M. Williams, G. Wu, L. Stein, H. Hermjakob, and P. D’Eustachio. The Reactome pathway Knowledgebase. Nucleic acids research, 44(D1):D481–7, jan 2016. ISSN 1362-4962. doi:10.1093/nar/gkv1351. URL http://www.ncbi.nlm.nih.gov/pubmed/24243840. → pages 24

[56] D. A. Fennell, E. Kirkpatrick, K. Cozens, M. Nye, J. Lester, G. Hanna, N. Steele, P. Szlosarek, S. Danson, J. Lord, C. Ottensmeier, D. Barnes, S. Hill, M. Kalevras, T. Maishman, and G. Griffiths. CONFIRM: a double-blind, placebo-controlled phase III clinical trial investigating the effect of nivolumab in patients with relapsed mesothelioma: study protocol for a randomised controlled trial. Trials, 19(1):233, apr 2018. ISSN 1745-6215. doi:10.1186/s13063-018-2602-y. URL http://www.ncbi.nlm.nih.gov/pubmed/29669604. → pages 51

[57] A. Fernandez-Medarde and E. Santos. Ras in cancer and developmental diseases. Genes & cancer, 2(3):344–58, mar 2011. ISSN 1947-6027. doi:10.1177/1947601911411084. URL http://www.ncbi.nlm.nih.gov/pubmed/21779504. → pages 84

[58] S. A. Forbes, D. Beare, H. Boutselakis, S. Bamford, N. Bindal, J. Tate, C. G. Cole, S. Ward, E. Dawson, L. Ponting, R. Stefancsik, B. Harsha, C. YinKok, M. Jia, H. Jubb, Z. Sondka, S. Thompson, T. De, and P. J. Campbell. COSMIC: Somatic cancer genetics at high-resolution. Nucleic Acids Research, 45(D1):D777–D783, 2017. ISSN 13624962. doi:10.1093/nar/gkw1121. → pages 35, 52

[59] P. A. Futreal, L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman, and M. R. Stratton. A census of human cancer genes. Nature reviews. Cancer, 4(3):177–83, mar 2004. ISSN 1474-175X. doi:10.1038/nrc1299. → pages 16, 34, 35

[60] G. Germano, S. Lamba, G. Rospo, L. Barault, A. Magrì, F. Maione, M. Russo, G. Crisafulli, A. Bartolini, G. Lerda, G. Siravegna, B. Mussolin, R. Frapolli, M. Montone, F. Morano, F. de Braud, N. Amirouchene-Angelozzi, S. Marsoni, M. D’Incalci, A. Orlandi, E. Giraudo, A. Sartore-Bianchi, S. Siena, F. Pietrantonio, F. Di Nicolantonio, and A. Bardelli. Inactivation of DNA repair triggers neoantigen generation and impairs tumour growth. Nature, 2017. ISSN 0028-0836. doi:10.1038/nature24673. URL https://www.ncbi.nlm.nih.gov/pubmed/29186113. → pages 60

[61] E. E. Gill, L. S. Chan, G. L. Winsor, N. Dobson, R. Lo, S. J. Ho Sui, B. K. Dhillon, P. K. Taylor, R. Shrestha, C. Spencer, R. E. W. Hancock, P. J. Unrau, and F. S. L. Brinkman. High-throughput detection of RNA processing in bacteria. BMC genomics, 19(1):223, 2018. ISSN 1471-2164. doi:10.1186/s12864-018-4538-8. URL http://www.ncbi.nlm.nih.gov/pubmed/29587634. → pages 9

[62] E. Goncalves, A. Fragoulis, L. Garcia-Alonso, T. Cramer, J. Saez-Rodriguez, and P. Beltrao. Widespread Post-transcriptional Attenuation of Genomic Copy-Number Variation in Cancer. Cell Systems, 0(0):1–13, 2017. ISSN 24054712. doi:10.1016/j.cels.2017.08.013. URL https://www.ncbi.nlm.nih.gov/pubmed/29032074. → pages 57, 58

[63] A. Gonzalez-Perez and N. Lopez-Bigas. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. American Journal of Human Genetics, 88(4):440–449, 2011. ISSN 00029297. doi:10.1016/j.ajhg.2011.03.004. URL https://www.ncbi.nlm.nih.gov/pubmed/21457909. → pages 3

[64] A. Gonzalez-Perez and N. Lopez-Bigas. Functional impact bias reveals cancer drivers. Nucleic Acids Research, 40(21):1–10, 2012. ISSN 03051048. doi:10.1093/nar/gks743. URL https://www.ncbi.nlm.nih.gov/pubmed/22904074. → pages 3

[65] A. Gonzalez-Perez, J. Deu-Pons, and N. Lopez-Bigas. Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation. Genome medicine, 4(11):89, 2012. ISSN 1756-994X. doi:10.1186/gm390. URL https://www.ncbi.nlm.nih.gov/pubmed/23181723. → pages 3

[66] C. S. Grasso, Y.-M. Wu, D. R. Robinson, X. Cao, S. M. Dhanasekaran, A. P. Khan, M. J. Quist, X. Jing, R. J. Lonigro, J. C. Brenner, I. A. Asangani, B. Ateeq, S. Y. Chun, J. Siddiqui, L. Sam, M. Anstett, R. Mehra, J. R. Prensner, N. Palanisamy, G. A. Ryslik, F. Vandin, B. J. Raphael, L. P. Kunju, D. R. Rhodes, K. J. Pienta, A. M. Chinnaiyan, and S. A. Tomlins. The mutational landscape of lethal castration-resistant prostate cancer. Nature, 487(7406):239–43, jul 2012. ISSN 1476-4687. doi:10.1038/nature11125. URL https://www.ncbi.nlm.nih.gov/pubmed/22722839. → pages 24

[67] M. Greaves and C. C. Maley. Clonal evolution in cancer. Nature, 481(7381):306–13, Jan. 2012. ISSN 1476-4687. doi:10.1038/nature10762. URL https://www.ncbi.nlm.nih.gov/pubmed/22258609. → pages 1, 3

[68] C. Greenman, R. Wooster, P. A. Futreal, M. R. Stratton, and D. F. Easton. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics, 173(4):2187–98, Aug. 2006. ISSN 0016-6731. doi:10.1534/genetics.105.044677. URL https://www.ncbi.nlm.nih.gov/pubmed/16783027. → pages 2

[69] C. Greenman, P. Stephens, R. Smith, G. L. Dalgliesh, C. Hunter, et al. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–8, Mar. 2007. ISSN 1476-4687. doi:10.1038/nature05610. URL https://www.ncbi.nlm.nih.gov/pubmed/17344846. → pages 1, 10

[70] M. Griffith, O. L. Griffith, A. C. Coffman, J. V. Weible, J. F. McMichael, N. C. Spies, J. Koval, I. Das, M. B. Callaway, J. M. Eldred, C. A. Miller, J. Subramanian, R. Govindan, R. D. Kumar, R. Bose, L. Ding, J. R. Walker, D. E. Larson, D. J. Dooling, S. M. Smith, T. J. Ley, E. R. Mardis, and R. K. Wilson. DGIdb: mining the druggable genome. Nature methods, 10(12):1209–10, 2013. ISSN 1548-7105. doi:10.1038/nmeth.2689. → pages 35

[71] A. Gupta, M. M. Hossain, N. Miller, M. Kerin, G. Callagy, and S. Gupta. NCOA3 coactivator is a transcriptional target of XBP1 and regulates PERK-eIF2α-ATF4 signalling in breast cancer. Oncogene, 35(October 2015):1–12, apr 2016. ISSN 1476-5594. doi:10.1038/onc.2016.121. URL http://www.ncbi.nlm.nih.gov/pubmed/27109102. → pages 40

[72] D. Hanahan and R. A. Weinberg. Hallmarks of cancer: the next generation. Cell, 144(5):646–74, mar 2011. ISSN 1097-4172. doi:10.1016/j.cell.2011.02.013. URL http://www.ncbi.nlm.nih.gov/pubmed/21376230. → pages 1

[73] E. Hodzic, R. Shrestha, K. Zhu, K. Cheng, C. C. Collins, and S. C. Sahinalp. Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks. bioRxiv, 2018. doi:10.1101/369850. URL https://doi.org/10.1101/369850. → pages

[74] J. Hopcroft and D. Sheldon. Manipulation-resistant reputations using hitting time. In Algorithms and Models for the Web-Graph, pages 68–81. Springer, 2007. → pages 22

[75] F. Hormozdiari, C. Alkan, E. E. Eichler, and S. C. Sahinalp. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome research, 19(7):1270–1278, July 2009. → pages 13

[76] B. H. Hristov and M. Singh. Network-based coverage of mutational profiles reveals cancer genes. Cell Systems, 5(3):221–229.e4, 2017. ISSN 16113349. doi:10.1016/j.cels.2017.09.003. URL http://arxiv.org/abs/1704.08544. → pages 76, 79, 80, 87

[77] X. Hua, H. Xu, Y. Yang, J. Zhu, P. Liu, and Y. Lu. DrGaP: A powerful tool for identifying driver genes and pathways in cancer sequencing studies. American Journal of Human Genetics, 93(3):439–451, 2013. ISSN 00029297. doi:10.1016/j.ajhg.2013.07.003. URL https://www.ncbi.nlm.nih.gov/pubmed/23954162. → pages 2

[78] C. S. Hughes, S. Foehr, D. A. Garfield, E. E. Furlong, L. M. Steinmetz, and J. Krijgsveld. Ultrasensitive proteome analysis using paramagnetic bead technology. Molecular Systems Biology, 10(10):757–757, 2014. ISSN 1744-4292. doi:10.15252/msb.20145625. URL http://www.ncbi.nlm.nih.gov/pubmed/25358341. → pages 65

[79] C. S. Hughes, M. K. McConechy, D. R. Cochrane, T. Nazeran, A. N. Karnezis, D. G. Huntsman, and G. B. Morin. Quantitative Profiling of Single Formalin Fixed Tumour Sections: proteomics for translational research. Scientific Reports, 6(1):34949, 2016. ISSN 2045-2322. doi:10.1038/srep34949. URL http://www.ncbi.nlm.nih.gov/pubmed/27713570. → pages 65, 66

[80] F. Iorio, T. A. Knijnenburg, D. J. Vis, G. R. Bignell, M. P. Menden, M. Schubert, N. Aben, E. Goncalves, S. Barthorpe, H. Lightfoot, T. Cokelaer, P. Greninger, E. van Dyk, H. Chang, H. de Silva, H. Heyn, X. Deng, R. K. Egan, Q. Liu, T. Mironenko, X. Mitropoulos, L. Richardson, J. Wang, T. Zhang, S. Moran, S. Sayols, M. Soleimani, D. Tamborero, N. Lopez-Bigas, P. Ross-Macdonald, M. Esteller, N. S. Gray, D. A. Haber, M. R. Stratton, C. H. Benes, L. F. A. Wessels, J. Saez-Rodriguez, U. McDermott, and M. J. Garnett. A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 166(3):740–54, jul 2016. ISSN 1097-4172. doi:10.1016/j.cell.2016.06.017. URL http://www.ncbi.nlm.nih.gov/pubmed/27397505. → pages 41, 44

[81] I. H. Ismail, R. Davidson, J.-P. Gagne, Z. Z. Xu, G. G. Poirier, and M. J. Hendzel. Germline mutations in BAP1 impair its function in DNA double-strand break repair. Cancer research, 74(16):4282–94, aug 2014. ISSN 1538-7445. doi:10.1158/0008-5472.CAN-13-3109. URL http://www.ncbi.nlm.nih.gov/pubmed/24894717. → pages 60

[82] P. F. Johnson. Molecular stop signs: regulation of cell-cycle arrest by C/EBP transcription factors. Journal of cell science, 118(Pt 12):2545–55, jun 2005. ISSN 0021-9533. doi:10.1242/jcs.02459. URL http://www.ncbi.nlm.nih.gov/pubmed/15944395. → pages 84

[83] N. M. Joseph, Y.-y. Chen, A. Nasr, I. Yeh, E. Talevich, C. Onodera, B. C. Bastian, J. T. Rabban, K. Garg, C. Zaloudek, and D. A. Solomon. Genomic profiling of malignant peritoneal mesothelioma reveals recurrent alterations in epigenetic regulatory genes BAP1, SETD2, and DDX3X. Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc, 30(2):246–254, 2017. ISSN 1530-0285. doi:10.1038/modpathol.2016.188. URL http://www.ncbi.nlm.nih.gov/pubmed/27813512. → pages 51

[84] C. Kadoch and G. R. Crabtree. Mammalian SWI/SNF chromatin remodeling complexes and cancer: Mechanistic insights gained from human genomics. Science Advances, 1(5):e1500447–e1500447, 2015. ISSN 2375-2548. doi:10.1126/sciadv.1500447. URL http://www.ncbi.nlm.nih.gov/pubmed/26601204. → pages 60

[85] S. Kato, B. N. Tomson, T. P. H. Buys, S. K. Elkin, J. L. Carter, and R. Kurzrock. Genomic Landscape of Malignant Mesotheliomas. Molecular Cancer Therapeutics, 15(10):2498–2507, 2016. ISSN 1535-7163. doi:10.1158/1535-7163.MCT-16-0229. URL https://www.ncbi.nlm.nih.gov/pubmed/27507853. → pages 51, 53

[86] E. Khurana, Y. Fu, V. Colonna, X. J. Mu, H. M. Kang, T. Lappalainen, A. Sboner, L. Lochovsky, J. Chen, A. Harmanci, J. Das, A. Abyzov, S. Balasubramanian, K. Beal, D. Chakravarty, D. Challis, Y. Chen, D. Clarke, L. Clarke, F. Cunningham, U. S. Evani, P. Flicek, R. Fragoza, E. Garrison, R. Gibbs, Z. H. Gumus, J. Herrero, N. Kitabayashi, Y. Kong, K. Lage, V. Liluashvili, S. M. Lipkin, D. G. MacArthur, G. Marth, D. Muzny, T. H. Pers, G. R. S. Ritchie, J. A. Rosenfeld, C. Sisu, X. Wei, M. Wilson, Y. Xue, F. Yu, E. T. Dermitzakis, H. Yu, M. A. Rubin, C. Tyler-Smith, and M. Gerstein. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science (New York, N.Y.), 342(6154):1235587, 2013. ISSN 1095-9203. doi:10.1126/science.1235587. URL http://www.ncbi.nlm.nih.gov/pubmed/24092746. → pages 3

[87] Y.-A. Kim, S. Wuchty, and T. M. Przytycka. Identifying causal genes and dysregulated pathways in complex diseases. PLoS computational biology, 7(3):e1001095, Mar. 2011. ISSN 1553-7358. doi:10.1371/journal.pcbi.1001095. → pages 5

[88] Y.-A. Kim, R. Salari, S. Wuchty, and T. M. Przytycka. Module cover - a new approach to genotype-phenotype studies. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pages 135–46, 2013. ISSN 2335-6936. URL http://www.ncbi.nlm.nih.gov/pubmed/23424119. → pages 76

[89] Y.-A. Kim, D.-Y. Cho, P. Dao, and T. M. Przytycka. MEMCover: integrated analysis of mutual exclusivity and functional network reveals dysregulated pathways across multiple cancer types. Bioinformatics (Oxford, England), 31(12):i284–92, jun 2015. ISSN 1367-4811. doi:10.1093/bioinformatics/btv247. URL http://www.ncbi.nlm.nih.gov/pubmed/26072494. → pages 76, 88

[90] J. C. King, J. Xu, J. Wongvipat, H. Hieronymus, B. S. Carver, D. H. Leung, B. S. Taylor, C. Sander, R. D. Cardiff, S. S. Couto, W. L. Gerald, and C. L. Sawyers. Cooperativity of TMPRSS2-ERG with PI3-kinase pathway activation in prostate oncogenesis. Nature Genetics, 41(5):524–526, 2009. ISSN 1061-4036. doi:10.1038/ng.371. URL https://www.ncbi.nlm.nih.gov/pubmed/19396167. → pages 75

[91] S. Kohler, S. Bauer, D. Horn, and P. N. Robinson. Walking the Interactome for Prioritization of Candidate Disease Genes. American Journal of Human Genetics, 82(4):949–958, 2008. ISSN 00029297. doi:10.1016/j.ajhg.2008.02.013. URL http://www.cell.com/AJHG/abstract/S0002-9297(08)00172-9. → pages 6

[92] R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML ’02, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1-55860-873-7. URL http://dl.acm.org/citation.cfm?id=645531.655996. → pages 6

[93] K. J. Kron, A. Murison, S. Zhou, V. Huang, T. N. Yamaguchi, Y.-J. Shiah, M. Fraser, T. van der Kwast, P. C. Boutros, R. G. Bristow, and M. Lupien. TMPRSS2-ERG fusion co-opts master transcription factors and activates NOTCH signaling in primary prostate cancer. Nature Genetics, 49(9):1336–1345, 2017. ISSN 1061-4036. doi:10.1038/ng.3930. URL https://www.ncbi.nlm.nih.gov/pubmed/28783165. → pages 75

[94] A. Lan, I. Y. Smoly, G. Rapaport, S. Lindquist, E. Fraenkel, and E. Yeger-Lotem. ResponseNet: Revealing signaling and regulatory networks linking genetic and transcriptomic screening data. Nucleic Acids Research, 39(SUPPL. 2):424–429, 2011. ISSN 03051048. doi:10.1093/nar/gkr359. → pages 6

[95] S. Landreville, O. A. Agapova, K. A. Matatall, Z. T. Kneass, M. D. Onken, R. S. Lee, A. M. Bowcock, and J. W. Harbour. Histone deacetylase inhibitors induce growth arrest and differentiation in uveal melanoma. Clinical Cancer Research, 18(2):408–416, 2012. ISSN 10780432. doi:10.1158/1078-0432.CCR-11-0946. URL https://www.ncbi.nlm.nih.gov/pubmed/22038994. → pages 60

[96] M. S. Lawrence, P. Stojanov, P. Polak, G. V. Kryukov, K. Cibulskis, A. Sivachenko, S. L. Carter, C. Stewart, C. H. Mermel, S. A. Roberts, A. Kiezun, P. S. Hammerman, A. McKenna, Y. Drier, L. Zou, A. H. Ramos, T. J. Pugh, N. Stransky, E. Helman, J. Kim, C. Sougnez, L. Ambrogio, E. Nickerson, E. Shefler, M. L. Cortes, D. Auclair, G. Saksena, D. Voet, M. Noble, D. DiCara, P. Lin, L. Lichtenstein, D. I. Heiman, T. Fennell, M. Imielinski, B. Hernandez, E. Hodis, S. Baca, A. M. Dulak, J. Lohr, D.-A. Landau, C. J. Wu, J. Melendez-Zajgla, A. Hidalgo-Miranda, A. Koren, S. A. McCarroll, J. Mora, R. S. Lee, B. Crompton, R. Onofrio, M. Parkin, W. Winckler, K. Ardlie, S. B. Gabriel, C. W. M. Roberts, J. A. Biegel, K. Stegmaier, A. J. Bass, L. A. Garraway, M. Meyerson, T. R. Golub, D. A. Gordenin, S. Sunyaev, E. S. Lander, and G. Getz. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499(7457):214–8, July 2013. ISSN 1476-4687. doi:10.1038/nature12213. URL http://www.ncbi.nlm.nih.gov/pubmed/23770567. → pages 2

[97] D. T. Le, J. N. Uram, H. Wang, B. R. Bartlett, H. Kemberling, A. D. Eyring, A. D. Skora, B. S. Luber, N. S. Azad, D. Laheru, B. Biedrzycki, R. C. Donehower, A. Zaheer, G. A. Fisher, T. S. Crocenzi, J. J. Lee, S. M. Duffy, R. M. Goldberg, A. de la Chapelle, M. Koshiji, F. Bhaijee, T. Huebner, R. H. Hruban, L. D. Wood, N. Cuka, D. M. Pardoll, N. Papadopoulos, K. W. Kinzler, S. Zhou, T. C. Cornish, J. M. Taube, R. A. Anders, J. R. Eshleman, B. Vogelstein, and L. A. Diaz. PD-1 Blockade in Tumors with Mismatch-Repair Deficiency. New England Journal of Medicine, 372(26):2509–2520, 2015. ISSN 0028-4793. doi:10.1056/NEJMoa1500596. URL https://www.ncbi.nlm.nih.gov/pubmed/26028255. → pages 60

[98] D. T. Le, J. N. Durham, K. N. Smith, H. Wang, B. R. Bartlett, L. K. Aulakh, S. Lu, H. Kemberling, C. Wilt, B. S. Luber, F. Wong, N. S. Azad, A. A. Rucki, D. Laheru, R. Donehower, A. Zaheer, G. A. Fisher, T. S. Crocenzi, J. J. Lee, T. F. Greten, A. G. Duffy, K. K. Ciombor, A. D. Eyring, B. H. Lam, A. Joe, S. P. Kang, M. Holdhoff, L. Danilova, L. Cope, C. Meyer, S. Zhou, R. M. Goldberg, D. K. Armstrong, K. M. Bever, A. N. Fader, J. Taube, F. Housseau, D. Spetzler, N. Xiao, D. M. Pardoll, N. Papadopoulos, K. W. Kinzler, J. R. Eshleman, B. Vogelstein, R. A. Anders, and L. A. Diaz. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade. Science (New York, N.Y.), 357(6349):409–413, 2017. ISSN 1095-9203. doi:10.1126/science.aan6733. URL http://www.ncbi.nlm.nih.gov/pubmed/28596308. → pages 60

[99] N. Leblay, F. Lepretre, N. Le Stang, A. Gautier-Stein, L. Villeneuve, S. Isaac, D. Maillet, F. Galateau-Salle, C. Villenet, S. Sebda, A. Goracci, G. Byrnes, J. D. McKay, M. Figeac, O. Glehen, F. N. Gilly, M. Foll, L. Fernandez-Cuesta, and M. Brevet. BAP1 Is Altered by Copy Number Loss, Mutation, and/or Loss of Protein Expression in More Than 70% of Malignant Peritoneal Mesotheliomas. Journal of Thoracic Oncology, 12(4):724–733, 2017. ISSN 15561380. doi:10.1016/j.jtho.2016.12.019. URL https://www.ncbi.nlm.nih.gov/pubmed/28034829. → pages 51

[100] B. D. Lehmann, J. A. Bauer, X. Chen, M. E. Sanders, A. B. Chakravarthy, Y. Shyr, and J. A. Pietenpol. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. The Journal of clinical investigation, 121(7):2750–67, jul 2011. ISSN 1558-8238. doi:10.1172/JCI45014. → pages 41

[101] M. D. M. Leiserson, D. Blokh, R. Sharan, and B. J. Raphael. Simultaneous identification of multiple driver pathways in cancer. PLoS computational biology, 9(5):e1003054, May 2013. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003054. → pages 4

[102] M. D. M. Leiserson, F. Vandin, H.-T. Wu, J. R. Dobson, J. V. Eldridge, J. L. Thomas, A. Papoutsaki, Y. Kim, B. Niu, M. McLellan, M. S. Lawrence, A. Gonzalez-Perez, D. Tamborero, Y. Cheng, G. A. Ryslik, N. Lopez-Bigas, G. Getz, L. Ding, and B. J. Raphael. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nature Genetics, 47(2):106–114, 2014. ISSN 1061-4036. doi:10.1038/ng.3168. URL http://www.ncbi.nlm.nih.gov/pubmed/25501392. → pages 76

[103] C. K.-S. Leung. Anti-monotone Constraints, pages 98–98. Springer US, Boston, MA, 2009. ISBN 978-0-387-39940-9. doi:10.1007/978-0-387-39940-9_5046. URL https://doi.org/10.1007/978-0-387-39940-9_5046. → pages 80

[104] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008. → pages 12

[105] W. Li, J. Cooper, L. Zhou, C. Yang, H. Erdjument-Bromage, D. Zagzag, M. Snuderl, M. Ladanyi, C. O. Hanemann, P. Zhou, M. A. Karajannis, and F. G. Giancotti. Merlin/NF2 loss-driven tumorigenesis linked to CRL4(DCAF1)-mediated inhibition of the hippo pathway kinases Lats1 and 2 in the nucleus. Cancer cell, 26(1):48–60, jul 2014. ISSN 1878-3686. doi:10.1016/j.ccr.2014.05.001. URL https://www.ncbi.nlm.nih.gov/pubmed/25026211. → pages 55

[106] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007. ISSN 15322882. doi:10.1002/asi.20591. → pages 7, 12

[107] F. Lin, M. C. De Gooijer, E. M. Roig, L. C. M. Buil, S. M. Christner, J. H. Beumer, T. Würdinger, J. H. Beijnen, and O. Van Tellingen. ABCB1, ABCG2, and PTEN determine the response of glioblastoma to temozolomide and ABT-888 therapy. Clinical Cancer Research, 20(10):2703–2713, 2014. ISSN 15573265. doi:10.1158/1078-0432.CCR-14-0084. → pages 38

[108] A. A. Loboda, M. N. Artyomov, and A. A. Sergushichev. Solving Generalized Maximum-Weight Connected Subgraph Problem for Network Enrichment Analysis. In M. Frith and C. N. Storm Pedersen, editors, Algorithms in Bioinformatics, pages 210–221, Cham, 2016. Springer International Publishing. ISBN 978-3-319-43681-4. doi:10.1007/978-3-319-43681-4_17. URL http://link.springer.com/10.1007/978-3-642-33122-0. → pages 76

[109] I. S. U. Luk, R. Shrestha, H. Xue, Y. Wang, F. Zhang, D. Lin, A. Haegert, R. Wu, X. Dong, C. C. Collins, A. Zoubeidi, M. E. Gleave, P. W. Gout, and Y. Wang. BIRC6 Targeting as Potential Therapy for Advanced, Enzalutamide-Resistant Prostate Cancer. Clinical cancer research, 23(6):1542–1551, mar 2017. ISSN 1078-0432. doi:10.1158/1078-0432.CCR-16-0718. URL http://www.ncbi.nlm.nih.gov/pubmed/27663589. → pages 9

[110] M. Maio, A. Scherpereel, L. Calabro, J. Aerts, S. C. Perez, A. Bearz, K. Nackaerts, D. A. Fennell, D. Kowalski, A. S. Tsao, P. Taylor, F. Grosso, S. J. Antonia, A. K. Nowak, M. Taboada, M. Puglisi, P. K. Stockman, and H. L. Kindler. Tremelimumab as second-line or third-line treatment in relapsed malignant mesothelioma (DETERMINE): a multicentre, international, randomised, double-blind, placebo-controlled phase 2b trial. The Lancet Oncology, pages 1–13, 2017. ISSN 14702045. doi:10.1016/S1470-2045(17)30446-1. URL https://www.ncbi.nlm.nih.gov/pubmed/28729154. → pages 51, 61

[111] J. Marquart, E. Y. Chen, and V. Prasad. Estimation of The Percentage of US Patients With Cancer Who Benefit From Genome-Driven Oncology. JAMA Oncology, 97239:1–7, apr 2018. ISSN 2374-2437. doi:10.1001/jamaoncol.2018.1660. URL http://dx.doi.org/10.1001/jamaoncol.2018.1660. → pages 43

[112] D. L. Masica and R. Karchin. Correlation of somatic mutation and expression identifies genes important in human glioblastoma progression and survival. Cancer research, 71(13):4550–61, July 2011. ISSN 1538-7445. doi:10.1158/0008-5472.CAN-11-0180. → pages 4

[113] S. Maxwell, M. R. Chance, and M. Koyuturk. Efficiently Enumerating All Connected Induced Subgraphs of a Large Molecular Network. In A.-H. Dediu, C. Martín-Vide, and B. Truthe, editors, Algorithms for Computational Biology, pages 171–182, Cham, 2014. Springer International Publishing. ISBN 978-3-319-07953-0. doi:10.1007/978-3-319-07953-0_14. URL http://link.springer.com/10.1007/978-3-319-07953-0_14. → pages 80

[114] A. McPherson, F. Hormozdiari, A. Zayed, R. Giuliany, G. Ha, M. G. F. Sun, M. Griffith, A. Heravi Moussavi, J. Senz, N. Melnyk, M. Pacheco, M. A. Marra, M. Hirst, T. O. Nielsen, S. C. Sahinalp, D. Huntsman, and S. P. Shah. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS computational biology, 7(5):e1001138, may 2011. ISSN 1553-7358. doi:10.1371/journal.pcbi.1001138. URL http://www.ncbi.nlm.nih.gov/pubmed/21625565. → pages 55, 65

[115] C. H. Mermel, S. E. Schumacher, B. Hill, M. L. Meyerson, R. Beroukhim, and G. Getz. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biology, 12(4):R41, 2011. ISSN 1465-6906. doi:10.1186/gb-2011-12-4-r41. URL https://www.ncbi.nlm.nih.gov/pubmed/21527027. → pages 3, 54

[116] D. Miao, C. A. Margolis, W. Gao, M. H. Voss, W. Li, D. J. Martini, C. Norton, D. Bosse, S. M. Wankowicz, D. Cullen, C. Horak, M. Wind-Rotolo, A. Tracy, M. Giannakis, F. S. Hodi, C. G. Drake, M. W. Ball, M. E. Allaf, A. Snyder, M. D. Hellmann, T. Ho, R. J. Motzer, S. Signoretti, W. G. Kaelin, T. K. Choueiri, and E. M. Van Allen. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science (New York, N.Y.), 5951(January):1–11, jan 2018. ISSN 1095-9203. doi:10.1126/science.aan5951. URL http://www.ncbi.nlm.nih.gov/pubmed/29301960. → pages 60

[117] C. A. Miller, S. H. Settle, E. P. Sulman, K. D. Aldape, and A. Milosavljevic. Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors. BMC medical genomics, 4(1):34, 2011. ISSN 1755-8794. doi:10.1186/1755-8794-4-34. URL http://www.biomedcentral.com/1755-8794/4/34. → pages 4, 76

[118] M. Mina, F. Raynaud, D. Tavernari, E. Battistello, S. Sungalee, S. Saghafinia, T. Laessle, F. Sanchez-Vega, N. Schultz, E. Oricchio, and G. Ciriello. Conditional Selection of Genomic Alterations Dictates Cancer Evolution and Oncogenic Dependencies. Cancer cell, 29(0):723–736, jul 2017. ISSN 1878-3686. doi:10.1016/j.ccell.2017.06.010. → pages 76

[119] G. Minuti and L. Landi. MET deregulation in breast cancer. Annals of translational medicine, 3(13):181, aug 2015. ISSN 2305-5839. doi:10.3978/j.issn.2305-5839.2015.06.22. URL http://www.ncbi.nlm.nih.gov/pubmed/26366398. → pages 84

[120] L. Montanaro, D. Trere, and M. Derenzini. Nucleolus, ribosomes, and cancer. The American journal of pathology, 173(2):301–10, aug 2008. ISSN 1525-2191. doi:10.2353/ajpath.2008.070752. URL http://www.ncbi.nlm.nih.gov/pubmed/18583314. → pages 85

[121] K. W. Mouw, M. S. Goldberg, P. A. Konstantinopoulos, and A. D. D’Andrea. DNA Damage and Repair Biomarkers of Immunotherapy Response. Cancer discovery, 7(7):675–693, 2017. ISSN 2159-8290. doi:10.1158/2159-8290.CD-17-0226. URL http://www.ncbi.nlm.nih.gov/pubmed/28630051. → pages 60

[122] A. Murat, E. Migliavacca, T. Gorlia, W. L. Lambiv, T. Shay, M.-F. Hamou, N. de Tribolet, L. Regli, W. Wick, M. C. M. Kouwenhoven, J. A. Hainfellner, F. L. Heppner, P.-Y. Dietrich, Y. Zimmer, J. G. Cairncross, R.-C. Janzer, E. Domany, M. Delorenzi, R. Stupp, and M. E. Hegi. Stem cell-related “self-renewal” signature and high epidermal growth factor receptor expression associated with resistance to concomitant chemoradiotherapy in glioblastoma. Journal of clinical oncology, 26(18):3015–24, jun 2008. ISSN 1527-7755. doi:10.1200/JCO.2007.15.7164. URL http://www.ncbi.nlm.nih.gov/pubmed/18565887. → pages 24

[123] S. Muthukrishnan and S. C. Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. In Proceedings of the Thirty-second Annual ACM Symposium on Theory of Computing, pages 416–424, New York, 2000. ACM. ISBN 1581131844. doi:10.1145/335305.335353. → pages 78

[124] A. M. Newman, C. L. Liu, M. R. Green, A. J. Gentles, W. Feng, Y. Xu, C. D. Hoang, M. Diehn, and A. A. Alizadeh. Robust enumeration of cell subsets from tissue expression profiles. Nature methods, 12(5):453–7, may 2015. ISSN 1548-7105. doi:10.1038/nmeth.3337. URL http://www.ncbi.nlm.nih.gov/pubmed/25822800. → pages 58, 68

[125] S. Ng, E. A. Collisson, A. Sokolov, T. Goldstein, A. Gonzalez-Perez, N. Lopez-Bigas, C. Benz, D. Haussler, and J. M. Stuart. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics, 28(18):640–646, 2012. ISSN 13674803. doi:10.1093/bioinformatics/bts402. → pages 4

[126] C. K. Osborne, V. Bardou, T. A. Hopp, G. C. Chamness, S. G. Hilsenbeck, S. A. W. Fuqua, J. Wong, D. C. Allred, G. M. Clark, and R. Schiff. Role of the estrogen receptor coactivator AIB1 (SRC-3) and HER-2/neu in tamoxifen resistance in breast cancer. Journal of the National Cancer Institute, 95(5):353–61, mar 2003. ISSN 0027-8874. doi:10.1017/CBO9781107415324.004. URL http://www.ncbi.nlm.nih.gov/pubmed/12618500. → pages 40

[127] D. Pan, A. Kobayashi, P. Jiang, L. Ferrari de Andrade, R. E. Tay, A. Luoma, D. Tsoucas, X. Qiu, K. Lim, P. Rao, H. W. Long, G.-C. Yuan, J. Doench, M. Brown, S. Liu, and K. W. Wucherpfennig. A major chromatin regulator determines resistance of tumor cells to T cell-mediated killing. Science (New York, N.Y.), 1710(January):1–12, jan 2018. ISSN 1095-9203. doi:10.1126/science.aao1710. URL http://www.ncbi.nlm.nih.gov/pubmed/29301958. → pages 60

[128] D. W. Parsons, S. Jones, X. Zhang, J. C.-H. Lin, R. J. Leary, P. Angenendt, et al. An integrated genomic analysis of human glioblastoma multiforme. Science (New York, N.Y.), 321(5897):1807–12, Sept. 2008. ISSN 1095-9203. doi:10.1126/science.1164382. URL https://www.ncbi.nlm.nih.gov/pubmed/18772396. → pages 3, 35

[129] A.-M. Patch, E. L. Christie, D. Etemadmoghadam, D. W. Garsed, J. George, S. Fereday, K. Nones, P. Cowin, K. Alsop, P. J. Bailey, K. S. Kassahn, F. Newell, M. C. J. Quinn, S. Kazakoff, K. Quek, C. Wilhelm-Benartzi, E. Curry, H. S. Leong, A. Hamilton, L. Mileshkin, G. Au-Yeung, C. Kennedy, J. Hung, Y.-E. Chiew, P. Harnett, M. Friedlander, M. Quinn, J. Pyman, S. Cordner, P. O'Brien, J. Leditschke, G. Young, K. Strachan, P. Waring, W. Azar, C. Mitchell, N. Traficante, J. Hendley, H. Thorne, M. Shackleton, D. K. Miller, G. M. Arnau, R. W. Tothill, T. P. Holloway, T. Semple, I. Harliwong, C. Nourse, E. Nourbakhsh, S. Manning, S. Idrisoglu, T. J. C. Bruxner, A. N. Christ, B. Poudel, O. Holmes, M. Anderson, C. Leonard, A. Lonie, N. Hall, S. Wood, D. F. Taylor, Q. Xu, J. L. Fink, N. Waddell, R. Drapkin, E. Stronach, H. Gabra, R. Brown, A. Jewell, S. H. Nagaraj, E. Markham, P. J. Wilson, J. Ellul, O. McNally, M. A. Doyle, R. Vedururu, C. Stewart, E. Lengyel, J. V. Pearson, N. Waddell, A. DeFazio, S. M. Grimmond, and D. D. L. Bowtell. Whole-genome characterization of chemoresistant ovarian cancer. Nature, 521(7553):489–494, 2015. ISSN 0028-0836. doi:10.1038/nature14410. → pages 36

[130] E. O. Paull, D. E. Carlin, M. Niepel, P. K. Sorger, D. Haussler, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics (Oxford, England), pages 1–8, Sept. 2013. ISSN 1367-4811. doi:10.1093/bioinformatics/btt471. → pages 6

[131] J. Pelletier, G. Thomas, and S. Volarevic. Ribosome biogenesis in cancer: new players and therapeutic avenues. Nature reviews. Cancer, 18(1):51–63, jan 2018. ISSN 1474-1768. doi:10.1038/nrc.2017.104. URL http://www.ncbi.nlm.nih.gov/pubmed/29192214. → pages 85

[132] S. Pena-Llopis, S. Vega-Rubín-de Celis, A. Liao, N. Leng, A. Pavía-Jimenez, S. Wang, T. Yamasaki, L. Zhrebker, S. Sivanand, P. Spence, L. Kinch, T. Hambuch, S. Jain, Y. Lotan, V. Margulis, A. I. Sagalowsky, P. B. Summerour, W. Kabbani, S. W. W. Wong, N. Grishin, M. Laurent, X.-J. Xie, C. D. Haudenschild, M. T. Ross, D. R. Bentley, P. Kapur, and J. Brugarolas. BAP1 loss defines a new class of renal cell carcinoma. Nature Genetics, 44(7):751–759, 2012. ISSN 1061-4036. doi:10.1038/ng.2323. URL https://www.ncbi.nlm.nih.gov/pubmed/22683710. → pages 60

[133] C. M. Perou, T. Sørlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lønning, A. L. Børresen-Dale, P. O. Brown, and D. Botstein. Molecular portraits of human breast tumours. Nature, 406(6797):747–52, aug 2000. ISSN 0028-0836. doi:10.1038/35021093. → pages 88

[134] T. S. K. Prasad, K. Kandasamy, and A. Pandey. Human Protein Reference Database and Human Proteinpedia as discovery tools for systems biology. Methods in molecular biology (Clifton, N.J.), 577:67–79, jan 2009. ISSN 1940-6029. doi:10.1007/978-1-60761-232-2_6. URL http://www.ncbi.nlm.nih.gov/pubmed/19718509. → pages 24

[135] V. Prasad. Perspective: The precision-oncology illusion. Nature, 537(7619):S63–S63, Sep 2016. ISSN 0028-0836. URL http://dx.doi.org/10.1038/537S63a. Outlook. → pages 43

[136] Y. Qi, Y. Suhail, Y.-y. Lin, J. D. Boeke, and J. S. Bader. Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome research, 18(12):1991–2004, Dec. 2008. ISSN 1088-9051. doi:10.1101/gr.077693.108. → pages 6

[137] J. Reimand and G. D. Bader. Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Molecular systems biology, 9(637):637, 2013. ISSN 1744-4292. doi:10.1038/msb.2012.68. URL https://www.ncbi.nlm.nih.gov/pubmed/23340843. → pages 3

[138] S. Ren, G.-H. Wei, D. Liu, L. Wang, Y. Hou, S. Zhu, L. Peng, Q. Zhang, Y. Cheng, H. Su, X. Zhou, J. Zhang, F. Li, H. Zheng, Z. Zhao, C. Yin, Z. He, X. Gao, H. E. Zhau, C.-Y. Chu, J. B. Wu, C. Collins, S. V. Volik, R. Bell, J. Huang, K. Wu, D. Xu, D. Ye, Y. Yu, L. Zhu, M. Qiao, H.-M. Lee, Y. Yang, Y. Zhu, X. Shi, R. Chen, Y. Wang, W. Xu, Y. Cheng, C. Xu, X. Gao, T. Zhou, B. Yang, J. Hou, L. Liu, Z. Zhang, Y. Zhu, C. Qin, P. Shao, J. Pang, L. W. Chung, J. Xu, C.-L. Wu, W. Zhong, X. Xu, Y. Li, X. Zhang, J. Wang, H. Yang, J. Wang, H. Huang, and Y. Sun. Whole-genome and Transcriptome Sequencing of Prostate Cancer Identify New Genetic Alterations Driving Disease Progression. European Urology, 73(3):322–339, mar 2018. ISSN 03022838. doi:10.1016/j.eururo.2017.08.027. URL http://www.ncbi.nlm.nih.gov/pubmed/28927585. → pages 24

[139] B. Reva, Y. Antipin, and C. Sander. Predicting the functional impact of protein mutations: Application to cancer genomics. Nucleic Acids Research, 39(17):37–43, 2011. ISSN 03051048. doi:10.1093/nar/gkr407. URL https://www.ncbi.nlm.nih.gov/pubmed/21727090. → pages 3

[140] A. L. Richardson, Z. C. Wang, A. De Nicolo, X. Lu, M. Brown, A. Miron, X. Liao, J. D. Iglehart, D. M. Livingston, and S. Ganesan. X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell, 9(2):121–132, 2006. ISSN 15356108. doi:10.1016/j.ccr.2006.01.013. → pages 24

[141] D. S. Rickman, T. D. Soong, B. Moss, J. M. Mosquera, J. Dlabal, S. Terry, T. MacDonald, K. Bunting, F. Demichelis, A. Melnick, O. Elemento, and M. A. Rubin. Oncogene-mediated alterations in chromatin conformation. Proceedings of the National Academy of Sciences of the United States of America, 109(23):9083–9088, 2012. ISSN 0008-5472. → pages 24

[142] A. Robertson, J. Shih, C. Yau, E. Gibb, J. Oba, K. Mungall, J. Hess, V. Uzunangelov, V. Walter, L. Danilova,T. Lichtenberg, M. Kucherlapati, P. Kimes, M. Tang, A. Penson, O. Babur, R. Akbani, C. Bristow, K. Hoadley, L. Iype,M. Chang, M. Abdel-Rahman, R. Akbani, A. Ally, J. Auman, O. Babur, M. Balasundaram, S. Balu, C. Benz,R. Beroukhim, I. Birol, T. Bodenheimer, J. Bowen, R. Bowlby, C. Bristow, D. Brooks, R. Carlsen, C. Cebulla,M. Chang, A. Cherniack, L. Chin, J. Cho, E. Chuah, S. Chudamani, C. Cibulskis, K. Cibulskis, L. Cope, S. Coupland,L. Danilova, T. Defreitas, J. Demchok, L. Desjardins, N. Dhalla, B. Esmaeli, I. Felau, M. Ferguson, S. Frazer,S. Gabriel, J. Gastier-Foster, N. Gehlenborg, M. Gerken, J. Gershenwald, G. Getz, E. Gibb, K. Griewank, E. Grimm,D. Hayes, A. Hegde, D. Heiman, C. Helsel, J. Hess, K. Hoadley, S. Hobensack, R. Holt, A. Hoyle, X. Hu, C. Hutter,M. Jager, S. Jefferys, C. Jones, S. Jones, C. Kandoth, K. Kasaian, J. Kim, P. Kimes, M. Kucherlapati, R. Kucherlapati,E. Lander, M. Lawrence, A. Lazar, S. Lee, K. Leraas, T. Lichtenberg, P. Lin, J. Liu, W. Liu, L. Lolla, Y. Lu, L. Iype,Y. Ma, H. Mahadeshwar, O. Mariani, M. Marra, M. Mayo, S. Meier, S. Meng, M. Meyerson, P. Mieczkowski, G. Mills,R. Moore, L. Mose, A. Mungall, K. Mungall, B. Murray, R. Naresh, M. Noble, J. Oba, A. Pantazi, M. Parfenov, P. Park,J. Parker, A. Penson, C. Perou, T. Pihl, R. Pilarski, A. Protopopov, A. Radenbaugh, K. Rai, N. Ramirez, X. Ren,S. Reynolds, J. Roach, A. Robertson, S. Roman-Roman, J. Roszik, S. Sadeghi, G. Saksena, X. Sastre, D. Schadendorf,J. Schein, L. Schoenfield, S. Schumacher, J. Seidman, S. Seth, G. Sethi, M. Sheth, Y. Shi, C. Shields, J. Shih,I. Shmulevich, J. Simons, A. Singh, P. Sipahimalani, T. Skelly, H. Sofia, M. Soloway, X. Song, M.-H. Stern, J. Stuart,Q. Sun, H. Sun, A. Tam, D. Tan, M. Tang, J. Tang, R. Tarnuzzer, B. Taylor, N. Thiessen, V. Thorsson, K. Tse,V. Uzunangelov, U. Veluvolu, R. Verhaak, D. Voet, V. Walter, Y. Wan, Z. Wang, J. 
Weinstein, M. Wilkerson, M. Williams, L. Wise, S. Woodman, T. Wong, Y. Wu, L. Yang, L. Yang, C. Yau, J. Zenklusen, J. Zhang, H. Zhang, E. Zmuda, A. Cherniack, C. Benz, G. Mills, R. Verhaak, K. Griewank, I. Felau, J. Zenklusen, J. Gershenwald, L. Schoenfield, A. Lazar, M. Abdel-Rahman, S. Roman-Roman, M.-H. Stern, C. Cebulla, M. Williams, M. Jager, S. Coupland, B. Esmaeli, C. Kandoth, and S. Woodman. Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma. Cancer Cell, 32(2):204–220, 2017. ISSN 18783686. doi:10.1016/j.ccell.2017.07.003. URL https://www.ncbi.nlm.nih.gov/pubmed/28810145. → pages 60, 75

[143] R. Rosenthal, N. McGranahan, J. Herrero, B. S. Taylor, and C. Swanton. deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1):31, 2016. ISSN 1474-760X. doi:10.1186/s13059-016-0893-4. URL https://www.ncbi.nlm.nih.gov/pubmed/26899170. → pages 52, 67

[144] B. Rosner. Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics, 25(2):165–172, 1983. → pages 24, 83

[145] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fobo, G. Frishman, C. Montrone, and H. W. Mewes. CORUM: The comprehensive resource of mammalian protein complexes-2009. Nucleic Acids Research, 38(SUPPL.1):497–501, 2009. ISSN 03051048. doi:10.1093/nar/gkp914. URL http://www.ncbi.nlm.nih.gov/pubmed/19884131. → pages 57

[146] J. J. Sacco, J. Kenyani, Z. Butt, R. Carter, H. Y. Chew, L. P. Cheeseman, S. Darling, M. Denny, S. Urbe, M. J. Clague, and J. M. Coulson. Loss of the deubiquitylase BAP1 alters class I histone deacetylase expression and sensitivity of mesothelioma cells to HDAC inhibitors. Oncotarget, 6(15):13757–71, 2015. ISSN 1949-2553. doi:10.18632/oncotarget.3765. URL http://www.ncbi.nlm.nih.gov/pubmed/25970771. → pages 60

[147] F. Sanchez-Garcia, U. D. Akavia, E. Mozes, and D. Pe’er. JISTIC: identification of significant targets in cancer. BMC bioinformatics, 11:189, 2010. ISSN 1471-2105. doi:10.1186/1471-2105-11-189. URL https://www.ncbi.nlm.nih.gov/pubmed/20398270. → pages 3

[148] R. F. Schwarz, A. Trinh, B. Sipos, J. D. Brenton, N. Goldman, and F. Markowetz. Phylogenetic quantification of intra-tumour heterogeneity. PLoS computational biology, 10(4):e1003535, apr 2014. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003535. URL http://www.ncbi.nlm.nih.gov/pubmed/24743184. → pages 78

[149] H. Sharifi-Noghabi, Y. Liu, N. Erho, R. Shrestha, M. Alshalalfa, E. Davicioni, C. C. Collins, and M. Ester. Deep genomic signature for early metastasis prediction in prostate cancer. bioRxiv, 2018. doi:10.1101/276055. URL https://doi.org/10.1101/276055. → pages 9

[150] N. L. Sharma, C. E. Massie, A. Ramos-Montoya, V. Zecchini, H. E. Scott, A. D. Lamb, S. MacArthur, R. Stark, A. Y. Warren, I. G. Mills, and D. E. Neal. The Androgen Receptor Induces a Distinct Transcriptional Program in Castration-Resistant Prostate Cancer in Man. Cancer Cell, 23(1):35–47, 2013. ISSN 15356108. doi:10.1016/j.ccr.2012.11.010. → pages 24

[151] B. S. Sheffield, A. V. Tinker, Y. Shen, H. Hwang, H. H. Li-Chang, E. Pleasance, C. Ch’ng, A. Lum, J. Lorette, Y. J. McConnell, S. Sun, S. J. Jones, A. M. Gown, D. G. Huntsman, D. F. Schaeffer, A. Churg, S. Yip, J. Laskin, and M. A. Marra. Personalized oncogenomics: Clinical experience with malignant peritoneal mesothelioma using whole genome sequencing. PLoS ONE, 10(3):1–12, 2015. ISSN 19326203. doi:10.1371/journal.pone.0119689. URL https://www.ncbi.nlm.nih.gov/pubmed/25798586. → pages 51

[152] M. F. Shlesinger. Mathematical physics: first encounters. Nature, 450(7166):40–41, 2007. ISSN 0028-0836. doi:10.1038/450040a. → pages 7

[153] I. Shmulevich, E. R. Dougherty, and W. Zhang. Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics (Oxford, England), 18(10):1319–1331, 2002. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/18.10.1319. → pages 7

[154] R. Shrestha, E. Hodzic, J. Yeung, K. Wang, T. Sauerwald, P. Dao, S. Anderson, H. Beltran, M. A. Rubin, C. C. Collins, G. Haffari, and S. C. Sahinalp. HIT’nDRIVE: Multi-driver gene prioritization based on hitting time. Research in Computational Molecular Biology: 18th Annual International Conference, RECOMB 2014, Pittsburgh, PA, USA, April 2-5, 2014, Proceedings, pages 293–306, 2014. doi:10.1007/978-3-319-05269-4_23. URL http://dx.doi.org/10.1007/978-3-319-05269-4_23. → pages 7, 11

[155] R. Shrestha, E. Hodzic, T. Sauerwald, P. Dao, K. Wang, J. Yeung, S. Anderson, F. Vandin, G. Haffari, C. C. Collins, and S. C. Sahinalp. HIT’nDRIVE: patient-specific multidriver gene prioritization for precision oncology. Genome research, 27(9):1573–1588, sep 2017. ISSN 1549-5469. doi:10.1101/gr.221218.117. URL https://www.ncbi.nlm.nih.gov/pubmed/28768687. → pages 7, 11, 52, 67, 75

[156] R. Shrestha, N. Nabavi, Y.-Y. Lin, F. Mo, S. Anderson, S. Volik, H. H. Adomat, D. Lin, H. Xue, X. Dong, R. Shukin, R. H. Bell, B. McConeghy, A. Haegert, S. Brahmbhatt, E. Li, H. Z. Oo, A. Hurtado-Coll, L. Fazli, J. Zhou, Y. McConnell, A. McCart, A. Lowy, G. B. Morin, M. Daugaard, S. C. Sahinalp, F. Hach, S. Le Bihan, M. E. Gleave, Y. Wang, A. Churg, and C. C. Collins. Integrated Multi-omics Molecular Subtyping Predicts Therapeutic Vulnerability in Malignant Peritoneal Mesothelioma. bioRxiv, 2018. doi:10.1101/243477. URL https://doi.org/10.1101/243477. → pages 8, 51

[157] N.-L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider, and P. C. Ng. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic acids research, 40(Web Server issue):W452–7, 2012. ISSN 1362-4962. doi:10.1093/nar/gks539. URL https://www.ncbi.nlm.nih.gov/pubmed/22689647. → pages 3

[158] A. D. Singhi, A. M. Krasinskas, H. A. Choudry, D. L. Bartlett, J. F. Pingpank, H. J. Zeh, A. Luvison, K. Fuhrer, N. Bahary, R. R. Seethala, and S. Dacic. The prognostic significance of BAP1, NF2, and CDKN2A in malignant peritoneal mesothelioma. Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc, 29(1):14–24, 2016. ISSN 1530-0285. doi:10.1038/modpathol.2015.121. URL http://www.ncbi.nlm.nih.gov/pubmed/26493618. → pages 51

[159] T. Sjoblom, L. D. Wood, D. W. Parsons, J. Lin, T. D. Barber, D. Mandelker, R. J. Leary, J. Ptak, N. Silliman, S. Szabo, P. Buckhaults, C. Farrell, P. Meeh, S. D. Markowitz, J. Willis, D. Dawson, J. K. V. Willson, A. F. Gazdar, J. Hartigan, L. Wu, C. Liu, G. Parmigiani, B. H. Park, and K. E. Bachman. The Consensus Coding Sequences of Human Breast and Colorectal Cancers. Science, 314(October):268–274, 2006. ISSN 0036-8075, 1095-9203. doi:10.1126/science.1133427. URL https://www.ncbi.nlm.nih.gov/pubmed/16959974. → pages 2

[160] M. R. Spalinger, R. Manzini, L. Hering, J. B. Riggs, C. Gottier, S. Lang, K. Atrott, A. Fettelschoss, F. Olomski, T. M. Kundig, M. Fried, D. F. McCole, G. Rogler, and M. Scharl. PTPN2 Regulates Inflammasome Activation and Controls Onset of Intestinal Inflammation and Colon Cancer. Cell reports, 22(7):1835–1848, feb 2018. ISSN 2211-1247. doi:10.1016/j.celrep.2018.01.052. → pages 86

[161] M. R. Stratton, P. J. Campbell, and P. A. Futreal. The cancer genome. Nature, 458(7239):719–24, Apr. 2009. ISSN 1476-4687. doi:10.1038/nature07943. URL https://www.ncbi.nlm.nih.gov/pubmed/19360079. → pages 1, 10

[162] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–50, oct 2005. ISSN 0027-8424. doi:10.1073/pnas.0506580102. → pages 44, 68, 89

[163] P. H. Sugarbaker and D. Chang. Long-term regional chemotherapy for patients with epithelial malignant peritoneal mesothelioma results in improved survival. European Journal of Surgical Oncology, 43(7):1228–1235, 2017. ISSN 15322157. doi:10.1016/j.ejso.2017.01.009. URL http://dx.doi.org/10.1016/j.ejso.2017.01.009. → pages 50

[164] L. Sun, A. M. Hui, Q. Su, A. Vortmeyer, Y. Kotliarov, S. Pastorino, A. Passaniti, J. Menon, J. Walling, R. Bailey, M. Rosenblum, T. Mikkelsen, and H. A. Fine. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell, 9(4):287–300, 2006. ISSN 15356108. doi:10.1016/j.ccr.2006.03.003. → pages 24

[165] C. Suo, O. Hrydziuszko, D. Lee, S. Pramana, D. Saputra, H. Joshi, S. Calza, and Y. Pawitan. Integration of somatic mutation, expression and functional data reveals potential driver genes predictive of breast cancer survival. Bioinformatics, 31(16):2607, Mar. 2015. doi:10.1093/bioinformatics/btv164. → pages 4

[166] S. Suthram, A. Beyer, R. M. Karp, Y. Eldar, and T. Ideker. eQED: an efficient method for interpreting eQTL associations using protein networks. Molecular systems biology, 4(162):162, 2008. ISSN 1744-4292. doi:10.1038/msb.2008.4. → pages 5

[167] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452, 2014. ISSN 0305-1048. doi:10.1093/nar/gku1003. URL http://www.ncbi.nlm.nih.gov/pubmed/25352553. → pages 67

[168] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452, 2015. doi:10.1093/nar/gku1003. → pages 23

[169] B. S. Taylor, N. Schultz, H. Hieronymus, A. Gopalan, Y. Xiao, B. S. Carver, V. K. Arora, P. Kaushik, E. Cerami, B. Reva, Y. Antipin, N. Mitsiades, T. Landers, I. Dolgalev, J. E. Major, M. Wilson, N. D. Socci, A. E. Lash, A. Heguy, J. A. Eastham, H. I. Scher, V. E. Reuter, P. T. Scardino, C. Sander, C. L. Sawyers, and W. L. Gerald. Integrative genomic profiling of human prostate cancer. Cancer cell, 18(1):11–22, jul 2010. ISSN 1878-3686. doi:10.1016/j.ccr.2010.05.026. → pages 24

[170] TCGA. Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407):330–7, July 2012. ISSN 1476-4687. doi:10.1038/nature11252. URL https://www.ncbi.nlm.nih.gov/pubmed/22810696. → pages 3, 77, 82, 84

[171] N. Tebbutt, M. W. Pedersen, and T. G. Johns. Targeting the ERBB family in cancer: couples therapy. Nature Reviews Cancer, 13(9):663–673, 2013. ISSN 1474-175X. doi:10.1038/nrc3559. URL http://www.ncbi.nlm.nih.gov/pubmed/23949426. → pages 84

[172] J. R. Testa. Asbestos and Mesothelioma. Current Cancer Research. Springer International Publishing, 2017. ISBN 978-3-319-53558-6. doi:10.1007/978-3-319-53560-9. URL http://link.springer.com/10.1007/978-3-319-53560-9. → pages 50

[173] P. Tetali. Design of on-line algorithms using hitting times. SIAM J. Comput., 28(4):1232–1246, 1999. → pages 13

[174] B. Thapa, A. Salcedo, X. Lin, M. Walkiewicz, C. Murone, M. Ameratunga, K. Asadi, S. Deb, S. A. Barnett, S. Knight, P. Mitchell, D. N. Watkins, P. C. Boutros, and T. John. The Immune Microenvironment, Genome-wide Copy Number Aberrations, and Survival in Mesothelioma. Journal of Thoracic Oncology, 12(5):850–859, 2017. ISSN 15561380. doi:10.1016/j.jtho.2017.02.013. URL http://dx.doi.org/10.1016/j.jtho.2017.02.013. → pages 51

[175] The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–8, Oct. 2008. ISSN 1476-4687. doi:10.1038/nature07385. → pages 15, 22, 35, 44, 82

[176] The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609–15, June 2011. ISSN 1476-4687. doi:10.1038/nature10166. → pages 15, 22, 35, 36, 44

[177] The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, Oct. 2012. ISSN 0028-0836. doi:10.1038/nature11412. → pages 15, 22, 35, 44, 82

[178] The Cancer Genome Atlas Research Network. The Molecular Taxonomy of Primary Prostate Cancer. Cell, 163(4):1011–25, nov 2015. ISSN 1097-4172. doi:10.1016/j.cell.2015.10.025. → pages 15, 22, 35, 36, 44

[179] H. Thorvaldsdottir, J. T. Robinson, and J. P. Mesirov. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2):178–192, 2013. ISSN 14675463. doi:10.1093/bib/bbs017. URL https://www.ncbi.nlm.nih.gov/pubmed/22517427. → pages 63

[180] S. A. Tomlins, D. R. Rhodes, S. Perner, S. M. Dhanasekaran, R. Mehra, X.-W. Sun, S. Varambally, X. Cao, J. Tchinda, R. Kuefer, C. Lee, J. E. Montie, R. B. Shah, K. J. Pienta, M. A. Rubin, and A. M. Chinnaiyan. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science (New York, N.Y.), 310(5748):644–648, 2005. ISSN 0036-8075. doi:10.1126/science.1117679. → pages 36

[181] M. Torchala, P. Chelminiak, and P. A. Bates. Mean first-passage time calculations: Comparison of the deterministic Hill’s algorithm with Monte Carlo simulations. European Physical Journal B, 85(4), 2012. ISSN 14346028. doi:10.1140/epjb/e2012-20760-8. → pages 7

[182] M. Torchala, P. Chelminiak, M. Kurzynski, and P. A. Bates. RaTrav: a tool for calculating mean first-passage times on biochemical networks. BMC systems biology, 7:130, 2013. ISSN 1752-0509. doi:10.1186/1752-0509-7-130. URL http://www.ncbi.nlm.nih.gov/pubmed/24261882. → pages 7

[183] Z. Tu, L. Wang, M. N. Arbeitman, T. Chen, and F. Sun. An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics, 22(14):489–496, 2006. ISSN 13674803. doi:10.1093/bioinformatics/btl234. → pages 6

[184] G. Ugurluer, K. Chang, M. E. Gamez, A. L. Arnett, R. Jayakrishnan, R. C. Miller, and T. T. Sio. Genome-based Mutational Analysis by Next Generation Sequencing in Patients with Malignant Pleural and Peritoneal Mesothelioma. Anticancer research, 36(5):2331–8, may 2016. ISSN 1791-7530. URL http://www.ncbi.nlm.nih.gov/pubmed/27127140. → pages 51

[185] I. Ulitsky, A. Krishnamurthy, R. M. Karp, and R. Shamir. DEGAS: de novo discovery of dysregulated pathways in human diseases. PloS one, 5(10):e13367, oct 2010. ISSN 1932-6203. doi:10.1371/journal.pone.0013367. URL http://www.ncbi.nlm.nih.gov/pubmed/20976054. → pages 76, 87

[186] A. Untergasser, I. Cutcutache, T. Koressaar, J. Ye, B. C. Faircloth, M. Remm, and S. G. Rozen. Primer3–new capabilities and interfaces. Nucleic acids research, 40(15):e115, aug 2012. ISSN 1362-4962. doi:10.1093/nar/gks596. URL http://www.ncbi.nlm.nih.gov/pubmed/22730293. → pages 65

[187] E. M. Van Allen, N. Wagle, P. Stojanov, D. L. Perrin, K. Cibulskis, S. Marlow, J. Jane-Valbuena, D. C. Friedrich, G. Kryukov, S. L. Carter, A. McKenna, A. Sivachenko, M. Rosenberg, A. Kiezun, D. Voet, M. Lawrence, L. T. Lichtenstein, J. G. Gentry, F. W. Huang, J. Fostel, D. Farlow, D. Barbie, L. Gandhi, E. S. Lander, S. W. Gray, S. Joffe, P. Janne, J. Garber, L. MacConaill, N. Lindeman, B. Rollins, P. Kantoff, S. A. Fisher, S. Gabriel, G. Getz, and L. A. Garraway. Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine. Nature medicine, 20(6):682–8, jun 2014. ISSN 1546-170X. doi:10.1038/nm.3559. → pages 35

[188] E. Van Dyk, M. J. T. Reinders, and L. F. A. Wessels. A scale-space method for detecting recurrent DNA copy number changes with analytical false discovery rate control. Nucleic Acids Research, 41(9), 2013. ISSN 03051048. doi:10.1093/nar/gkt155. URL https://www.ncbi.nlm.nih.gov/pubmed/23476020. → pages 3


[189] F. Vandin, E. Upfal, and B. J. Raphael. Algorithms for detecting significantly mutated pathways in cancer. Journal of computational biology : a journal of computational molecular cell biology, 18(3):507–22, Mar. 2011. ISSN 1557-8666. doi:10.1089/cmb.2010.0265. → pages 6, 76

[190] F. Vandin, E. Upfal, and B. J. Raphael. De novo discovery of mutated driver pathways in cancer. Genome research, 22(2):375–85, Feb. 2012. ISSN 1549-5469. doi:10.1101/gr.120477.111. → pages 4, 76, 88

[191] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan. Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology, 6(1), 2010. ISSN 1553734X. doi:10.1371/journal.pcbi.1000641. → pages 6

[192] C. J. Vaske, S. C. Benz, J. Z. Sanborn, D. Earl, C. Szeto, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics (Oxford, England), 26(12):i237–45, June 2010. ISSN 1367-4811. doi:10.1093/bioinformatics/btq182. → pages 4

[193] R. G. W. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer cell, 17(1):98–110, Jan. 2010. ISSN 1878-3686. doi:10.1016/j.ccr.2009.12.020. → pages 35

[194] R. Visconti, R. Della Monica, and D. Grieco. Cell cycle checkpoint in cancer: a therapeutically targetable double-edged sword. Journal of experimental & clinical cancer research : CR, 35(1):153, sep 2016. ISSN 1756-9966. doi:10.1186/s13046-016-0433-9. → pages 86

[195] B. Vogelstein, N. Papadopoulos, V. E. Velculescu, S. Zhou, L. A. Diaz, and K. W. Kinzler. Cancer genome landscapes. Science (New York, N.Y.), 339(6127):1546–58, mar 2013. ISSN 1095-9203. doi:10.1126/science.1235122. URL https://www.ncbi.nlm.nih.gov/pubmed/23539594. → pages 1, 2, 10, 75

[196] V. Walter, A. B. Nobel, and F. A. Wright. DiNAMIC: A method to identify recurrent DNA copy number aberrations in tumors. Bioinformatics, 27(5):678–685, 2011. ISSN 13674803. doi:10.1093/bioinformatics/btq717. URL https://www.ncbi.nlm.nih.gov/pubmed/21183584. → pages 3

[197] K. Wang, M. Li, and H. Hakonarson. ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16):1–7, 2010. ISSN 03051048. doi:10.1093/nar/gkq603. URL https://www.ncbi.nlm.nih.gov/pubmed/20601685. → pages 3, 63

[198] K. Wang, R. Shrestha, A. W. Wyatt, A. Reddy, J. Lehar, Y. Wang, A. Lapuk, and C. C. Collins. A meta-analysis approach for characterizing pan-cancer mechanisms of drug sensitivity in cell lines. PloS one, 9(7):e103050, 2014. ISSN 1932-6203. doi:10.1371/journal.pone.0103050. URL http://www.ncbi.nlm.nih.gov/pubmed/25036042. → pages 9

[199] M. D. Wilkerson and D. N. Hayes. ConsensusClusterPlus: A class discovery tool with confidence assessments and item tracking. Bioinformatics, 26(12):1572–1573, 2010. ISSN 13674803. doi:10.1093/bioinformatics/btq170. URL https://www.ncbi.nlm.nih.gov/pubmed/20427518. → pages 67

[200] L. D. Wood, D. W. Parsons, S. Jones, J. Lin, T. Sjoblom, R. J. Leary, D. Shen, S. M. Boca, T. Barber, J. Ptak, N. Silliman, S. Szabo, Z. Dezso, V. Ustyanksky, T. Nikolskaya, Y. Nikolsky, R. Karchin, P. A. Wilson, J. S. Kaminker, Z. Zhang, R. Croshaw, J. Willis, D. Dawson, M. Shipitsin, J. K. V. Willson, S. Sukumar, K. Polyak, B. H. Park, C. L. Pethiyagoda, P. V. K. Pant, D. G. Ballinger, A. B. Sparks, J. Hartigan, D. R. Smith, E. Suh, N. Papadopoulos, P. Buckhaults, S. D. Markowitz, G. Parmigiani, K. W. Kinzler, V. E. Velculescu, and B. Vogelstein. The genomic landscapes of human breast and colorectal cancers. Science (New York, N.Y.), 318(5853):1108–1113, 2007. ISSN 1095-9203. doi:10.1126/science.1145720. URL https://www.ncbi.nlm.nih.gov/pubmed/17932254. → pages 3


[201] A. W. Wyatt, F. Mo, K. Wang, B. McConeghy, S. Brahmbhatt, L. Jong, D. M. Mitchell, R. L. Johnston, A. Haegert, E. Li, J. Liew, J. Yeung, R. Shrestha, A. V. Lapuk, A. McPherson, R. Shukin, R. H. Bell, S. Anderson, J. Bishop, A. Hurtado-Coll, H. Xiao, A. M. Chinnaiyan, R. Mehra, D. Lin, Y. Wang, L. Fazli, M. E. Gleave, S. V. Volik, and C. C. Collins. Heterogeneity in the inter-tumor transcriptome of high risk prostate cancer. Genome biology, 15(8):426, aug 2014. ISSN 1474-760X. doi:10.1186/s13059-014-0426-y. URL http://www.ncbi.nlm.nih.gov/pubmed/25155515. → pages 9

[202] M. Yamada, J. Tang, J. Lugo-Martinez, E. Hodzic, R. Shrestha, H. Ouyang, P. Radivojac, C. Sahinalp, F. Menczer, Y. Chang, A. Saha, H. Mamitsuka, and D. Yin. Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data. IEEE Transactions on Knowledge and Data Engineering, 30(7):1352–1365, 2018. ISSN 1041-4347. doi:10.1109/TKDE.2018.2789451. URL https://doi.org/10.1109/TKDE.2018.2789451. → pages 9

[203] X. Yao, H. Hao, Y. Li, and S. Li. Modularity-based credible prediction of disease genes and detection of disease subtypes on the phenotype-gene heterogeneous network. BMC systems biology, 5(1):79, 2011. ISSN 1752-0509. doi:10.1186/1752-0509-5-79. URL http://www.biomedcentral.com/1752-0509/5/79. → pages 7

[204] E. Yeger-Lotem, L. Riva, L. J. Su, A. D. Gitler, A. G. Cashikar, O. D. King, P. K. Auluck, M. L. Geddie, J. S. Valastyan, D. R. Karger, S. Lindquist, and E. Fraenkel. Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nature genetics, 41(3):316–323, 2009. ISSN 1061-4036. doi:10.1038/ng.337. → pages 6

[205] K. Yoshihara, A. Tajima, D. Komata, T. Yamamoto, S. Kodama, H. Fujiwara, M. Suzuki, Y. Onishi, M. Hatae, K. Sueyoshi, H. Fujiwara, Y. Kudo, I. Inoue, and K. Tanaka. Gene expression profiling of advanced-stage serous ovarian cancers distinguishes novel subclasses and implicates ZEB2 in tumor progression and prognosis. Cancer Science, 100(8):1421–1428, 2009. ISSN 13479032. doi:10.1111/j.1349-7006.2009.01204.x. → pages 24

[206] K. Yoshihara, M. Shahmoradgoli, E. Martínez, R. Vegesna, H. Kim, W. Torres-Garcia, V. Trevino, H. Shen, P. W. Laird, D. A. Levine, S. L. Carter, G. Getz, K. Stemke-Hale, G. B. Mills, and R. G. W. Verhaak. Inferring tumour purity and stromal and immune cell admixture from expression data. Nature communications, 4:2612, 2013. ISSN 2041-1723. doi:10.1038/ncomms3612. URL http://www.ncbi.nlm.nih.gov/pubmed/24113773. → pages 58, 68

[207] K. Yoshihara, Q. Wang, W. Torres-Garcia, S. Zheng, R. Vegesna, H. Kim, and R. G. W. Verhaak. The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene, 34(37):4845–4854, 2014. ISSN 0950-9232. doi:10.1038/onc.2014.406. → pages 23, 35, 36

[208] Y. Yoshikawa, M. Emi, T. Hashimoto-Tamaoki, M. Ohmuraya, A. Sato, T. Tsujimura, S. Hasegawa, T. Nakano, M. Nasu, S. Pastorino, A. Szymiczek, A. Bononi, M. Tanji, I. Pagano, G. Gaudino, A. Napolitano, C. Goparaju, H. I. Pass, H. Yang, and M. Carbone. High-density array-CGH with targeted NGS unmask multiple noncontiguous minute deletions on chromosome 3p21 in mesothelioma. Proceedings of the National Academy of Sciences of the United States of America, 113(47):13432–13437, 2016. ISSN 1091-6490. doi:10.1073/pnas.1612074113. URL http://www.ncbi.nlm.nih.gov/pubmed/27834213. → pages 54, 75

[209] A. Youn and R. Simon. Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics (Oxford, England), 27(2):175–81, Jan. 2011. ISSN 1367-4811. doi:10.1093/bioinformatics/btq630. URL https://www.ncbi.nlm.nih.gov/pubmed/21169372. → pages 2

[210] H. Yu, H. Pak, I. Hammond-Martel, M. Ghram, A. Rodrigue, S. Daou, H. Barbour, L. Corbeil, J. Hebert, E. Drobetsky, J. Y. Masson, J. M. Di Noia, and E. B. Affar. Tumor suppressor and deubiquitinase BAP1 promotes DNA double-strand break repair. Proceedings of the National Academy of Sciences, 111(1):285–290, 2014. ISSN 0027-8424. doi:10.1073/pnas.1309085110. URL http://www.ncbi.nlm.nih.gov/pubmed/24347639. → pages 60


[211] S. Zaccaria, M. El-kebir, G. W. Klau, and B. J. Raphael. The Copy-Number Tree Mixture Deconvolution Problem and Applications to Multi-sample Bulk Sequencing Tumor Data. In S. C. Sahinalp, editor, Research in Computational Molecular Biology, pages 318–335, Cham, 2017. Springer International Publishing. ISBN 978-3-319-56970-3. doi:10.1007/978-3-319-56970-3_20. URL http://link.springer.com/10.1007/978-3-319-56970-3. → pages 78

[212] Q. Zhang, L. Ding, D. E. Larson, D. C. Koboldt, M. D. McLellan, K. Chen, X. Shi, A. Kraja, E. R. Mardis, R. K. Wilson, I. B. Borecki, and M. A. Province. CMDS: A population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics, 26(4):464–469, 2009. ISSN 13674803. doi:10.1093/bioinformatics/btp708. URL https://www.ncbi.nlm.nih.gov/pubmed/20031968. → pages 3
