Deep Radiomics Analytics Pipeline for Prognosis of Pancreatic Ductal Adenocarcinoma
By
Yucheng Zhang
A thesis submitted in conformity with the requirements
for the degree of Master of Science
Institute of Medical Science
University of Toronto
© Copyright by Yucheng Zhang (2019)
Deep Radiomics Analytics Pipeline for Pancreatic Ductal Adenocarcinoma
Yucheng Zhang
Master of Science
Institute of Medical Science
University of Toronto
2019
Abstract
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive cancers with
extremely poor prognosis. Radiomics has shown prognostic ability in multiple types of
cancer including PDAC. However, the prognostic value of traditional radiomics pipelines,
which are based on hand-crafted radiomic features alone, is limited due to high correlation
among features and the multiple testing problem. Deep learning architectures, such as Convolutional Neural Networks (CNNs), have been shown to outperform traditional feature-based approaches in computer vision tasks such as object detection. Nonetheless, they require
large sample sizes for training which limits their application in medical imaging. As an
alternative solution, CNN-based transfer learning has shown potential for achieving
reasonable performance using datasets with small sample sizes. In this work, we developed a
CNN-based deep radiomics pipeline based on transfer learning, which outperforms the
traditional radiomics model in resectable PDAC prognostication.
Acknowledgements
First and foremost, I would like to thank my supervisors, Dr. Farzad Khalvati and Dr. Masoom Haider. I appreciate all your contributions of time, inspiration, and effort. This project would not have been possible without your guidance. It has been a great honor to be a member of this research group for four years. As an international student, I have come to realize that this research group has become my home in Canada.
I would like to express my sincerest gratitude to my committee members: Dr. Babak Taati and Dr.
Qiang Sun. It was my pleasure to discuss the research with you and I sincerely appreciate your valuable
time, suggestions, and insightful discussions.
I would also like to thank my family: my mother, Chen Wang, and my grandparents, Liying Qian and Tiancai Wang. Thank you for your love, support, and encouragement on this journey. Although we are a few thousand miles apart, every phone call and text message from you has strengthened my resolve. I also need to thank my dear friends in Beijing, Connecticut, and Toronto. Thank you for all your support.
Finally, I want to express my deepest gratitude and respect to the patients enrolled in this study. It is your contribution that made this project possible.
Contributions
Dr. Farzad Khalvati and Dr. Masoom Haider:
Supervised and directed all aspects of this research study and thesis.
Dr. Edrise M. Lobo-Mueller:
Contoured the Region of Interest on pre-operative CT images.
Dr. Paul Karanicolas and Dr. Steven Gallinger:
Assisted in patient enrollment and provided essential data.
Table of Contents
Abstract ....................................................................................................................................................... ii
Acknowledgements .................................................................................................................................... iii
Contributions.............................................................................................................................................. iv
List of Tables ............................................................................................................................... ix
List of Figures ............................................................................................................................................. x
List of Abbreviations ................................................................................................................................ xii
Chapter 1: Literature Review ...................................................................................................................... 1
1.1 Pancreatic Ductal Adenocarcinoma .................................................................................................. 1
1.1.1 Introduction ................................................................................................................................ 1
1.1.2 Risk factors ................................................................................................................................ 1
1.1.3 Diagnostic biomarkers ............................................................................................................... 3
1.1.4 Treatment ................................................................................................................................... 6
1.1.5 Biomarker for chemotherapy response ...................................................................................... 7
1.1.6 Prognostic markers..................................................................................................................... 9
1.2 Radiomics: analysis of quantitative imaging markers .................................................................... 11
1.2.1 Introduction .............................................................................................................................. 11
1.2.2 Pipeline .................................................................................................................................... 12
1.2.3 Segmentation............................................................................................................................ 12
1.2.4 Feature extraction..................................................................................................................... 14
1.2.5 Feature analysis and model building........................................................................................ 15
1.2.6 Current progress ....................................................................................................................... 17
1.2.7 Limitations of traditional radiomics analytic pipeline ............................................................. 20
1.3 Deep learning in medical imaging .................................................................................................. 23
1.3.1 Neural Networks and CNN ...................................................................................................... 23
1.3.2 ResNet ...................................................................................................................................... 31
1.3.3 Transfer learning ...................................................................................................................... 32
1.3.4 Deep learning in medical imaging research ............................................................................. 34
1.3.5 Future direction ........................................................................................................................ 38
Chapter 2: Aim and hypothesis ................................................................................................................. 40
2.1 Study 1 ............................................................................................................................................ 40
2.1.1 Aims ......................................................................................................................................... 40
2.1.2 Hypothesis................................................................................................................................ 40
2.1.3 Rationale for hypothesis .......................................................................................................... 41
2.2 Study 2 ............................................................................................................................................ 41
2.2.1 Aims ......................................................................................................................................... 41
2.2.2 Hypothesis................................................................................................................................ 42
2.2.3 Rationale for hypothesis .......................................................................................................... 42
2.3 Study 3 ............................................................................................................................................ 43
2.3.1 Aims ......................................................................................................................................... 43
2.3.2 Hypothesis................................................................................................................................ 43
2.3.3 Rationale for hypothesis .......................................................................................................... 43
Chapter 3: Study 1 .................................................................................................................................... 45
3.1 Abstract ........................................................................................................................................... 46
3.2 Introduction ..................................................................................................................................... 46
3.3 Methods........................................................................................................................................... 49
3.3.1 Dataset...................................................................................................................................... 49
3.3.2 Radiomics feature extraction ................................................................................................... 50
3.3.3 Transfer learning ...................................................................................................................... 51
3.3.4 Feature analysis ........................................................................................................................ 52
3.4 Results ............................................................................................................................................. 53
3.4.1 Feature-wise prognostic values ................................................................................................ 53
3.4.2 Prognostic model performance ................................................................................................ 53
3.4.3 Risk score ................................................................................................................................. 54
3.5 Discussion ....................................................................................................................................... 56
3.6 Conclusion ...................................................................................................................................... 58
Chapter 4: Study 2 .................................................................................................................................... 59
4.1: Abstract .......................................................................................................................................... 60
4.2: Introduction .................................................................................................................................... 60
4.3 Methods........................................................................................................................................... 64
4.3.1 Dataset...................................................................................................................................... 64
4.3.2 Radiomics Feature Extraction .................................................................................................. 64
4.3.3 Transfer Learning Feature Extraction ...................................................................................... 65
4.3.4 Correlation ............................................................................................................................... 66
4.3.5 Proposed Prognosis Model ...................................................................................................... 66
4.4 Results ............................................................................................................................................. 68
4.4.1 Correlation Analysis Between Pre-defined and Deep Radiomic Features .............................. 68
4.4.2 Prognosis Performance of the Proposed Prognosis Model ...................................................... 70
4.5 Discussion ....................................................................................................................................... 73
Chapter 5: Study 3 .................................................................................................................................... 75
5.1 Abstract ........................................................................................................................................... 76
5.2 Introduction ..................................................................................................................................... 76
5.3 Methods........................................................................................................................................... 79
5.3.1 Data .......................................................................................................................................... 79
5.3.2 Architecture of the proposed CNN-Survival ........................................................................... 79
5.3.3 Loss Function ........................................................................................................................... 80
5.3.4 Training process and Transfer Learning .................................................................................. 80
5.3.5 Traditional Radiomics analytic pipeline .................................................................................. 81
5.4 Results ............................................................................................................................................. 84
5.5 Discussion ....................................................................................................................................... 86
5.6 Conclusion ...................................................................................................................................... 87
Chapter 6: General Discussion.................................................................................................................. 88
6.1: Study 1 ........................................................................................................................................... 88
6.1.1 Discussion ................................................................................................................................ 88
6.1.2 Strength and limitations ........................................................................................................... 91
6.1.3 Implications.............................................................................................................................. 92
6.2: Study 2 ........................................................................................................................................... 93
6.2.1 Discussion ................................................................................................................................ 93
6.2.2 Strength and limitations ........................................................................................................... 96
6.2.3 Implications.............................................................................................................................. 97
6.3: Study 3 ........................................................................................................................................... 97
6.3.1 Discussion ................................................................................................................................ 97
6.3.2 Strength and limitations ........................................................................................................... 99
Chapter 7: Conclusions ........................................................................................................................... 100
Chapter 8: Future directions.................................................................................................................... 101
References ............................................................................................................................................... 102
Appendix ................................................................................................................................................. 131
List of Tables
Table 1.1: List of available biomarkers and their performances ................................................................4
Table 1.2: List of ECOG criteria ................................................................................................................9
Table 1.3: List of common features .........................................................................................................14
Table 1.4: List of recent radiomics studies and their performances in AUC ............................................22
Table 1.5: List of representative segmentation studies in the medical imaging field .............................37
Table 3.1: List of radiomic feature classes and filters .............................................................................51
Table 3.2: List of hazard ratios and p values ...........................................................................................56
Table 4.1: Number of features extracted from different filters ................................................................65
Table 4.2: Absolute Pearson correlation coefficient between features ....................................................68
Table 4.3: Summary table for models using four feature reduction methods ..........................................71
Table 5.1: Concordance index of proposed models .................................................................................85
Table A.1: List of significant PyRadiomics features for PDAC prognosis ...........................................131
List of Figures
Figure 1.1: Traditional Radiomics Pipeline ..............................................................................................12
Figure 1.2: Typical CNN architecture ......................................................................................................25
Figure 1.3: Graphical presentation of convolution operations .................................................................26
Figure 1.4: Graphical representation of zero-padding .............................................................................27
Figure 1.5: Graphical representation of max pooling ..............................................................................28
Figure 1.6: Graphical representation of Fully Connected Layers ............................................................29
Figure 1.7: Graphical representation of gradient descent algorithm ........................................................30
Figure 1.8: Graphical representation of identity path ..............................................................................31
Figure 1.9: Graphical representation of transfer learning in CNN ...........................................................33
Figure 3.1: Manual contour of CT scan from a representative patient in cohort 2 ...................................50
Figure 3.2: Workflow for transfer learning studies ..................................................................................52
Figure 3.3: ROC curves ............................................................................................................................54
Figure 3.4: Kaplan-Meier plots for OS in Cohort 2 ..................................................................................55
Figure 4.1: Pipelines for different feature fusion methods ......................................................................67
Figure 4.2: Correlation heatmap of three different feature extraction methods .......................................69
Figure 4.3: Histogram of Pearson correlation coefficients .......................................................................70
Figure 4.4: ROC curves of models using four feature reduction methods ..............................................72
Figure 5.1: The proposed CNN-Survival architecture .............................................................................82
Figure 5.2: Example of input CT images ................................................................................................83
Figure 5.3: Example of small ROI in Cohort 1 .........................................................................................83
Figure 5.4: Loss changes during pre-train ................................................................................................84
Figure 5.5: Survival probability example 1 .............................................................................................85
Figure 5.6: Survival probability example 2 ..............................................................................................86
List of Abbreviations
2D Two-dimensional
3D Three-dimensional
AI Artificial Intelligence
ANN Artificial Neural Network
ANOVA Analysis of Variance
AUC Area Under Curve
CAD Computer-aided diagnosis
CADe Computer-aided detection
CAM Class activation map
CBCT Cone beam computed tomography
CCC Concordance correlation coefficient
CI Confidence interval
CNN Convolutional Neural Network
CONV Convolution
CPH Cox Proportional Hazards Model
CT Computed tomography
DL Deep Learning
DNN Deep Neural Network
FBP Filtered back projection
FC Fully Connected
FCN Fully Convolutional Network
GAN Generative Adversarial Network
GLM Generalized linear model
GPU Graphical processing unit
HR Hazard Ratio
ICA Independent component analysis
ICC Intraclass correlation coefficient
ISBI International Symposium on Biomedical Imaging
LDA Linear Discriminant Analysis
LSTM Long Short-Term Memory
MR Magnetic resonance
NN Neural Network
NSCLC Non-small cell lung cancer
PCA Principal Component Analysis
PDAC Pancreatic Ductal Adenocarcinoma
PET Positron emission tomography
ReLU Rectified Linear Unit
RF Random Forest
RGB Red, green, and blue
RNN Recurrent Neural Network
ROC Receiver operating characteristic
RR Relative Risk
SGD Stochastic Gradient Descent
SMOTE Synthetic Minority Over-sampling Technique
SVM Support Vector Machine
Chapter 1: Literature Review
1.1 Pancreatic Ductal Adenocarcinoma
1.1.1 Introduction
Pancreatic Ductal Adenocarcinoma (PDAC) is a lethal cancer with poor prognosis and increasing incidence. It is estimated that more than 350,000 people worldwide are diagnosed with PDAC each year (McGuigan et al., 2018). PDAC has a low 5-year survival rate of approximately 7.1% (Stark et al., 2016), making it the fourth leading cause of cancer-related deaths (Ilic & Ilic, 2016). In addition, incidence rates vary significantly around the world. Wong et al. showed that developed countries have higher incidence rates than developing countries (Wong et al., 2017), and Europe and North America have the highest age-standardized incidence rates (Ilic & Ilic, 2016). Moreover, the incidence rate is increasing in the Western world: Saad et al. found that the age-adjusted incidence rate is increasing by 1.03% per year in the United States. It is estimated that by 2030, pancreatic cancer will become the second most common cause of cancer-related death in the United States (Siegel et al., 2009).
Significant improvements in cancer screening methods and treatment therapies have improved survival rates for most cancers (Adamska, Domenichini, & Falasca, 2017a; Urruticoechea et al., 2010). Unfortunately, the survival rate for PDAC patients has remained almost unchanged (Adamska et al., 2017a). In this study, we aimed to develop a CT image-based prognosis model for PDAC patients to help healthcare professionals make personalized and efficient treatment plans. To develop such a model, it is critical to review recognized PDAC risk factors, treatment options, and diagnostic and prognostic markers. These topics are discussed in the following sections of Chapter 1.
1.1.2 Risk factors
Researchers have identified several risk factors for PDAC, including sex, age, blood group, gut microbiota, diabetes, smoking, and family history (Arnold et al., 2009; Bosetti et al., 2012; Memba et al., 2017; Midha, Chawla, & Garg, 2016; Pernick et al., 2003; Rohrmann et al., 2009; Silverman et al., 2003; Wahi, Shah, Schrock, Rosemurgy, & Goldin, 2009; B. M. Wolpin et al., 2009; Brian M. Wolpin et al., 2010; WOOD et al., 2006). However, it must be noted that some of these risk factors were identified in small-sample case-control studies with inevitable selection bias (McGuigan et al., 2018). The following sections explore the risk factors identified in the academic literature.
Sex
Incidence rates vary between sexes. The worldwide age-standardized incidence rate has been reported as 5.5% for males and 4.0% for females (McGuigan et al., 2018). In developed countries, the difference is more pronounced. This disparity may be attributed to different levels of exposure to other risk factors, such as smoking and smokeless tobacco use. Notably, a systematic review of 15 PDAC studies concluded that reproductive factors were not associated with pancreatic cancer in women (Wahi et al., 2009).
Age
The incidence rate for pancreatic cancer is positively correlated with age (McGuigan et al., 2018): 90% of pancreatic cancer patients are over 55 years of age (Midha et al., 2016; WOOD et al., 2006). The incidence rate peaks at different ages in different countries. In the United States, the majority of newly diagnosed patients are in their seventh decade of life, while in India the disease typically peaks among patients in their sixth decade (McGuigan et al., 2018; Midha et al., 2016).
Blood group
In a meta-analysis, Wolpin et al. found that, compared to people with blood type O, individuals with other blood types have a higher risk of developing pancreatic adenocarcinoma: A (HR: 1.32, 95% CI: 1.02-1.72), B (HR: 1.72, 95% CI: 1.25-2.38), and AB (HR: 1.51, 95% CI: 1.02-2.23) (B. M. Wolpin et al., 2009). This finding was confirmed by a follow-up epidemiological study (Brian M. Wolpin et al., 2010). It has been hypothesized that differences in inflammatory state across ABO groups and alterations in glycosyltransferase specificity may explain these disparities (McGuigan et al., 2018; Brian M. Wolpin et al., 2010).
Gut microbiota
Memba et al. found that people with lower levels of Neisseria elongata and Streptococcus mitis, and higher levels of Porphyromonas gingivalis and Granulicatella adiacens, had a higher risk of developing pancreatic cancer (Memba et al., 2017). However, confounding variables cannot be ruled out, and further studies are required to validate these findings (McGuigan et al., 2018).
Family History
Among all pancreatic cancer patients, 5 to 10% have two or more first-degree relatives who were previously diagnosed with pancreatic cancer (Hruban, Canto, Goggins, Schulick, & Klein, 2010). Compared to an individual with no family history, a person with one first-degree relative with pancreatic cancer faces an 80% increase in the risk of developing pancreatic cancer (RR: 1.8, 95% CI: 1.48-2.12) (Permuth-Wey & Egan, 2009). An individual with three or more first-degree relatives previously diagnosed with PDAC has a thirty-two-fold higher risk of developing pancreatic cancer (Becker, Hernandez, Frucht, & Lucas, 2014).
Diabetes
Stevens et al. found that patients with type I diabetes are twice as likely to develop pancreatic cancer as patients without diabetes (RR: 2.00, 95% CI: 1.37-3.01) (Stevens, Roddam, & Beral, 2007). Similarly, for patients with type II diabetes, the odds ratio is 1.82 (95% CI: 1.66-1.89) (Huxley, Ansary-Moghaddam, Berrington de González, Barzi, & Woodward, 2005). Nevertheless, it must be noted that PDAC itself can cause diabetes; hence, it is important to control for confounding variables when investigating risk factors.
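The relative risks quoted in these sections can be reproduced from a standard 2×2 contingency table. The sketch below is illustrative only; the cohort counts are hypothetical, chosen so the point estimate matches an RR of 2.00, and are not data from any cited study:

```python
import math

def relative_risk(a, b, c, d):
    """Relative risk with a 95% CI for a 2x2 table:
    exposed group: a cases, b non-cases; unexposed group: c cases, d non-cases."""
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(RR) under the usual log-normal approximation
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Hypothetical cohort: 30 of 1000 diabetic patients vs 15 of 1000 controls develop PDAC
rr, lo, hi = relative_risk(30, 970, 15, 985)
print(f"RR = {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # RR = 2.00
```

A confidence interval whose lower bound exceeds 1.0 (as in the cited diabetes studies) indicates a statistically significant increase in risk at the 5% level.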
1.1.3 Diagnostic biomarkers
Pancreatic cancer patients are often diagnosed at late stages, when the tumor is no longer resectable, resulting in low survival rates. Consequently, early detection of pancreatic cancer is critical for effective treatment and management. Although several biomarkers have been identified, none is an ideal candidate due to a variety of limitations (Loosen, Neumann, Trautwein, Roderburg, & Luedde, 2017). Table 1.1 below lists these biomarkers and their performance, where sensitivity is defined as the probability of a positive test given that the patient has the disease, and specificity is the probability of a negative test in a healthy person. Because early diagnosis is essential for successful PDAC treatment and is associated with the prognosis of PDAC patients, a review of the literature in this field is warranted. Details of these biomarkers are discussed in the following paragraphs.
Table 1.1: List of available biomarkers and their performance for PDAC diagnosis
Biomarker Sensitivity (%) Specificity (%) Reference
CA19-9 81 81 (Y Zhang et al., 2015)
CA50 71.1 93.5 (Liao et al., 2007)
CA72-4 63.4 75.2 (WU et al., 2006)
CA125 66.8 83.3 (Jiang, Tao, & Zou, 2004)
CA242 67.8 83 (Y Zhang et al., 2015)
CEA 39.5 81.3 (Y Zhang et al., 2015)
MIC-1 79.0 86.0 (Y.-Z. Chen et al., 2014)
PAM4 76.0 85.0 (David V. Gold et al., 2013)
miR-21 90.0 66.7 (J.-Y. Yang et al., 2014)
miR-155 76.7 73.3 (J.-Y. Yang et al., 2014)
miR-143 and miR-30e 83.3 96.2 (J.-Y. Yang et al., 2014)
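The sensitivity and specificity figures in Table 1.1 follow directly from a test's confusion matrix. A minimal sketch, using hypothetical counts chosen only to reproduce the CA19-9 row:

```python
def sensitivity(tp, fn):
    """P(positive test | disease present) = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """P(negative test | disease absent) = TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical screen of 100 PDAC patients and 100 healthy controls
print(sensitivity(tp=81, fn=19))  # 0.81, as in the CA19-9 row
print(specificity(tn=81, fp=19))  # 0.81
```

Note that a screening biomarker needs both values to be high: high sensitivity limits missed cancers, while high specificity limits false alarms in the much larger healthy population.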
CA19-9 and other carbohydrate antigens
CA19-9 (carbohydrate antigen 19-9), also known as sialyl-Lewis A, is the only biomarker approved by the FDA for PDAC diagnosis (Goonetilleke & Siriwardena, 2007). However, CA19-9 has several limitations for PDAC diagnosis (Loosen et al., 2017). First, the serum level of CA19-9 not only suggests the presence of PDAC but may also indicate other medical conditions, including pancreatitis, obstructive jaundice, acute cholangitis, and liver cirrhosis (Ballehaninna & Chamberlain, 2012; Perkins, Slater, Sanders, & Prichard, 2003; Satake, Kanazawa, Kho, Chung, & Umeyama, 1985; Steinberg, 1990). Second, CA19-9 has a median sensitivity and specificity of 75% and 77%, respectively, for PDAC diagnosis, indicating that it does not qualify as an accurate screening biomarker (Y Zhang et al., 2015). Last, approximately 5%-10% of Caucasians have a Lewis-null blood type that does not produce CA19-9, further limiting the use of CA19-9 as a screening tool (Goonetilleke & Siriwardena, 2007; Von Rosen, Linder, Harmenberg, & Pegert, 1993).
Other carbohydrate antigens including CEA and CA50, CA195, CA72-4, and CA125 have also shown
diagnostic potential as screening biomarkers (Bünger, Laubert, Roblick, & Habermann, 2011; Y Zhang
et al., 2015). However, the performance of these biomarkers is also limited. As shown in Table 1.1, CEA
has only 39.5% sensitivity and 81.3% specificity. Further research is required to develop robust diagnostic
screening biomarkers using carbohydrate antigens.
Non-coding RNAs
miRNAs (microRNAs) are a class of non-coding RNAs involved in post-transcriptional regulation,
targeting and degrading other RNAs with specific sequences (Bartel, 2009; Lagos-Quintana, 2001). For
many types of cancer, miRNAs are routinely used as detection biomarkers
(Hong & Park, 2014; Rosenfeld et al., 2008). For PDAC diagnosis, multiple miRNAs showed potential,
including miR-21, miR-155, miR-196a, miR-216, miR-217, and miR-210 (Bloomston et al., 2007;
Caponi et al., 2013; Dillhoff, Liu, Frankel, Croce, & Bloomston, 2008; Schultz et al., 2012; Szafranska
et al., 2007).
These biomarkers were upregulated or downregulated in pancreatic tissue and juice (Hong & Park,
2014; Link, Becker, Goel, Wex, & Malfertheiner, 2012; Sadakari et al., 2010). However, acquisition of
these biomarkers is challenging since it may involve tissue biopsy, which is not appropriate for
screening tests. Recent research from Yang et al. targeted these miRNAs in fecal specimens and, for the
miR-143 and miR-30e combination, achieved 83.3% sensitivity and 96.2% specificity (Table 1.1). Hence,
these non-invasive biomarker acquisitions have significant potential in the PDAC screening process
(J.-Y. Yang et al., 2014).
Other non-coding RNAs, including lncRNAs (MALAT-1, Gas5, MEG3, and HSATII) and snRNAs, also
have significantly different expression patterns in PDAC patients when compared to healthy individuals
(Kishikawa, 2015; Kung, Colognori, & Lee, 2013). Further studies are needed to comprehensively
evaluate the performance of these non-coding RNA biomarkers.
MIC-1
MIC-1, also known as macrophage inhibitory cytokine 1, shows significant overexpression for multiple
types of cancer (Bootcov et al., 1997; Buckhaults et al., 2001). Koopmann et al. found that, for
pancreatic cancer diagnosis, with an area under the ROC curve (AUC) of 0.99, MIC-1 performs
significantly better than CA19-9 (Koopmann, 2006). Moreover, when differentiating pancreatic cancer
from chronic pancreatitis, MIC-1 performs as well as CA19-9 (p value = 0.63) (Koopmann, 2006).
As discussed above, the usage of CA19-9 is limited in cases of Lewis-null blood type. Comparatively,
MIC-1 expression is universal. It has been shown that, among individuals in Lewis-null group, MIC-1
has a sensitivity of 63% (X. Wang et al., 2014). However, the performance of MIC-1 varies across
studies (McGuigan et al., 2018). A meta-analysis from Chen et al. showed that serum MIC-1 level has
median sensitivity and specificity of 79% and 86% respectively, significantly lower than the results
from Koopmann et al. (Y.-Z. Chen et al., 2014).
PAM4
PAM4, a new monoclonal antibody (MAb) known as clivatuzumab, is reactive with Mucin 5AC which
is expressed in pancreatic cancer and precursor lesions (D. V. Gold, Karanjawala, Modrak, Goldenberg,
& Hruban, 2007; David V. Gold, Lew, Maliniak, Hernandez, & Cardillo, 1994; D. Liu, Chang, Gold, &
Goldenberg, 2015). Gold et al. found that, in pancreatic cancer diagnosis tasks, PAM4 reached 76%
sensitivity and 85% specificity, which are significantly higher than CA19-9 (p value = 0.026) (David V.
Gold et al., 2013). Furthermore, combining PAM4 and CA19-9 produces the final model with 84%
sensitivity and 82% specificity (David V. Gold et al., 2013). Further validation in a large cohort is
expected to evaluate the potential of PAM4 as a PDAC diagnosis marker in clinical conditions.
1.1.4 Treatment
It has been found that surgical resection is the only treatment that offers a potential cure for patients with
pancreatic cancer. Surgical options for pancreatic cancer include pancreaticoduodenectomy and distal or
total pancreatectomy (McGuigan et al., 2018). However, not every tumor is resectable. The decision is
mainly based on the relationship between pancreatic cancer and the surrounding vascular structures
(Lynch et al., 2009; McGuigan et al., 2018). Hence, less than 20% of patients are candidates for surgery
since PDAC often spreads before initial diagnosis (Foucher et al., 2018).
It has been shown that, for patients undergoing surgery, adding chemotherapy improves overall
survival (Foucher et al., 2018; McGuigan et al., 2018). A recent study found that patients who received
adjuvant chemotherapy in addition to surgery had significantly longer median survival than
patients who underwent surgery alone (André et al., 2015). Given that, identifying patients with aggressive
tumors and offering aggressive treatments is important. Hence, it would be beneficial to provide accurate
prognoses for resectable PDAC patients, and that is the goal of this study.
On the other hand, chemotherapy is the main option for patients with advanced and metastatic PDAC
(Foucher et al., 2018; McGuigan et al., 2018). It has been shown that chemotherapy can increase
survival and relieve cancer-related symptoms (Adamska, Domenichini, & Falasca, 2017b). Currently,
clinicians have several chemotherapy options for patients with pancreatic cancer, including
Gemcitabine/Abraxane and FOLFIRINOX. However, these therapies are not effective for all patients. Biomarkers
are needed to develop personalized treatment plans for PDAC patients. These personalized treatment
plans may improve patients' quality of life as well as lower expenses.
1.1.5 Biomarker for chemotherapy response
FOLFIRINOX biomarker
FOLFIRINOX is a common chemotherapy regimen for PDAC patients. It consists of leucovorin,
irinotecan, oxaliplatin, and 5-FU (Adamska et al., 2017b; Loosen et al., 2017). It has been shown to be
especially effective for patients with metastatic pancreatic cancer (Adamska et al., 2017b). However,
due to the systemic toxicity of this therapy, its usage is limited in elderly patients (Conroy et al., 2011;
"FOLFIRINOX versus Gemcitabine for Metastatic Pancreatic Cancer," 2011; Gourgou-Bourgade et al.,
2013). Consequently, biomarkers are needed so that clinicians can identify patients who will benefit
from FOLFIRINOX treatment.
In genomic studies, it has been found that patients with inactivation of BRCA1, BRCA2, and PALB2
have better responses to FOLFIRINOX regimen (Waddell et al., 2015). Moreover, high expression
levels of CES2 (carboxylesterase 2) in pancreatic cancer tissue are positively associated with survival
among patients receiving FOLFIRINOX (Capello et al., 2015). Follow-up studies are needed to further
validate these findings.
Gemcitabine/Abraxane markers
For pancreatic cancer patients, another chemotherapy option is Gemcitabine plus Abraxane, which has
been used since 1997 (Kamisawa, Wood, Itoi, & Takaori, 2016). Due to its hydrophilic nature,
gemcitabine diffuses into cells poorly. Thus, the activity levels of nucleoside transporters are important
predictors of patients' responses to gemcitabine. The relevant transporters are concentrative nucleoside
transporters (CNT) and equilibrative nucleoside transporters (ENT) (Farrell et al., 2009; Yamada et al.,
2016).
CNT transports gemcitabine across the cell membrane using the sodium gradient (Greenhalf et al., 2014).
It has been found that, among patients taking gemcitabine, individuals with high human CNT expression
have higher survival rates than patients with lower CNT expression levels (p value = 0.028)
(Marechal et al., 2009).
The equilibrative nucleoside transporter (ENT) is another potential predictive marker for gemcitabine
response. It has been found that cell lines with high ENT expression have higher sensitivity to
gemcitabine in several in vitro studies (Spratlin, 2004). Further, multiple large-scale clinical studies
confirmed that patients with ENT expression have significantly longer median survival (Farrell et
al., 2009; Greenhalf et al., 2014).
Another potential biomarker for gemcitabine is deoxycytidine kinase, which converts gemcitabine into
its active form (Loosen et al., 2017). In a small cohort study, high deoxycytidine kinase expression was
found to have a positive association with the duration of disease-free survival (Fujita et al., 2010;
Sebastiani, 2006). Undoubtedly, further studies are needed to validate and assess the performance of
these biomarkers.
1.1.6 Prognostic markers
Prognostic markers are specific patient characteristics that can be utilized to predict the course of a
disease (McGuigan et al., 2018). Robust and valid prognostic markers can help healthcare professionals
design optimal treatment plans in patients' best interests. Several prognostic markers have been found
in PDAC, and their details are discussed below.
ECOG performance status
ECOG (Eastern Cooperative Oncology Group) performance status is a well-established prognostic
marker for different types of cancers, including PDAC (Sørensen, Klee, Palshof, & Hansen, 1993). The
grading criteria are listed in Table 1.2 below. It has been found that patients with a high ECOG
performance status grade may not benefit from combined chemotherapies (Louvet et al., 2005;
Peixoto et al., 2015).
Table 1.2: List of ECOG criteria (Oken et al., 1982)
Grade ECOG Performance status
0 Fully active, able to carry on all pre-disease performance without restriction
1 Restricted in physically strenuous activity but ambulatory and able to carry out work of
a light or sedentary nature, e.g., light housework, office work
2 Ambulatory and capable of all self-care but unable to carry out any work activities; up
and about more than 50% of waking hours
3 Capable of only limited self-care; confined to bed or chair more than 50% of waking
hours
4 Completely disabled; cannot carry on any self-care; confined to bed or chair
5 Dead
SPARC
SPARC (secreted protein acidic and rich in cysteine), also known as osteonectin, is a calcium-binding
glycoprotein. SPARC is involved in several cellular processes, including cell differentiation and
proliferation (McGuigan et al., 2018). Studies showed that SPARC expression has a negative association
with survival (C.-S. Wang, Lin, Chen, Chan, & Hsueh, 2004; Watkins, Douglas-Jones, Bryce, E Mansel,
& Jiang, 2005; K. Yamashita, Upadhay, Mimori, Inoue, & Mori, 2003). Additionally, Infante et al.
demonstrated that the location of SPARC is a prognostic biomarker for PDAC (Infante et al., 2007).
Patients with SPARC-negative stroma have significantly longer median survival than patients with
SPARC-positive stroma (p value < 0.001) (Loosen et al., 2017).
CA19-9
As discussed above, CA19-9 is a potential diagnostic marker. Moreover, high CA19-9 level also has a
negative association with survival duration (Ballehaninna & Chamberlain, 2012). In a recent study
using a univariate Cox Proportional Hazards Model for overall survival, CA19-9 had a hazard ratio of
1.37 with a 95% confidence interval from 1.00 to 1.88 (G. Luo et al., 2017). However, as discussed
above, CA19-9 has several drawbacks which limit its applications. Moreover, the prognostic
performance of CA19-9 is far from ideal.
Quantitative Image biomarkers
As a non-invasive tool, CT is commonly used in PDAC diagnosis and management (Adamska et al.,
2017b). It is used to assess stage and resectability. CT is also utilized to assess response to systemic
therapies. Nevertheless, beyond the RECIST criteria, quantitative measurements are not routinely used.
Recently, it has been found that several quantitative imaging features are associated with PDAC
prognosis for resectable patients. Eilaghi et al. found that the quantitative imaging features
"Dissimilarity" and "Inverse difference normalized" are associated with patients' overall survival
(Eilaghi et al., 2017). A recent multi-cohort study from Khalvati et al. confirmed the potential of
quantitative imaging features in PDAC prognosis. It has been shown that "Original_glcm_SumEntropy"
and "squareroot_glcm_ClusterTendency" are associated with overall survival in resectable
PDAC patients (Khalvati, Zhang, Baig, et al., 2019). Quantitative imaging biomarkers have shown
substantial potential in PDAC prognosis. The analytic pipeline of quantitative imaging biomarkers will
be discussed in the following sections of this chapter.
1.2 Radiomics: analysis of quantitative imaging markers
1.2.1 Introduction
Modern medicine is moving towards personalized medicine, where diagnosis, treatment, and prognosis
of the disease are modified for each patient. In clinical practice, radiology plays a critical role in
providing valuable information for physicians to detect, differentiate and diagnose abnormal conditions
in patients (Yip & Aerts, 2016). Radiological images contain a vast amount of information on lesions,
including shape and texture. However, human interpretation of medical images alone is potentially
biased and often fails to discover the entirety of potentially informative data.
Radiomics is a new field of study, which aims to discover and translate this un-decoded information
from medical images (V. Kumar et al., 2013). Radiomics is defined as the extraction and analysis of a
large number of quantitative features from the medical images. These features can offer comprehensive
information on texture, intensity, heterogeneity, and morphology (van Griethuysen et al., 2017).
Studying these features, researchers have found that many features have significant associations with
clinical outcomes and gene-expression levels (Yiming Li, Qian, et al., 2018; Papp et al., 2018). These
features can be further used to develop diagnostic or prognostic models which may serve as tools for
personalized diagnosis and clinical decision support systems.
By capturing the entire tumor site, radiomics features have the distinct advantage of assessing tissue
heterogeneity (Gillies, Kinahan, & Hricak, 2015). Other clinical procedures such as biopsy only capture
a small fraction of tumors, having significant chances of missing the index tumor (Khalvati, Zhang,
Wong, & Haider, 2019). Hence, it is a challenging task to get a comprehensive mapping of the tumor
using traditional approaches, leading to misinterpretations and non-optimal clinical decisions.
Comparatively, with the ability of “reading” tumors through 3D or 2D images, radiomics could
potentially overcome this challenge (van Griethuysen et al., 2017). In the past decade, radiomics studies
have been conducted on multiple diseases, including different types of cancers. Through these studies,
radiomics has shown its potential in disease diagnosis, prognosis, and prediction of treatment responses
(Keek, Leijenaar, Jochems, & Woodruff, 2018). Details about traditional radiomics analytics pipeline
will be discussed in the following sections.
Part of this section is modified from:
Zhang, Y. et al., Radiomics-based Prognosis Analysis for Non-Small Cell Lung Cancer, Sci Rep, 2017.
1.2.2 Pipeline
The radiomics analytics pipeline consists of several stages (Khalvati, Zhang, Wong, et al., 2019). Figure 1.1
below shows a typical pipeline for a radiomics study. First, raw images are pre-processed and segmented
to annotate Regions Of Interest (ROIs) such as cancerous regions (tumors) (Yucheng Zhang,
Oikonomou, Wong, Haider, & Khalvati, 2017). This is usually done manually or through automatic
segmentation algorithms. Next, a large number of quantitative imaging features are extracted from these
ROIs (Yucheng Zhang et al., 2017). Last, endpoint data (i.e., clinical outcomes such as disease
recurrence) is entered into the database, providing information for feature selection and model building
process.
Figure 1.1: Traditional Radiomics Pipeline
As mentioned above, radiomics studies require raw medical images and patients' outcomes as input.
Images come from different modalities including computed tomography (CT), magnetic resonance
imaging (MRI), and positron emission tomography (PET). Raw images from these modalities are often
saved as DICOM (Digital Imaging and Communications in Medicine) files, which contain the images
and “header” information. Currently, most scientific programming languages can read DICOM images
using specific packages or modules, enabling the further steps in the radiomics analytics pipeline.
1.2.3 Segmentation
As the first step, segmentation provides the ROIs' boundaries, which typically delineate the lesions
presented in the medical images. In addition, it has been found that segmenting not only the lesions but
also the peripheral zones around them can boost performance (Hambarde et al., 2019). As discussed
above, segmentation of the lesions is usually performed manually by radiologists. This is not only time-
consuming but also introduces undesirable variations (Owens et al., 2018). It has been found that several
radiomics features are sensitive to variations in segmentation (Owens et al., 2018). A recent multi-reader
study confirmed that some radiomics features have low inter-reader reliability (Khalvati, Zhang, Baig, et
al., 2019). These findings show the critical need for developing reliable and automated segmentation
methods. Although radiologists' contours are still considered the gold standard, automated segmentation
methods have been developed rapidly in the past few years. Details of these segmentation methods are
discussed below (Litjens, Kooi, Bejnordi, Setio, et al., 2017; Oktay et al., n.d.; Razzak, Naz, & Zaib,
n.d.).
Traditional thresholding-based segmentation uses pre-defined thresholds (Abdullah, Hambali, & Jamil,
2012). Pixels which have higher or lower values than the threshold are selected and labelled. This
approach needs prior knowledge of the image information and modality (Litjens, Kooi, Bejnordi, Setio,
et al., 2017). With an accurate threshold, segmentation could achieve acceptable performance on
“simple” tasks including lung and bone segmentation (Owens et al., 2018).
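As a concrete illustration, the thresholding approach described above can be sketched in a few lines of NumPy. The toy image and threshold value here are illustrative assumptions, not values from the cited studies.

```python
import numpy as np

def threshold_segment(image, threshold):
    """Label every pixel whose intensity exceeds a pre-defined threshold."""
    return image > threshold  # boolean ROI mask

# Toy 2D "slice": a bright 4x4 lesion (intensity 200) on a dark background (50)
image = np.full((8, 8), 50)
image[2:6, 2:6] = 200

mask = threshold_segment(image, threshold=100)
print(mask.sum())  # 16 pixels labelled as ROI
```

In practice, the edge-based and region-based modifications described next are applied on top of such an initial mask to smooth boundaries and remove outlier pixels.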
Although this method is intuitive, performance is limited due to the complex nature of human tissues.
Modern threshold-based segmentation often contains edge-based, region-based, or hybrid modifications
after the initial segmentation (Leo, Lim, & Suneetha, 2009; Sharma et al., 2010). Edge-based
modification methods use non-maximal suppression and hysteresis thresholding to suppress pixels
which are potential outliers, smoothing the boundary and eliminating holes inside the ROI.
In region-based segmentation, a radiologist provides an initial seed point inside the lesion. Then, all the
pixels adjacent to the point that have similar intensities are selected and labelled (Junfeng &
Yunyang, 2012). Although it has been shown that the region-based approach has superior performance
in tumor segmentation, this approach needs additional manual inputs, and extensive pre-processing,
which limit its applications. Additionally, since these segmentation algorithms solely depend on pixel
intensities, artifacts and partial volume effects have significant impacts on their results. Consequently,
due to these limitations, researchers have started to investigate other segmentation methods including
deep learning based segmentation which will be discussed in the following sections (Kaur & Kaur,
2014).
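A minimal sketch of the region-growing idea, assuming a toy 2D image, 4-connected neighbours, and a simple intensity tolerance; clinical implementations add the pre-processing and artifact handling discussed above.

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tolerance):
    """Grow a region from a radiologist-provided seed point, adding 4-connected
    neighbours whose intensity is within `tolerance` of the seed intensity."""
    mask = np.zeros(image.shape, dtype=bool)
    seed_value = image[seed]
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                    and not mask[nr, nc]
                    and abs(int(image[nr, nc]) - int(seed_value)) <= tolerance):
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

# Toy image: a bright 3x3 lesion (intensity 200) embedded in background (50)
image = np.full((8, 8), 50)
image[2:5, 2:5] = 200
lesion_mask = region_grow(image, seed=(3, 3), tolerance=20)
print(lesion_mask.sum())  # 9: only the connected bright pixels are labelled
```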
1.2.4 Feature extraction
The second step of the radiomics pipeline is feature extraction. In general, features can be categorized
into two groups, "semantic" and "agnostic" (Gillies et al., 2015). Semantic features are commonly used
by radiologists, including size, shape, location, vascularity, and attachments. In contrast, agnostic
features are quantitative features describing the texture and intensity distributions (Khalvati, Zhang,
Wong, et al., 2019). Table 1.3 below summarizes common features in recent radiomics studies.
Table 1.3: List of common features (Lambin et al., 2017a)
Semantic: shape, location, vascularity
First order: mean, median, entropy
Second order: heterogeneity, Haralick textures
Higher order: fractal dimensions, wavelets, Laplacian
Researchers have worked for decades to expand the feature banks and extract more useful information
from medical images (Aerts et al., 2014). During the past decade, the size of a typical radiomics feature
bank has expanded from less than one hundred to more than few thousands (van Griethuysen et al.,
2017). A more comprehensive feature bank helps to identify more quantitative imaging markers for
diagnosis and prognosis (Aerts et al., 2014; V. Kumar et al., 2013; Parekh & Jacobs, 2016). At the same
time, a higher number of features increases the complexity of the feature map and induces the danger of
false positives or overfitting (Yucheng Zhang et al., 2017).
In addition, researchers have often developed in-house feature banks based on different programming
languages, including Python, MATLAB, or C++. Although the features' names are the same, it is
common that the formulas differ slightly, making studies irreproducible (Khalvati, Zhang, Baig,
et al., 2019). PyRadiomics, as an open source feature extraction tool, was developed to address these
challenges (van Griethuysen et al., 2017). It enables basic pre-processing and provides a comprehensive
feature bank for researchers in the radiomics field (Lambin et al., 2017; van Griethuysen et al., 2017).
Currently, the PyRadiomics library implements 120 features (van Griethuysen et al., 2017). These
features can be extracted from the original image, or from images derived through filters (e.g., high-pass
or low-pass filters). The PyRadiomics library has 19 first order features, including energy,
total energy, entropy, minimum, 10th percentile, 90th percentile, maximum, mean, median, interquartile
range, range, mean absolute deviation, robust mean absolute deviation, root mean squared, standard
deviation, skewness, kurtosis, variance, and uniformity (van Griethuysen et al., 2017). These features
describe the distribution of pixel intensities in the ROI. Among these first order features, entropy, which
is a measurement of randomness in image values, has been found to be significantly associated with
overall survival of cancer patients in multiple studies (Ganeshan, Abaleke, Young, Chatwin, & Miles,
2010; Ganeshan, Panayiotou, Burnand, Dizdarevic, & Miles, 2012; Y. Huang et al., 2016; Yucheng
Zhang et al., 2017).
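For illustration, first-order entropy can be computed directly from the ROI's intensity histogram. The bin count below is an arbitrary assumption, and PyRadiomics' implementation differs in its discretization details.

```python
import numpy as np

def first_order_entropy(roi_values, bins=16):
    """Shannon entropy of the ROI intensity histogram (a first-order feature)."""
    counts, _ = np.histogram(roi_values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                       # drop empty bins (0 * log 0 is taken as 0)
    return -np.sum(p * np.log2(p))

homogeneous_roi = np.zeros(100)                          # perfectly uniform tissue
heterogeneous_roi = np.random.default_rng(0).random(100)

print(first_order_entropy(homogeneous_roi))    # zero entropy: no randomness
print(first_order_entropy(heterogeneous_roi))  # higher value: more heterogeneity
```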
Additionally, PyRadiomics provides formulas for 75 texture features, including Sum Entropy and
Cluster Tendency, which have been shown to be significantly associated with PDAC prognosis
(Khalvati, Zhang, Baig, et al., 2019).
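As a sketch of how such texture features arise, the following computes a Gray-Level Co-occurrence Matrix (GLCM) for a single horizontal offset and its Sum Entropy on a toy, pre-discretized image. This is a deliberately minimal illustration; PyRadiomics aggregates multiple offsets and handles discretization and symmetry differently.

```python
import numpy as np

def glcm(image, levels):
    """GLCM for the horizontal neighbour offset (0, 1), as joint probabilities."""
    m = np.zeros((levels, levels))
    for i, j in zip(image[:, :-1].ravel(), image[:, 1:].ravel()):
        m[i, j] += 1
    return m / m.sum()

def sum_entropy(p):
    """Sum Entropy: entropy of p_{x+y}(k), the probability that i + j = k."""
    levels = p.shape[0]
    index_sums = np.add.outer(np.arange(levels), np.arange(levels))
    px_plus_y = np.array([p[index_sums == k].sum() for k in range(2 * levels - 1)])
    nz = px_plus_y[px_plus_y > 0]
    return -np.sum(nz * np.log2(nz))

# Toy 4x4 image already discretized to 4 gray levels
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [2, 2, 3, 3],
                  [2, 2, 3, 3]])
p = glcm(image, levels=4)
print(round(sum_entropy(p), 3))  # log2(6) ≈ 2.585: six equiprobable GLCM entries
```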
1.2.5 Feature analysis and model building
Using open source libraries such as PyRadiomics, researchers are able to extract thousands of features
from a given ROI (van Griethuysen et al., 2017). Following that, the third step of the radiomics analytic
pipeline is feature analysis and model building (Parmar, Grossmann, Bussink, Lambin, & Aerts, 2015;
Yucheng Zhang et al., 2017). Although a vast number of quantitative features can be extracted from
medical images, many of them are simply noise, or highly correlated with other features (Yip & Aerts,
2016). Hence, feature reduction is critical to select useful and unique features, minimizing the
computational cost while increasing the prediction accuracy (Yucheng Zhang et al., 2017).
In general, feature reduction procedures can be categorized as supervised or unsupervised methods
(Parmar, Grossmann, et al., 2015). In supervised feature selection, such as filtering feature selection,
features are selected based on their discriminative value of outcomes. Conventional supervised feature
selection methods include parametric or semi-parametric tests such as t-tests, U-tests, and the Cox
Proportional Hazards Model (Yucheng Zhang et al., 2017). For binary outcomes, researchers often
compare the distribution of features for positive and negative groups such as disease recurrence and non-
recurrence groups. If these two groups have a significant difference in terms of feature value, then the
feature will be considered useful. Based on different types of outcomes and assumptions (binary or
multinomial, normal distribution or non-normal distribution), ANOVA, t-tests, or Wilcoxon U tests are
applied accordingly (Yucheng Zhang et al., 2017). In early radiomics studies, many researchers failed to
check the assumptions of these tests (Coroller et al., 2015a). Furthermore, although these tests are
straightforward, the multiple testing problem is inevitable with fast-growing feature space (Yip & Aerts,
2016). Consequently, these limitations restrict the applications of supervised feature selection methods.
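To make filtering feature selection and the multiple testing problem concrete, the sketch below compares groups with a permutation test on the difference in means (avoiding the distributional assumptions of a t-test) and applies a Bonferroni-corrected threshold. The feature matrix and group labels are synthetic, illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(x, y, n_perm=2000):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(perm[:len(x)].mean() - perm[len(x):].mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy feature matrix: 40 patients x 3 features; only feature 0 actually differs
# between the recurrence (first 20 rows) and non-recurrence (last 20 rows) groups.
X = rng.normal(size=(40, 3))
X[:20, 0] += 2.0
recurrence, no_recurrence = X[:20], X[20:]

alpha = 0.05 / X.shape[1]   # Bonferroni correction for testing 3 features
selected = [f for f in range(X.shape[1])
            if permutation_p_value(recurrence[:, f], no_recurrence[:, f]) < alpha]
print(selected)  # feature 0 should survive the corrected threshold
```

With thousands of radiomics features instead of three, the corrected threshold becomes correspondingly stricter, which is exactly the limitation noted above.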
In contrast, unsupervised feature reduction is based on dimensionality reduction algorithms, maintaining
more information in the dataset. Among non-filtering feature selection methods, Principal Component
Analysis (PCA) is the most popular approach. It selects a small number of uncorrelated variables, called
“principal components”, which could explain most of the variation in the data (Abdi & Williams, 2010).
A similar approach is called Independent Component Analysis (ICA), which removes not only
correlations among the variables, but also higher-order dependencies. Other common unsupervised
feature selection methods are zero variance (ZV) and near zero variance (NZV). These two algorithms
remove features with zero or near zero variance (Kuhn, 2008). In radiomics studies, NZV and ZV are
particularly practical. When the ROI is very small (e.g., 4 pixels), the open source libraries might fail to
extract meaningful features, resulting in columns of zeros or missing values. In this condition, ZV and
NZV methods are extremely valuable since they can efficiently remove those features (Yucheng Zhang
et al., 2017).
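A minimal sketch of these unsupervised steps: removing (near-)zero-variance columns and then projecting the remaining features onto principal components via SVD. The synthetic feature matrix is illustrative.

```python
import numpy as np

def remove_zero_variance(X, tol=1e-12):
    """Drop feature columns with (near-)zero variance, e.g. columns of zeros
    produced when extraction fails on a tiny ROI."""
    keep = X.var(axis=0) > tol
    return X[:, keep], keep

def pca_scores(X, n_components):
    """Project the centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X[:, 2] = 0.0          # a failed-extraction column (zero variance)
X[:, 4] = X[:, 0]      # a perfectly correlated duplicate feature

X_kept, keep = remove_zero_variance(X)
print(keep)                          # column 2 is dropped
print(pca_scores(X_kept, 2).shape)   # (30, 2): uncorrelated component scores
```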
After selecting useful features, model building is the last step in the traditional radiomics analytics
pipeline. Radiomics-based prognosis models utilize the quantitative imaging features for predictions of
outcomes (e.g. Survival vs. Death) (Parmar, Grossmann, et al., 2015; Yucheng Zhang et al., 2017). In
the machine learning domain, classification is considered as a supervised learning task of inferring a
function from labelled training data (Yucheng Zhang et al., 2017). The classification algorithm analyzes
the training data and outcomes (labels), minimizing the loss function and building predictive models.
Common classification models in radiomics studies include Random Forest and generalized linear
model. The Random Forest model is generally developed by building hundreds of small decision trees
(Breiman, 2001; Hawkins et al., 2016). Each decision tree receives a subset of the full data. Under this
condition, although each tree has limited predictive power, the ensemble of trees gains the ability to
classify outcomes. The Random Forest model has several advantages. For most classification tasks,
Random Forest works well without tuning any parameters (Parmar, Grossmann, et al., 2015).
Additionally, due to the subsampling, Random Forest tends not to overfit. Finally, a Random Forest
model can handle not only linear features but also non-linear or categorical features, making it suitable
for radiomics studies.
However, since training a Random Forest is effectively a black-box process, logistic regression is often
preferred as an intuitive classification method (Fernández-Delgado, Cernadas, Barro, Amorim, &
Amorim Fernández-Delgado, 2014; H. Wang et al., 2010). Ordinary linear regression fails to model the
probabilities of binary outcomes, since probabilities are bounded between 0 and 1. As a type of
generalized linear model, logistic regression performs classification by applying a logit transformation
to the probability, extending its range (Sperandei, 2014). In general, logistic regression is easy to
understand and requires less data to achieve acceptable performance. Given that, researchers often
choose between Random Forest and a generalized linear model when building a radiomics-based
classification model.
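To make the logit idea concrete, the following is a minimal logistic regression trained by gradient descent on synthetic one-feature data; it is a sketch under these assumptions, not the implementation used in the cited studies.

```python
import numpy as np

def sigmoid(z):
    # inverse of the logit: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Minimal logistic regression trained by gradient descent on the log-loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

# Synthetic data: a single feature separates recurrence (1) from non-recurrence (0)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1))
y = (X[:, 0] > 0).astype(float)

w, b = fit_logistic(X, y)
probabilities = sigmoid(X @ w + b)    # always strictly within (0, 1)
accuracy = ((probabilities > 0.5) == y).mean()
print(round(accuracy, 2))
```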
It is worth noting that many clinical outcomes have unbalanced ratios (e.g., survival outcomes for cancers
with poor prognosis), which do not meet the assumption of balanced endpoints for most machine
learning algorithms. To tackle this problem, subsampling methods, including down-sampling, up-
sampling, and Synthetic Minority Over-sampling Technique (SMOTE), are applied (Blagus et al.,
2013). The down-sampling method removes "majority" cases during model training, while the
up-sampling method duplicates minority cases. These two methods are intuitive but either lose
information or create a "non-universal decision region" since the generated data points are duplicates. It
has been shown that in radiomics studies, these two methods are not beneficial for the prognosis models
(Yucheng Zhang et al., 2017).
On the other hand, as an enhanced sampling method, SMOTE creates “simulated samples” based on
Euclidean distance in the feature space (Blagus et al., 2013). As a result, the synthetic cases have attributes with
values similar to the existing cases and are not merely replications as provided by oversampling. Thus,
SMOTE can effectively increase the representation of the minority class while reflecting the structure of
the original samples. Zhang et al. showed that, in radiomics based prognosis model, adding SMOTE will
significantly improve the model’s performance (Yucheng Zhang et al., 2017).
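A minimal SMOTE sketch along the lines described above: each synthetic case is interpolated between a minority case and one of its k nearest neighbours under Euclidean distance. The parameters and data are illustrative assumptions.

```python
import numpy as np

def smote(minority, n_synthetic, k=3, rng=None):
    """Minimal SMOTE: each synthetic case lies on the line segment between a
    minority case and one of its k nearest (Euclidean) minority neighbours, so
    attribute values are similar to, but not duplicates of, existing cases."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        distances = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]   # skip the case itself
        j = rng.choice(neighbours)
        gap = rng.random()                            # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 4))   # 10 minority-class patients, 4 features
new_cases = smote(minority, n_synthetic=15, rng=rng)
print(new_cases.shape)  # (15, 4)
```

Because every synthetic case is a convex combination of two real minority cases, each attribute stays within the range observed in the minority class, reflecting the structure of the original samples.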
1.2.6 Current progress
In the following sections, recent radiomics studies for cancer diagnosis, prognosis, or treatment response
will be discussed.
Lung Cancer
A large number of representative radiomics studies are based on lung cancer. For lung cancer diagnosis,
Kumar et al. trained a radiomics feature based classification model using CT images and achieved
sensitivity and specificity of 79.6% and 76.1% respectively (D. Kumar et al., 2015). In another study,
researchers trained a radiomics feature based classification model using low-dose CT to predict
malignant nodules, achieving an accuracy of 80% (Hawkins et al., 2016; Y. Liu et al., 2017). The
features from CT images also showed significant association with TNM staging. In a study of 1019
patients, Aerts et al. found that, 238 features have associations with cancer staging (Aerts et al., 2014;
Parmar, Leijenaar, et al., 2015). A recent publication from Zhou et al. confirmed these findings (H. Zhou
et al., 2018). Also, radiomics studies for lung cancer are not limited to CT images but extend to PET/CT
as well. Wu et al. found that features from PET images are also associated with cancer
staging (Wu et al., 2016).
Head and neck
Similar to lung cancer, several radiomics studies were conducted for head-and-neck cancer. It has been
found that CT and MR based features have significant associations with staging in head and neck
cancer, primarily features from contrast-enhanced T1-weighted (T1w) MR and T2-weighted (T2w) MR
images (Ren et al., 2018; Z. Zhou et al., 2018). For Nasopharyngeal Cancer (NPC), it has been found
that radiomics features derived from MRI show prognostic values (B. Zhang et al., 2017a).
Additionally, a few features also showed significant associations with patients' responses to chemotherapy
and radiotherapy (Gabryś, Buettner, Sterzing, Hauswald, & Bangert, 2018). Further research showed
that, compared to traditional models using clinical factors, radiomics models have better prognostic
performance for patients with high-grade osteosarcoma. These findings further confirmed the potential
of radiomics in translational research and precision medicine (B. Zhang et al., 2017b).
Brain tumors
In brain tumors, several genetic markers have been shown to be associated with prognosis, including P53, ATRX, and
MGMT (Kickingereder et al., 2018; Yiming Li, Liu, et al., 2018; Yiming Li, Qian, et al., 2018; Xi et al.,
2018). A recent study found that, adding radiomics features to the genomics prognosis model further
improved its performance (Itakura et al., 2015). As a quantitative description of tumors’ phenotypes,
some radiomics features are also associated with these genomic markers (Itakura et al., 2015). Bai et al.
showed that, even without genomic information, radiomics features alone can provide accurate
staging in brain tumors (Bai et al., 2016). Recent studies confirmed these findings for glioma
prognosis using features from PET and MRI (Papp et al., 2018; Pérez-Beteta et al., 2018).
Colorectal cancer
Determining genetic mutation is an essential step in colorectal cancer management as stated in NCCN
(National Comprehensive Cancer Network) guideline (Benson et al., 2018). However, genetic testing
has an extra cost and introduces unfavourable waiting times for cancer patients. It has been shown that
radiomics features may address this problem. Using radiomics features from preoperative CT images, Yang
et al. built a classifier for these genetic mutations with an AUC of 0.87 (L. Yang et al., 2018). With larger
sample sizes and multi-cohort validation, radiomics-based models have the potential to replace
genetic testing in colorectal cancer, saving both time and money for patients.
Knowing patients’ responses to chemotherapy is also vital for healthcare professionals in designing
personalized treatment plans. It has been shown that 15%–27% of patients achieve a complete response
to chemotherapy or radiation therapy, avoiding surgery (Maas et al., 2010; Sanghera, Wong,
McConkey, Geh, & Hartley, 2008). However, assessing a patient’s response to colorectal cancer treatment is
challenging. Using radiomics features from T2w and diffusion-weighted (DWI) images, several
radiomics-based models achieved high AUCs ranging from 0.93 to 0.98 (Horvat et al., 2018;
Nie et al., 2016). These findings suggest the potential of using radiomics features to assess patients’
responses to therapies before surgery.
Besides assessing responses, radiomics models were also built to differentiate low-risk and high-risk
patients with colorectal cancer. A recent study found that a radiomics model was able to differentiate
high-risk from low-risk patients based on their preoperative CT with an AUC of 0.84 (Meng et al., 2018).
Pancreatic cancer
Several radiomics studies have been conducted in the pancreatic cancer domain. In a single-cohort study,
Eilaghi et al. found that features named “dissimilarity” and “inverse difference normalized” are
associated with overall survival in patients with resectable Pancreatic Ductal Adenocarcinoma (Eilaghi
et al., 2017). In other studies, radiomics features were found to be predictive of patients’ responses to
chemoradiation therapy (X. Chen et al., 2017; Cozzi et al., 2019). In a recent multi-cohort study,
Khalvati et al. found that features from the PyRadiomics feature bank can be fused into a signature
that is predictive of overall survival (Khalvati, Zhang, Baig, et al., 2019). Further validation of these
features and signatures is needed to assess their prognostic performance.
1.2.7 Limitations of traditional radiomics analytic pipeline
Although previous studies have found several radiomic features with significant associations with
clinical outcomes, including survival and recurrence, across different types of cancer, traditional radiomics
pipelines have several drawbacks, including multiple testing, sample size, performance, interpretability,
reproducibility, and reliability (Lambin et al., 2017; Yip & Aerts, 2016).
Multiple testing, also called the multiple comparison problem, is one of the common flaws of radiomics
studies. It occurs when researchers conduct a set of statistical inferences simultaneously, inducing
potential false-positive findings (Yucheng Zhang et al., 2017). Since feature banks are large, thousands
of features are extracted and tested. Although a higher number of features provides more information
about medical images, the number of tests also grows, making the multiple comparison problem
even worse. Setting α as 0.05, we expect to see five significant results from 100 tests using random data.
Hence, in the radiomics field, since the number of features is usually large (e.g., above 1,000), the impact
of the multiple testing problem is significant and unavoidable. It is even more problematic when we consider
the probability of encountering at least one false positive. Given α as the false-positive rate for a single test
and m as the number of tests, the family-wise error is calculated as follows:
Family-wise Type I Error Rate (FWER) = Prob(at least one false positive) = 1 − (1 − α)^m
With 100 tests, the chance of encountering at least one false positive is 0.9941. In most radiomics
studies, the number of tests is much higher than 100; hence, without multiple testing control, there is a
high chance of false-positive findings. The Bonferroni correction, a standard multiple testing control
method, was designed to control this FWER (“Etymologia: Bonferroni correction.,” 2015). Given the
probability formula,
FWER = 1 − (1 − α)^m

we can derive that:

α′ = 1 − (1 − FWER)^(1/m)
Under this correction, we reject the null hypothesis when the p-value is below α′. Given m = 100 and
FWER at 0.05, this exact form (strictly, the Šidák correction; Bonferroni’s simpler approximation is
α′ = α/m) gives a new α′ of 0.000513. In a typical radiomics study, where more than one thousand
features are tested simultaneously, the critical value must be very small to keep the family-wise error
rate (FWER) at the 0.05 level.
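The numbers above can be reproduced in a few lines of Python:

```python
# Sketch of the formulas above with alpha = 0.05 and m = 100 tests.
alpha, m = 0.05, 100

# Probability of at least one false positive across m independent tests
fwer = 1 - (1 - alpha) ** m              # ~0.9941

# Exact per-test threshold keeping the FWER at 0.05 (the Sidak form)
alpha_exact = 1 - (1 - alpha) ** (1 / m)  # ~0.000513

# Bonferroni's simpler approximation of the same threshold
alpha_bonferroni = alpha / m              # 0.0005
```

For m in the thousands, both thresholds shrink towards zero, which is exactly why so few features survive family-wise control in large feature banks.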
However, Bonferroni correction assumes that each test is independent, which is not necessarily true in
radiomics studies since many features share similar formulas. In some cases, a feature can be a linear
combination of other features. Under this condition, the Bonferroni correction may be too conservative,
leading to more false negatives (Type II errors). Thus, in recent studies, an increasing number of
researchers have used the FDR (False Discovery Rate) control defined by Benjamini and Hochberg (S.-Y. Chen,
Feng, & Yi, 2017; Horvat et al., 2018). The false discovery rate is defined below:
FDR = False Positives / (False Positives + True Positives)
FDR control offers an approach that may increase testing power while bounding the error rate
(S.-Y. Chen et al., 2017). In practice, the threshold of the Benjamini–Hochberg FDR control is calculated as:

T_BH = max{ P_(i) : P_(i) ≤ (i/m)·α, 1 ≤ i ≤ m }

where P_(1) ≤ P_(2) ≤ … ≤ P_(m) are the p-values sorted in ascending order.
Compared to Bonferroni, FDR control offers more power, which is favoured by researchers in the
radiomics field. Several studies have been published using the FDR control (Coroller et al., 2015;
Khalvati, Zhang, Baig, et al., 2019). A systematic review paper also suggested that, in future radiomics
studies, FDR control should play an important role (Parekh & Jacobs, 2016; Yip & Aerts, 2016).
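The step-up procedure behind this threshold can be sketched in a few lines of Python; the p-values below are invented purely for illustration:

```python
# Sketch of the Benjamini-Hochberg step-up procedure described above.
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices of the hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ascending p-values
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:   # the largest such rank sets T_BH
            k_max = rank
    return sorted(order[:k_max])         # reject everything up to that rank

benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9])
# [0, 1]: only the two smallest p-values survive at q = 0.05
```

Note that a plain Bonferroni threshold of 0.05/8 ≈ 0.0063 would reject only the single smallest p-value here, which illustrates the extra power of FDR control.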
Furthermore, with a large number of features, radiomics studies generally have small sample sizes,
leading to the “large P, small N” problem, where the number of features is much larger than the
sample size (Yucheng Zhang et al., 2017). Most radiomics studies include fewer than 500 patients; a
recent study was even published using images from only eight patients (Nguyen et al., 2018).
A limited sample size reduces the statistical power of tests and further hinders the performance of any
radiomics-based model. A list of recent studies is presented in Table 1.4 along with their sample sizes
and performance.
Table 1.4: List of recent radiomics studies and their performance in AUC

Domain              Sample size   Performance                          Reference
Pancreatic (PET)    139           Overall survival, AUC: 0.66          (Cui et al., 2016)
Breast (MR)         89            Cancer recurrence, AUC: 0.88         (H. Li et al., 2016)
Lung (CT)           282           Overall survival, AUC: 0.72          (Y. Huang et al., 2016)
Lung (CT)           113           Distant metastasis, AUC: 0.67        (Huynh et al., 2016)
Lung (CT)           182           Distant metastasis, AUC: 0.61        (Coroller et al., 2015)
Lung (CT)           196           Cancer screening, AUC: 0.83          (Y. Huang et al., 2016)
Lung (CT)           422           Overall survival, AUC: 0.65          (Aerts et al., 2014)
Colorectal (PET)    326           Overall survival, AUC: 0.74          (Y.-Q. Huang et al., 2016)
Oesophageal (CT)    106           Treatment response, AUC: 0.75        (Cunliffe et al., 2015)
Oesophageal (PET)   217           Overall survival, AUC: 0.77          (van Rossum et al., 2016)
It is clear that most studies have sample sizes under 500 and AUCs below 0.8. In terms of performance,
most studies are still far from the clinical standard. Multi-center collaboration may address these problems,
and several recent studies were conducted in this manner (Aerts et al., 2014; Khalvati, Zhang, Baig, et
al., 2019). However, ethics approvals and the protection of patients’ privacy make multi-center studies
difficult to conduct.
Interpretability is another issue with current radiomics studies. Though hundreds of significant features
have been found, researchers and clinicians still have a limited understanding of the biological nature of
these features (Gillies et al., 2015). Compared to semantic features, radiomics features generally lack
visual descriptions (Morin, 2018). This makes healthcare professionals more reluctant to integrate
radiomics into clinical practice (Morin, 2018). Without a doubt, future radiomics research should
address this limitation.
Last but not least, reproducibility and reliability are further limitations of radiomics studies (Traverso,
Wee, Dekker, & Gillies, 2018). As discussed in the pipeline, radiomics studies involve image
acquisition, segmentation, feature extraction, and feature analysis. This complex process adds a
significant amount of variation (Khalvati, Zhang, Baig, et al., 2019; B. Zhao et al., 2016). Different
centers have different CT or MRI scanners, which may have different signal-to-noise profiles.
Additionally, manual segmentation depends heavily on the experience of radiologists. Furthermore,
different feature banks or programming languages may also affect feature extraction (van
Griethuysen et al., 2017). Finally, feature preprocessing before the analysis and the parameters used in
the classification model affect the model’s performance as well. In the end, these variations lead to
non-reproducible studies (Lambin et al., 2012). Fortunately, researchers have recognized the issue and
started working on the IBSI (Image Biomarker Standardization Initiative) and the Radiomics Quality
Score (Sanduleanu et al., 2018; Zwanenburg, Leger, Vallières, & Löck, 2016). These efforts should
improve the quality and reproducibility of radiomics studies.
1.3 Deep learning in medical imaging
1.3.1 Neural Network and CNN
As discussed above, radiomics has been developed over decades, and the performance of radiomics
models is approaching a plateau (Lao et al., 2017). As deep learning has gained public attention, deep
learning techniques are playing an increasingly important role in medical imaging studies (Litjens, Kooi,
Bejnordi, Setio, et al., 2017; Thomaz, Carneiro, & Patrocinio, 2017; van Griethuysen et al., 2017; R.
Yamashita et al., 2018). As a deep learning architecture specialized for image-related tasks, the
Convolutional Neural Network (CNN) has become the preferred method in medical imaging studies
(R. Yamashita et al., 2018).
The development of CNNs started in 1962, when Hubel and Wiesel found that some neurons in the visual
cortex of the brain respond only to edges of certain orientations (Hubel & Wiesel, 1968). Inspired by
this, Fukushima proposed a self-organizing neural network model for pattern recognition in 1980 (Fukushima,
1980). Later, using backpropagation, Yann LeCun developed LeNet, which is considered the
predecessor of modern CNN models (LeCun et al., 1990). However, the performance of early
CNNs was limited: although CNNs proved effective in handwritten digit recognition,
traditional feature-based machine learning models performed better on general tasks.
In 2012, AlexNet from Hinton’s lab reversed this trend. By introducing the new activation function ReLU
and dropout, AlexNet was deeper (having more layers) than previous CNN models. In
ImageNet-2012, AlexNet achieved a top-5 error rate of 18.9%, significantly lower than that of
previous models (Krizhevsky et al., 2012). The success of AlexNet changed scientists’ minds and
sparked a “deep learning revolution”. To implement CNN architectures effectively in medical imaging
studies, it is important to understand the components of Convolutional Neural Networks. As such, a
detailed discussion is provided in the following sections.
A typical CNN consists of multiple layers, including convolutional layers, pooling layers, and fully
connected layers. As input images pass through these layers, they are converted into feature maps,
enabling the CNN to make classifications (B, 2013; Krizhevsky et al., 2012). A simplified example of
the CNN architecture is shown in Figure 1.2 below.
Figure 1.2: Typical CNN architecture
Convolution layer
The convolution layer is the foundation of the CNN architecture (Krizhevsky et al., 2012). Convolution
is a linear operation in which a small array of weights, called a kernel, is applied across the input image.
Since digital images are stored as arrays of numbers, the convolution operation generates a feature map,
as shown in Figure 1.3 (R. Yamashita et al., 2018).
Figure 1.3: Graphical presentation of convolution operations
A. Convolution operations for a 5×5 input tensor, step 1
B. Convolution operations for a 5×5 input tensor, step 2
C. Convolution operations for a 5×5 input tensor, step 9
The results of convolution operations are influenced by several parameters, including the weights in the
kernel, the stride, the size of the kernel, and padding (R. Yamashita et al., 2018). First, changing the weights in
a kernel changes the final feature map. During training, the weights in the kernels are tuned so
that the generated features provide useful information. Second, since the kernel moves across the
input image in a step-by-step manner, the distance of each step, defined as the stride, is
critical (Dů et al., n.d.). A larger stride induces faster down-sampling of the input image. Third, the
size of the kernel is also an important hyperparameter. It ranges from the most common 3×3 to 5×5 or
even 7×7. A smaller kernel generally generates local features and reduces image dimensions slowly,
allowing the network to be deeper, which usually offers better performance (H. Liu, Li, Lv,
& Huang, 2017). In contrast, larger kernels have a larger receptive field and reduce image dimensions
quickly. Taking the example from Figure 1.3, to reduce a 5×5 image tensor to 1×1, one can choose
between two layers of 3×3 kernels or one 5×5 kernel. The former approach is more popular since it has
fewer weights (2×3×3 compared to 5×5) and offers an extra layer, which may provide better performance
(Litjens, Kooi, Bejnordi, Setio, et al., 2017).
Last but not least, padding is another critical factor. Padding was developed to control the dimension
reduction of input images (Krizhevsky et al., 2012; R. Yamashita et al., 2018). Zero-padding is the
most common type: it adds columns and rows of zeros on each side of the input image, as
shown in Figure 1.4 below (R. Yamashita et al., 2018). After padding, convolution operations can be
performed without reducing image dimensions, so that the model can afford deeper layers.
Figure 1.4: Graphical representation of zero-padding
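The convolution, stride, and zero-padding behaviour described above can be sketched in a few lines of NumPy; the input values and kernel here are arbitrary, and only the output shapes matter:

```python
import numpy as np

# Sketch of the convolution operation with stride and zero-padding.
def conv2d(image, kernel, stride=1, pad=0):
    if pad:
        image = np.pad(image, pad)                 # rows/columns of zeros
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)     # element-wise product, summed
    return out

x = np.arange(25, dtype=float).reshape(5, 5)       # a 5x5 input tensor
k = np.ones((3, 3))                                # a 3x3 kernel

conv2d(x, k).shape            # (3, 3): the output shrinks without padding
conv2d(x, k, pad=1).shape     # (5, 5): zero-padding preserves the dimensions
conv2d(x, k, stride=2).shape  # (2, 2): a larger stride down-samples faster
```

In a real CNN, the kernel weights are not fixed ones as here; they are the parameters tuned during training.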
Activation layer
Feature maps generated by convolution layers typically pass through subsequent activation layers (R.
Yamashita et al., 2018). The most common activation function is the ReLU (rectified linear unit), which
gives output based on the following formula:
f(x) = max(0, x)
The rationale behind ReLU is that it provides a non-linear transformation (Krizhevsky et al., 2012; R.
Yamashita et al., 2018). Without the non-linear activations, the deep learning network is essentially a
linear model, which is not suitable for modeling real-world non-linear relationships. Other common
non-linear activation functions include the sigmoid (logistic) activation function, the hyperbolic tangent
activation function, and derivatives of the original ReLU. As discussed above, the logit transformation
extends the range; thus, a “reversed logit transformation”, also called the sigmoid function, restricts the
range to (0, 1) based on the following formula:
f(x) = 1 / (1 + exp(−x))
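Both activations can be written directly from their formulas; a minimal sketch:

```python
import math

# Sketch of the two activation functions above, applied to single values.
def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

relu(-2.0)    # 0.0 -- negative inputs are clipped to zero
relu(3.5)     # 3.5 -- positive inputs pass through unchanged
sigmoid(0.0)  # 0.5 -- the output is restricted to the range (0, 1)
```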
Pooling layer
Pooling layers are commonly applied in CNNs to reduce the number of parameters that need to be
trained. Max-pooling, one of the most common pooling operations, extracts patches from the input
feature maps and returns the maximum value in each patch (R. Yamashita et al., 2018). The most
frequently used patch size is 2×2, which down-samples each spatial dimension by a factor of 2,
significantly reducing the number of trainable parameters.
Figure 1.5: Graphical representation of max pooling
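The max-pooling operation described above can be sketched as follows; the feature-map values are arbitrary:

```python
import numpy as np

# Sketch of 2x2 max pooling: each non-overlapping 2x2 patch of the feature
# map is reduced to its maximum value, halving each spatial dimension.
def max_pool(fmap, size=2):
    out_h, out_w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

f = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 1.],
              [0., 2., 9., 7.],
              [1., 1., 5., 8.]])
max_pool(f)   # [[6., 2.], [2., 9.]] -- a 4x4 map becomes 2x2
```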
Fully Connected layer
Figure 1.6: Graphical representation of Fully Connected Layers
Through a series of convolution, activation, and pooling layers, input images are transformed into
2D or 3D feature maps. These feature maps are flattened, as shown in Figure 1.6, and used as the input to
fully connected (FC) layers (Krizhevsky et al., 2012; LeCun, Bengio, & Hinton, 2015). These layers are
called fully connected because each neuron has full connections to all activations in the previous layer.
Each neuron in an FC layer processes the input vector x and returns its output using the formula
below:
output = g(Wx + b)
In the formula, x is the input vector, b is the bias vector, and W is a weight matrix. Finally, g is
the activation function. Through a series of such calculations, the final FC layers can generate
probabilities or classifications of the target outcomes.
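As a minimal numeric sketch of output = g(Wx + b), with ReLU as g; the sizes (3 inputs, 2 neurons) and all values are arbitrary:

```python
import numpy as np

# One fully connected layer: output = g(Wx + b).
W = np.array([[0.5, -1.0, 2.0],    # one row of weights per neuron
              [1.0,  0.0, -0.5]])
b = np.array([0.1, -0.2])          # one bias per neuron
x = np.array([1.0, 2.0, 3.0])      # flattened feature vector

output = np.maximum(0.0, W @ x + b)   # ~[4.6, 0.0]: the second neuron is inactive
```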
Training a CNN
As discussed above, the weights in the kernels and FC layers have a significant impact on the final
outputs. Hence, mathematically, training a CNN means finding optimal weights such that the difference
between the model’s output and the ground truth is minimized. Loss functions are mathematical formulas
that measure this difference (LeCun et al., 2015). The most common loss function for classification is the
cross-entropy loss, presented below:
Cross-Entropy Loss = −( y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) )

where y_i is the true outcome labeled as 0 or 1, and ŷ_i stands for the predicted probability.
Cross-entropy loss penalizes confident but wrong predictions. Hence, minimizing the cross-entropy
loss shapes the model’s output to be similar to the ground truth (R. Yamashita et al., 2018). In
practice, model performance is measured by the loss function in a forward pass over the training data,
while the backpropagation and gradient descent algorithms allow the model to update the weights in its
kernels and FC layers (Lecun, Bottou, Bengio, & Haffner, 1998). The gradient descent algorithm
can be described by the following equation.
Repeat until convergence: w ← w − α · ∂L/∂w
Figure 1.7: Graphical representation of the gradient descent algorithm
In this equation, L is the loss function to be minimized, w is the weight vector, and α is the learning
rate. In every iteration, the weights are updated by subtracting the gradient of the loss function with
respect to the weights, scaled by the learning rate (R. Yamashita et al., 2018). For a larger α, the
weights are updated in larger steps towards the minimum. However, a large learning rate can
overshoot the minimum, so training fails to converge or even diverges. It is worth noting that, if
∂L/∂w is small, learning will also be slow. In other words, if the gradients “vanish”, the model cannot
be trained successfully (He, Zhang, Ren, & Sun, 2015). In a deep model, since the gradients of early
layers are obtained by multiplying the gradients of later layers, the gradient vanishes quickly. Hence, a
cap exists on the depth of traditional CNNs.
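The cross-entropy loss and the gradient descent update above can be tied together in a small sketch: a one-parameter logistic model trained on four invented (x, label) pairs. A CNN applies exactly the same rule to every weight via backpropagation.

```python
import math

# Cross-entropy loss of a one-parameter logistic model, minimized with the
# update w <- w - alpha * dL/dw. The data points are invented for illustration.
def cross_entropy(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, alpha = 0.0, 0.5

for _ in range(200):  # "repeat until convergence"
    # dL/dw of the cross-entropy for a logistic model is (y_hat - y) * x
    grad = sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)
    w -= alpha * grad                    # the gradient descent step

loss = sum(cross_entropy(y, sigmoid(w * x)) for x, y in data) / len(data)
# w has moved to a positive value and the average loss is now small
```

Rerunning the loop with a much larger alpha illustrates the overshooting behaviour described above.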
1.3.2 ResNet
It was hypothesized that a deeper CNN would be able to extract and utilize more complex features from
images. However, it was found that a 20-layer CNN has a lower error rate than a 56-layer CNN
(He et al., 2015), a degradation related to the vanishing gradient problem discussed above. To address
this, He et al. developed a new architecture called the residual block, shown below (He et al., 2015).
Figure 1.8: Graphical representation of identity path (He et al., 2015)
Compared to traditional CNNs, residual blocks have an additional connection, the “identity shortcut
connection”, which skips layers and transmits information directly. By doing so, the vanishing gradient
is controlled. In image recognition tasks, a 34-layer ResNet outperforms traditional CNNs by a
significant margin (He et al., 2015). In ImageNet classification, ResNet achieved a top-5 error rate of
3.57%, far better than that of its predecessor, AlexNet (Krizhevsky et al., 2012).
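The identity shortcut can be sketched in a few lines; F(x) below is a stand-in of two small dense layers with random placeholder weights, whereas real residual blocks use convolution layers:

```python
import numpy as np

# Minimal sketch of a residual block: output = g(F(x) + x).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))     # placeholder weights, not trained values
W2 = rng.normal(size=(4, 4))

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x):
    fx = W2 @ relu(W1 @ x)   # the residual transformation F(x)
    return relu(fx + x)      # identity shortcut: add the input back unchanged

x = np.ones(4)
y = residual_block(x)        # same shape as the input, so blocks can be stacked
```

Because the shortcut passes x through untouched, gradients can flow through the addition even when F's layers would otherwise attenuate them, which is how the residual connection mitigates the vanishing gradient problem.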
1.3.3 Transfer learning
As discussed above, training a CNN means tuning the weights in its kernels and FC layers. However, since
there is a large number of learnable parameters, a large sample size is required to train a
CNN successfully. It has been shown that the scale of a deep learning model and the size of the required data
have a linear relationship (Tan et al., 2018). Solving complex tasks in the medical imaging domain
therefore requires a large amount of data. However, collecting clinical data is time-consuming and
expensive (Tan et al., 2018). As discussed above, the small-data problem has become a critical obstacle
in most medical imaging studies, especially for rare diseases (Yip & Aerts, 2016). Even for common
diseases, ethics approvals and expert annotations are required before any experimental study, which is
without a doubt a time-consuming process. Transfer learning, a recently developed method, may offer
an alternative solution to the sample size limitation.
Transfer learning is defined as improving the learning in the target task by leveraging knowledge from
the source domain (Torrey & Shavlik, n.d.). It relieves the need for a large sample size, enabling
researchers to train a successful model using limited data (Tan et al., 2018). Tan et al. provided a
mathematical definition of transfer learning as shown below:
Definition of transfer learning: Given a learning task T_t based on D_t, where we can
get help from D_s for the learning task T_s, transfer learning aims to improve the
performance of the predictive function f_t(·) for learning task T_t by discovering and
transferring latent knowledge from D_s and T_s, where D_s ≠ D_t and/or T_s ≠ T_t. In
addition, in most cases, the size of D_s is much larger than the size of D_t
(N_s ≫ N_t) (Tan et al., 2018).
According to Tan et al., deep transfer learning methods can be divided into four categories, namely,
instance-based deep transfer learning, mapping-based transfer learning, network-based transfer learning,
and adversarial based transfer learning (Tan et al., 2018). In medical imaging-related tasks, network-
based deep transfer learning is the most relevant. Details of this transfer learning method will be
discussed below.
Network-based transfer learning is defined as reusing part of, or the full, network pre-trained in the
source domain, including its structure and weights (Tan et al., 2018). It applies to most CNN-based
deep learning models, since convolution layers can be considered feature extractors (Krizhevsky et al.,
2012; Lao et al., 2017; LeCun et al., 2015). Hence, network-based transfer learning is practical for
image-related tasks. A CNN can be pre-trained on a large dataset such as ImageNet, which contains 14
million images. By doing so, the CNN learns to extract useful information about shapes, textures, and
other features from images using optimized kernels. This ability can be transferred to a new model with a
small target domain by adopting the convolution layers. Depending on the sample size and the similarity
between the target and source domains, network-based transfer learning can be performed in two ways:
the fine-tuning method and the fixed feature extraction method, as shown in Figure 1.9 (R. Yamashita et
al., 2018).
Figure 1.9: Graphical representation of transfer learning in CNN
The fixed feature extraction method is straightforward: the convolutional base of the model is frozen and
these convolution layers are used as a feature extractor (D. George, Shen, & Huerta, 2017). As discussed
above, optimized convolution layers can extract shape and texture information. It has been shown that
the top layers of a CNN extract general features, while deeper layers capture details related to the
outcome labels (Zeiler & Fergus). Hence, when performing transfer learning, researchers need to
determine the depth of the feature extractor. When the target and source domains are similar (e.g.,
lung CT versus pancreas CT), features can be extracted from deeper layers (LeCun et al., 2015).
However, when a disparity exists between the target and source data (e.g., pancreas CT versus natural
images), the top layers should be used to generate general features (D. George et al., 2017).
Compared to the fixed feature extraction method, the fine-tuning method is more sophisticated, since it
not only adopts the convolution layers but also fine-tunes some of the deeper layers. Consequently, the
generated features are optimized for the target domain (R. Yamashita et al., 2018). However, this method
requires a larger dataset for fine-tuning, which limits its applications.
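As a conceptual sketch of the fixed feature extraction method: the random "pretrained" filters below stand in for convolution layers optimized on a large source domain (e.g., ImageNet), they are frozen, and only the small classifier on top is trained. The images and labels are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)
pretrained_filters = rng.normal(size=(8, 3, 3))   # frozen: never updated below

def extract_features(image):
    # Global max response of each frozen filter: a crude stand-in for the
    # convolution + pooling stages of a pretrained CNN.
    feats = []
    for filt in pretrained_filters:
        responses = [
            float(np.sum(image[i:i + 3, j:j + 3] * filt))
            for i in range(image.shape[0] - 2)
            for j in range(image.shape[1] - 2)
        ]
        feats.append(max(responses))
    return np.array(feats)

images = [rng.normal(size=(8, 8)) for _ in range(20)]    # small target dataset
labels = rng.integers(0, 2, size=20).astype(float)
X = np.stack([extract_features(img) for img in images])  # (20, 8) deep features

w = np.zeros(8)                      # only these weights are trained
for _ in range(100):                 # logistic regression on the deep features
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.01 * X.T @ (p - labels) / len(labels)
```

Fine-tuning would additionally allow updates to some of the deeper filters themselves, which requires more target data.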
Transfer learning enables “deep feature extraction” from images of the target domain (Afshar,
Mohammadi, Plataniotis, Oikonomou, & Benali, n.d.). In medical imaging, this new method is often
called “deep learning-based radiomics” or “deep radiomics”. It is hypothesized that deep radiomics
will outperform traditional radiomics, since CNNs are able to extract outcome-related features. Studies
in deep radiomics have started only recently, and more comprehensive investigations are required on
this topic.
1.3.4 Deep learning in medical imaging research
As deep learning develops at a fast pace, a large number of deep learning studies have been published
in the context of medical imaging. Since 2012, more than five hundred papers have been published,
focusing on segmentation, object detection, and exam result classification (Litjens, Kooi, Bejnordi,
Arindra, et al., 2017). The following section provides a brief overview of these studies.
Detecting abnormalities in medical images is routine work for clinicians. However, it is one of the
most labor-intensive tasks (Litjens, Kooi, Bejnordi, Arindra, et al., 2017). To help clinicians work more
efficiently, studies in this area started years ago, when Lo et al. trained a 4-layer CNN for nodule
detection in x-ray images (Litjens, Kooi, Bejnordi, Arindra, et al., 2017; Lo et al., 1995). As an image-
based network, the CNN is currently one of the most popular methods in abnormality detection. In a
recent study, using 224,316 chest radiographs from 65,240 patients, researchers from Stanford University
trained a 121-layer CNN, achieving an AUC of 0.94 for detecting pleural effusion and an AUC of 0.86 for
atelectasis
detection (Irvin et al., n.d.). This CNN-based network was found to have performance similar to that of
human experts (Irvin et al., n.d.). The data for this study have been published, and an increasing number
of research groups are working on this challenge, aiming to improve the performance (Irvin et al., n.d.).
In another recent study, Lakhani et al. trained AlexNet and GoogleNet for pulmonary tuberculosis
detection using 1007 chest radiographs (Lakhani & Sundaram, 2017; R. Yamashita et al., 2018). The
final network achieved AUC of 0.99 for differentiating tuberculosis from healthy cases (Lakhani &
Sundaram, 2017; R. Yamashita et al., 2018). A large-scale study in the Netherlands also confirmed the
potential of CNN-based detection systems (Kooi et al., 2017). Kooi et al. designed a CNN-based
computer-aided diagnosis (CAD) tool using 45,000 mammography images. In abnormality detection,
this model outperformed a traditional feature-based model by a large margin (Kooi et al., 2017).
In lesion detection tasks, CNNs have been applied not only to X-ray, CT, and MR images, but also to
color retina images. On the EyePACS-1 and Messidor-2 datasets, CNNs reached 97.5% sensitivity and
93.4% specificity in diabetic retinopathy detection (Gulshan et al., 2016; Pratt, Coenen, Broadbent,
Harding, & Zheng, 2016). Another large-scale study using 80,000 retina images achieved 75% accuracy
in detecting exudates, hemorrhages, and microaneurysms (Chandrakumar & Kathirvel, n.d.).
In addition to detection, CNNs can also be trained to differentiate or classify abnormalities into different
categories. Image classification is also one of the first areas in which deep learning made a major
contribution to medical image analysis (Litjens, Kooi, Bejnordi, Arindra, et al., 2017). A study
conducted in Japan confirmed the potential of CNNs in subgroup classification. Yasaka et al.
trained a CNN using 55,536 CT images and achieved an AUC of 0.92 for differentiating liver masses
(Yasaka, Akai, Abe, & Kiryu, 2018). In another dataset of 2,000 images, a CNN achieved 90.1% accuracy
in nodule classification, significantly higher than the traditional radiomics approach, which had
an accuracy of 61% (Lai & Deng, 2018).
The studies discussed above clearly benefited from large sample sizes. However, in small-sample
settings, CNNs can still achieve acceptable performance with transfer learning (He, Girshick, &
Dollár, 2018; Pan & Yang, 2010; Yosinski, Clune, Bengio, & Lipson, 2014). In tuberculosis
classification tasks, fine-tuning convolution layers elevated the accuracy rates from 53.4% to 57.6%
(Antony, McGuinness, Connor, & Moran, 2016). Using a similar approach, by fine-tuning an ImageNet
pre-trained model, researchers achieved near-expert performance in skin cancer classification (Esteva
et al., 2017).
In another study, using a pre-trained CNN as a feature extractor, the model achieved 70.5% accuracy in
cytopathology image classification (Kim, Corte-Real, & Baloch, 2016). Radiomics features have also been
added to transfer learning studies: Lao et al. fused transfer learning features with traditional radiomics
features for glioblastoma prognosis and showed improved prognostic performance (Lao et al., 2017). It
has been shown that, compared to training a model from scratch, transfer learning models have superior
performances in terms of accuracy and computation time when the sample size is below 1000 (He et al.,
2018; Menegola, Fornaciali, Pires, Avila, & Valle, 2016). Hence, transfer learning methods will play an
increasingly critical role in future medical imaging research.
For some lesion classification tasks, both local information on lesion appearance and global contextual
information on lesion location are needed (Litjens, Kooi, Bejnordi, Arindra, et al., 2017). To address this
issue, researchers started to develop multi-stream architectures in which several models are built
simultaneously (Yuexiang Li, Shen, Li, & Shen, 2018). Combinations of pre-trained and
trained-from-scratch CNNs can also work together for better performance (Gao, Lin, & Wong, 2015).
Moreover, deep learning models were applied to segmentation and denoising process (Christ et al., n.d.;
Cires¸ancires¸an et al., n.d.; Litjens, Kooi, Bejnordi, Setio, et al., 2017; Oktay et al., n.d.; Razzak et al.,
n.d.; Tajbakhsh et al., 2017). In 2012, Ciresan et al. developed a deep neural network algorithm for
neuron segmentation (Cires¸ancires¸an et al., n.d.). The network was used as a pixel classifier. It took a
square image (patch) as an input and gave a probability of being the neuron membrane for the central
pixel. At the ISBI 2012 conference, this network won the segmentation challenge (Ronneberger,
Fischer, & Brox, 2015). However, there are two major limitations of this network. The first limitation is
computation time. Since the model only provides the probability of a limited number of pixels in a patch
(a square image), segmentation for a large image needs a large number of patches, resulting in a
significant demand for computation power. Secondly, this network has a trade-off between prediction
accuracy and patch size. Small patches have higher accuracies; however, the network can only see a
little context in small patch settings (Ronneberger et al., 2015).
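The computational burden of patch-wise pixel classification can be made concrete with a quick count: one forward pass is needed per labelled pixel, so the number of passes grows with the image area. A back-of-the-envelope sketch, using an illustrative 512×512 slice size rather than a dimension from the cited study:

```python
# Back-of-the-envelope cost of patch-wise pixel classification: one forward
# pass per labelled pixel. The 512x512 slice size is illustrative, not a
# dimension taken from the cited study.

def patches_needed(height, width, stride=1):
    """Number of patches (forward passes) to label every stride-th pixel."""
    rows = (height + stride - 1) // stride
    cols = (width + stride - 1) // stride
    return rows * cols

print(patches_needed(512, 512))  # 262144 passes for one slice
# A fully convolutional network, by contrast, labels the slice in one pass.
```

Even labelling only every second pixel in each direction still requires tens of thousands of passes, which is why fully convolutional architectures such as U-Net replaced this scheme.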
Ronneberger et al. proposed another deep learning architecture called U-Net, which is built upon a fully
convolutional neural network with two paths (Ronneberger et al., 2015). The first path consists of a
traditional convolutional neural network to capture the context in images (Krizhevsky et al., 2012).
Additionally, the second path is used to enable precise localization using transposed convolution. U-Net
outperforms previous deep learning-based segmentation approaches in terms of speed and the
adaptability to small sample sizes (Ronneberger et al., 2015). Because of these characteristics, U-Net has
recently become the most popular segmentation method in medical imaging studies. Table 1.5 below
presents a list of recent segmentation studies using U-Net.
Table 1.5: List of representative segmentation studies in the medical imaging field

Domain | Sample size | Findings | Reference
Pancreas segmentation (CT) | 150 abdominal CT scans | Dice score: 0.840±0.087; inference time: 0.179 s | (Oktay et al., n.d.)
Liver and tumor segmentation (CT) | 100 abdominal CT scans | Dice score: 0.943 | (Christ et al., n.d.)
Rectal cancer (CT) | 278 patients | Dice score: 0.934 (CTV), 0.921 (bladder) | (Men, Dai, & Li, 2017)
Multiorgan segmentation (CT) | 331 contrast-enhanced abdominal CT images | Dice scores: artery 0.79, vein 0.73, liver 0.93, spleen 0.91, stomach 0.84, pancreas 0.63 | (Roth et al., 2017)
Retina blood vessel segmentation | 40 colour retinal images | Dice score: 0.8142 | (Alom, Hasan, Yakopcic, Taha, & Asari, 2018)
Retina blood vessel segmentation | 20 colour retinal images | Dice score: 0.8373 | (Alom et al., 2018)
Retina blood vessel segmentation | 28 colour retinal images | Dice score: 0.7783 | (Alom et al., 2018)
Liver segmentation (CT) | 20 venous phase enhanced CT | Dice score: 0.94 | (Christ et al., 2016)
Pancreas segmentation (CT) | 147 contrast-enhanced abdominal CT scans | Dice score: 0.897±0.038 | (Oda, Shimizu, Oda, et al., 2018)
Pancreas segmentation (CT) | 281 clinical CT | Dice score: 0.739±0.152 | (Oda, Shimizu, Roth, et al., 2018)
Prostate segmentation (DWI) | 104 patients | Dice score: 0.93 | (Clark et al., 2017)
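The "precise localization" path of U-Net upsamples feature maps with transposed convolution. The operation itself is compact enough to sketch in numpy; the kernel here is a placeholder of ones rather than a learned weight:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Transposed convolution: each input pixel scatters a weighted copy of
    the kernel into a larger output map, upsampling the feature map."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # a 2x2 feature map
k = np.ones((2, 2))          # placeholder 2x2 kernel (learned in a real U-Net)
y = transposed_conv2d(x, k)  # -> 4x4 map: each pixel expands to a 2x2 block
```

With stride 2 and a 2×2 kernel the scattered blocks tile exactly, so each input value simply becomes a 2×2 block in the output; with learned kernels and overlapping strides, the decoder recovers spatial detail lost in the pooling path.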
An open data challenge called MICCAI BraTS (Multimodal Brain Tumor Segmentation challenge) has
become a benchmark for deep learning-based segmentation. From 2013 to 2018, model performance
improved rapidly: Dice scores rose from 0.74 to 0.91, demonstrating substantial progress in
automated segmentation for the medical imaging field (Anwar et al., 2018;
Isensee, Kickingereder, Wick, Bendszus, & Maier-Hein, n.d.; X. Zhao et al., 2018).
1.3.5 Future direction
As discussed above, deep learning methods have achieved success in specific medical imaging tasks. However,
deep learning studies are limited by sample size and interpretability. Although it has been shown that a
state-of-the-art deep learning model can be trained using 1,000 samples even without transfer learning,
most available medical image datasets have much smaller sample sizes (Dů et al., n.d.; He et al., 2018;
B. Liu, Wei, Zhang, Yang, & Kong, 2017). Cho et al. from Massachusetts General Hospital conducted a
study on the impact of sample size by training CNNs using six different sample sizes ranging from 5 to
200 (Cho, Lee, Shin, Choy, & Do, 2016). Their results confirmed that, in medical imaging, classification
tasks using CT images generally require a large sample size (n > 200) (Cho et al., 2016). However,
most popular open-source annotated CT datasets have sample sizes smaller than this number (Aerts et
al., 2014; Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al., 2019; Yucheng Zhang et al., 2017). Sample
size problems limit the application of deep learning models in medical imaging research, especially in
studies of diseases with low incidence rates, including PDAC (Ilic & Ilic, 2016; J. Luo, Xiao, Wu,
Zheng, & Zhao, 2013; Siegel, Miller, & Jemal, 2015).
Additionally, interpretation of CNNs is even more challenging compared to radiomics studies where
features are derived from manually defined formulas. Recent studies have attempted to address this
issue. Zeiler et al. found that top layers of CNNs can extract local patterns, while deeper layers combine
them into more meaningful structures (Zeiler & Fergus). To better visualize how CNNs make decisions,
activation maps were developed by Zhou et al. and Selvaraju et al. (Selvaraju et al., 2016; B. Zhou,
Khosla, Lapedriza, Oliva, & Torralba, n.d.). These studies highlighted that activation maps can help
researchers to establish trust for deep learning models and discern a stronger model from a weaker
network (Selvaraju et al., 2016, 2017). Without a doubt, further research in visual explanations would
facilitate the application of deep learning models in medical imaging research.
Compared to deep learning methods, traditional radiomics has been studied for a much longer period of
time, providing a large number of significant features. These findings should not be neglected.
Researchers hypothesize that combining those radiomics features with deep radiomics features will
contribute to a stronger model (Afshar et al., n.d.). In the following studies, we aimed to compare the
effectiveness of radiomics and deep radiomics (transfer learning) models in a small resectable PDAC
sample and find the optimal way of fusing these two information sources for a better prognosis. In the
third study, we modified the loss function in a CNN model, allowing it to provide an accurate prognosis
of PDAC patients at any given timepoint. Focusing on resectable PDAC patients, these studies will be
beneficial for designing personalized treatment plans for them.
Chapter 2: Aim and hypothesis
2.1 Study 1: Prognostic Value of Transfer Learning Based Features in Resectable
Pancreatic Ductal Adenocarcinoma
2.1.1 Aims
The main aim of this study is to validate and compare the prognosis performance of transfer learning
feature (deep radiomics) extractors and traditional radiomics feature bank in two independent resectable
PDAC cohorts. For both cohorts, CT images, annotations, and clinical outcomes were available. We
built three prognosis models for overall survival using an engineered (pre-defined) radiomics feature
bank, PyRadiomics (van Griethuysen et al., 2017), and two transfer learning feature extractors
pre-trained on ImageNet and lung CT images.
The performances of these three models were measured and compared using the area under the receiver
operating characteristic curve (AUC). Lastly, risk scores generated by these models were tested in Cox
Proportional Hazards models, assessing not only binary classification performance but also the ability to
provide an accurate prognosis (Khalvati, Zhang, Baig, et al., 2019).
Building a high-performance prognosis model using CT images will be beneficial for resectable PDAC
patients. An accurate prognosis model can provide valuable survival information for clinicians, assisting
them in designing an aggressive treatment plan for an aggressive tumor, improving the survival rates for
resectable PDAC patients. Furthermore, as an increasing number of studies shift to deep radiomics,
this pioneering study will provide valuable information for choosing appropriate feature banks in other
small sample size studies.
2.1.2 Hypothesis
We hypothesized that the transfer learning model trained on lung CT images will outperform both the
traditional radiomics-based prognosis model and the transfer learning model pre-trained on natural
images. Furthermore, we hypothesized that deep radiomics features from the transfer learning model
can accurately classify patients into low or high-risk groups, helping clinicians to make effective
treatment decisions.
2.1.3 Rationale for hypothesis
It has been shown that, for glioblastoma prognosis, transfer learning models outperformed traditional
radiomics models (Lao et al., 2017). Compared to traditional radiomics studies where features are
pre-defined, the formulas of deep radiomics features can be optimized for specific tasks, leading to
better performance (D. George et al., 2017; Lao et al., 2017; Tan et al., 2018). However, in the medical
imaging domain, most transfer learning studies use the ImageNet pre-trained model to extract features
imaging domain, most transfer learning studies use the ImageNet pre-trained model to extract features
(Shie, Chuang, Chou, & Wu, n.d.; Ravishankar et al.;
Yosinski et al., 2014). ImageNet contains 14 million images, which are colour-scaled and have different
signal-to-noise profiles compared to CT images. Therefore, we hypothesized that a model pre-trained on
medical images, namely Lung CT images, may improve the prognosis performance for resectable PDAC
patients.
2.2 Study 2: Improving Prognostic Performance through Radiomics and Deep
Learning Features Fusion in Resectable Pancreatic Ductal Adenocarcinoma
2.2.1 Aims
The first aim of this study is to identify the relationship between radiomics and transfer learning
features. We were interested in whether any associations exist between radiomics features and deep
radiomics features. Since deep radiomics studies have been criticized for a lack of interpretability,
testing the association between deep features and manually defined features can provide another
perspective. We wanted to test whether a transfer learning feature extractor can capture information
similar to that identified by pre-defined radiomics features.
Secondly, we aimed to find an optimal method to fuse radiomics features and deep radiomics features.
Although transfer learning methods can achieve high performance given limited data, radiomics studies
have been developed for decades. As a result, a large number of radiomics features have been found to
be associated with clinical outcomes. Thus, these radiomics features are still valuable in this transition
period and should not be discarded.
On the other hand, as transfer learning-based feature extractors provide an increasing number of deep
radiomics features, the dimensions of the feature map are expanding at an unprecedented speed. Hence,
finding an optimal method for feature fusion will benefit future studies in this field. To address that, we
have built four fusion models for PDAC prognosis and tested their performance in an independent
validation cohort. If feature fusion improves the prognosis performance, the model will be able to
provide more accurate prognostic information for healthcare professionals, and resectable PDAC patients
will further benefit from this high-performance prognosis model.
2.2.2 Hypothesis
We hypothesized that significant correlations exist between deep radiomics and engineered radiomics
features. Additionally, combining engineered radiomic features with transfer learning-based deep
radiomic features will improve the prognosis performance. Finally, the ensemble-based fusion method will
outperform feature-based fusion in terms of prognosis performance.
2.2.3 Rationale for hypothesis
A recent study confirmed that deep radiomics feature extractors can extract shape and texture
information (Zeiler & Fergus). In addition, several features in the PyRadiomics feature bank were also
designed to extract this information (van Griethuysen et al., 2017). Thus, we hypothesized that there
exist significant associations between radiomics and deep radiomics features. Identifying this association
profile would provide better interpretations for deep radiomics features and facilitate feature fusion as
the next step in building the prognosis model (Gillies et al., 2015; Razzak et al., n.d.).
Additionally, we proposed four feature fusion methods and hypothesized that the model-based feature
fusion method would provide the best overall performance. It has been shown that ensemble methods
are advantageous in alleviating the small sample size problem by incorporating multiple classification
models to reduce the potential of overfitting (P. Yang, Yang, Zhou, & Zomaya, n.d.). In a typical small
sample size setting (n=98), we therefore hypothesized that ensemble-based feature fusion would
outperform other fusion methods.
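The distinction between feature-level and ensemble-level fusion can be contrasted in a short scikit-learn sketch. The matrices below are synthetic stand-ins for the two feature banks, and the classifier choices are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 98                                  # cohort size noted in the text
X_rad = rng.normal(size=(n, 20))        # stand-in engineered radiomics bank
X_deep = rng.normal(size=(n, 20))       # stand-in deep radiomics bank
y = rng.integers(0, 2, size=n)          # stand-in binary survival label

# Feature-level fusion: concatenate the banks and train a single model.
fused = np.concatenate([X_rad, X_deep], axis=1)
feature_fusion = RandomForestClassifier(n_estimators=100, random_state=0)
feature_fusion.fit(fused, y)

# Ensemble-level fusion: one model per bank, then average their risk outputs.
m_rad = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_rad, y)
m_deep = LogisticRegression(max_iter=1000).fit(X_deep, y)
p_ensemble = (m_rad.predict_proba(X_rad)[:, 1]
              + m_deep.predict_proba(X_deep)[:, 1]) / 2.0
```

In the ensemble variant, each base model sees a lower-dimensional input, which is one reason ensembles are hypothesized to be less prone to overfitting in small samples.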
2.3 Study 3: CNN-based Survival Model for Pancreatic Ductal Adenocarcinoma in
Medical Imaging
2.3.1 Aims
In this study, we aimed to extend the application of CNNs in medical imaging from a binary prediction
of survival to a precise prognosis at any given time point. For cancers with poor prognoses including
PDAC, five-year survival rates are low. Hence, a binary prediction of survival provides limited
additional information for clinicians. In contrast, offering a personalized survival probability curve with
respect to time would be more informative. However, traditional probability mapping methods (e.g., the
Cox Proportional Hazards Model) often rely on linearity assumptions, which limit their applications. In
this study, we utilized a modified loss function, built a CNN-based transfer learning survival model
(CNN-Survival), and compared the performance of this model to a traditional radiomics model using the
concordance index.
2.3.2 Hypothesis
We hypothesized that the CNN-Survival model will outperform the traditional radiomics based Cox
Proportional Hazards Model and provide a better mapping for patients’ survival patterns.
2.3.3 Rationale for hypothesis
As discussed above, due to the presence of non-linear activation functions in CNNs (e.g. ReLU),
the output of the CNNs will have a non-linear relationship with the input. Compared to the traditional
Cox Proportional Hazards Model, which relies on linear relationships, the proposed CNN-Survival may
be better suited to complex survival patterns. In addition, although the sample size is limited (n=98), the
kernels in CNN-Survival can be optimized using another source domain through transfer learning. Thus,
we hypothesized that, CNN-Survival will achieve an acceptable performance using a small resectable
PDAC sample.
Chapter 3: Study 1
Title: Prognostic Value of Transfer Learning Based Features in Resectable Pancreatic Ductal
Adenocarcinoma
Authors:
# Name Affiliations
1 Yucheng Zhang 1,2
2 Edrise M. Lobo-Mueller 3
3 Paul Karanicolas 4
4 Steven Gallinger 2
5 Masoom A. Haider 1,2
6 Farzad Khalvati 1,2
Affiliations
1: Department of Medical Imaging, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
2: Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
3: Sunnybrook Research Institute, Toronto, ON, Canada
4: Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON,
Canada.
3.1 Abstract
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive cancers with an extremely
poor prognosis. Radiomics has shown prognostic ability in multiple types of cancer including PDAC.
However, the prognostic value of traditional radiomics pipelines, which are based on hand-crafted
radiomic features alone, is limited due to multicollinearity of features, the multiple testing problem, and
the limited performance of conventional machine learning classifiers. Deep learning architectures, such as
convolutional neural networks (CNNs), have been shown to outperform traditional techniques in
computer vision tasks, such as object detection. However, they require large sample sizes for training,
which limits their development. As an alternative solution, CNN-based transfer learning has shown the
potential for achieving a reasonable performance using datasets with small sample sizes. In this work,
we developed a CNN-based transfer learning approach for prognostication of overall survival in PDAC
patients. The results showed that the transfer learning approach outperformed the traditional radiomics
model on PDAC
cancer prognosis and improve performance beyond what CNNs can achieve using small datasets.
3.2 Introduction
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive malignancies with poor
prognosis (Adamska et al., 2017b; Eibl, 2015). In resectable patients, clinicopathologic factors, such as
tumor size, margin status at surgery, and histological tumor grade have been studied as biomarkers for
prognosis (Ahmad et al., 2001; Ferrone et al., 2012). However, many of these biomarkers can only be
assessed after the surgery, and the opportunity for patient-tailored neoadjuvant therapy is lost. Recently,
quantitative medical imaging biomarkers have shown promising results in prognostication of the overall
survival rate for PDAC patients (Eilaghi et al., 2017).
As a rapidly developing field in medical imaging, radiomics is defined as the extraction and analysis of a
large number of quantitative imaging features from medical images including CT or MRI (Aerts et al.,
2014; Khalvati, Zhang, Wong, et al., 2019). Some radiomic features have been shown to be significantly
associated with clinical outcomes including overall survival (OS) or recurrences in different cancer sites,
such as lung, renal cell carcinoma, and PDAC (Haider et al., 2017; Y. Huang et al., 2016; Klawikowski,
Christian, Schott, Zhang, & Li, 2016; V. Kumar et al., 2013; Parmar, Leijenaar, et al., 2015; Yucheng
Zhang et al., 2017). Patients can be further dichotomized using those radiomic features into low-risk and
high-risk groups, guiding clinicians to design personalized treatment plans (Aerts et al., 2014). Although
limited work has been done on radiomics in the context of PDAC, recent studies have confirmed its
potential for discovering new quantitative image biomarkers (Eilaghi et al., 2017).
Despite the recent progress, radiomics analytics solutions have limitations. The first limitation is the
multicollinearity among features. Engineered radiomic features are handcrafted and hence, the
driving equations for many of these features are similar, making them highly correlated. As a result,
if one radiomic feature is found to be predictive (or prognostic) for an outcome (i.e., significant), the
similar features will most likely be predictive as well. Consequently, although a large number of
significant features can be found, they are all highly correlated and fail to explain much of the variation
in the outcomes, leading to poor performances.
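The redundancy argument can be illustrated numerically: two features computed from near-identical formulas behave as noisy copies of the same underlying quantity, so their correlation is close to 1 and the second feature adds little independent information (synthetic values, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=500)                  # stand-in for tumour intensity
feat_a = base + 0.05 * rng.normal(size=500)  # e.g. one texture feature...
feat_b = base + 0.05 * rng.normal(size=500)  # ...and a near-duplicate variant

r = np.corrcoef(feat_a, feat_b)[0, 1]
print(round(r, 3))  # close to 1: the two "features" are nearly redundant
```

If feat_a is found significant for an outcome, feat_b will almost certainly be significant as well, which is exactly the pattern described in the paragraph above.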
The second limitation of radiomics is the multiple testing problem. Since thousands of features are tested
at the same time, the chance of encountering false positives increases substantially. Given a p value
threshold of 0.05, testing 100 sets of random numbers against the survival outcome, one would expect to
see five significant features (Type I error). However, many radiomics studies in the literature did not
perform multiple testing control. Therefore, these studies are considered exploratory, and some of the
identified features may be false positives (V. Kumar et al., 2013). These limitations eventually harm the
performance of radiomics based models. Comparatively, deep learning architectures have been shown to
achieve a promising performance for both diagnosis and prognosis.
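The Type I error arithmetic above can be reproduced by simulation: correlating pure-noise "features" with a random outcome at a 0.05 threshold yields roughly 5% false positives. Pearson correlation is used here as a simple stand-in for the univariate Cox test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_patients, n_features = 100, 1000
outcome = rng.normal(size=n_patients)          # random "survival" values

false_positives = 0
for _ in range(n_features):
    feature = rng.normal(size=n_patients)      # pure noise feature
    _, p = stats.pearsonr(feature, outcome)
    if p < 0.05:
        false_positives += 1

print(false_positives)  # on the order of 0.05 * 1000 = 50 chance hits
```

Since none of these features carry real signal, every "significant" hit is a false positive, which is why multiple-testing control (e.g., false discovery rate correction) matters in radiomics studies.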
One of the most well-known architectures for deep learning (neural network) is the convolutional neural
network (CNN) (Schmidhuber, 2014). A CNN performs a series of convolution and pooling operations
to get comprehensive quantitative information from input images. Compared to hand-crafted radiomic
features that are predesigned and fixed, the coefficients of CNN are modified in the training process.
Hence, the final features generated from a CNN are associated with the target outcomes. It has been
shown that deep learning architectures are effective in different medical imaging-related tasks, such as
segmentation for head and neck anatomy and diagnosis for the retinal disease (De Fauw et al., 2018;
Litjens, Kooi, Bejnordi, Setio, et al., 2017; Nikolov et al., 2018).
However, to train a CNN from scratch, millions of parameters (coefficients) need to be tuned. This
requires a large sample size which is not feasible in most medical imaging studies. As an alternative
deep learning solution, transfer learning may be more suitable for medical imaging-related tasks since it
can achieve a comparable performance using limited amounts of data (Shie, Chuang, Chou, & Wu, n.d.).
Network-based transfer learning is defined as taking images from another domain, such as natural
images (ImageNet), to build a pre-trained model and then applying the pre-trained model to the target
images (e.g., CT images of lung cancer) (Ravishankar et al.). The idea of transfer learning is based on
the assumption that the structure of a CNN is similar to the human visual cortex, as both are composed
of layers of neurons (Pan & Yang, 2009). Top layers of CNNs can extract general features from images,
while deeper layers are able to extract information that is more specific to the outcomes. Moreover,
although typical CNN models contain millions of parameters, most of the coefficients belong to the top
layers. In other words, training the top layers requires a larger dataset, while the deeper layers require
less data. Transfer learning utilizes this property, training top layers using large pre-training datasets
while fine-tuning deeper layers using data from the target domain. For example, the ImageNet dataset
contains more than 14 million images. Hence, pre-training a model using this dataset would help the
model learn how to extract general features using initial layers. Given that many image recognition tasks are
similar, top (shallower) layers of the pre-trained network can be transferred to another CNN model. In
the last step, deeper layers of the model will be trained using the target domain images (Torrey &
Shavlik, n.d.). Since the final (deeper) layers are more target specific, fine-tuning them using the target
domain images may help the model to quickly adapt to the target domain, and hence, improve the
performance.
In the medical imaging field, target data is often small, making it impractical to properly fine-tune the
deeper layers. Consequently, in practice, the top (shallower) layers of a pre-trained CNN can be used as
a feature extractor (D. George et al., 2017; Hertel, Barth, Käster, & Martinetz, 2017; Thomaz et al.,
2017). Given that top layers can capture high-level and informative details from images, passing the
target domain images through these layers allows extractions of features. These features can be further
used to train a classifier for the target domain. This unique process enables building a classifier using a
small target domain.
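The feature-extractor usage described here can be sketched with Keras. The cut layer ("conv3_block4_out"), the 140×140 input size, and weights=None are illustrative assumptions to keep the sketch self-contained and offline; in practice weights="imagenet" would load the pre-trained coefficients, and the exact cut layer may differ from the one used in this study:

```python
# Sketch of using a pre-trained CNN's shallower layers as a fixed feature
# extractor in Keras. Layer name, input size, and weights=None are
# illustrative assumptions, not the study's exact configuration.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.ResNet50(
    weights=None, include_top=False, input_shape=(140, 140, 3))
base.trainable = False  # freeze the pre-trained layers: no fine-tuning

# Cut the network at an intermediate ("shallower") block and expose it.
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer("conv3_block4_out").output)

batch = np.zeros((2, 140, 140, 3), dtype="float32")  # two dummy "patients"
features = extractor(batch).numpy()
pooled = features.mean(axis=(1, 2))  # global average pooling -> one vector each
```

The pooled vectors can then train a conventional classifier (e.g., a Random Forest) on the small target cohort, as done later in this chapter.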
As discussed above, single institution PDAC datasets are often small (e.g., <100 cases) and hence, are
not suitable for training CNNs from scratch or finetuning deep layers. In this study, we evaluated the
prognosis performance of two different transfer learning approaches applied to pre-operative CT scans
for resectable PDAC cases and compared their performance to that of the traditional (engineered)
radiomics feature bank.
3.3 Methods
3.3.1 Dataset
Two cohorts from two different hospitals consisting of 68 (Cohort 1) and 30 (Cohort 2) patients were
enrolled in this retrospective study. All patients underwent curative-intent surgical resection for PDAC
from 2007 – 2012 and 2008 – 2013, respectively, and did not receive neo-adjuvant treatment. Pre-operative
portal venous phase contrast-enhanced CT images were used. Overall Survival was collected as the
primary outcome. To exclude the effect of postoperative complications on the prognosis, patients who
died within 90 days after the surgery were excluded. Institutional review board approval was obtained
for this study from both institutions and the need for written informed patient consent was waived.
An in-house developed Region of Interest (ROI) contouring tool (ProCanVAS (Junjie Zhang, Baig,
Wong, Haider, & Khalvati, 2016)) was used by a radiologist with 18 years of experience who completed
the contours blind to the outcome (overall survival). Following the protocol, the slice with the largest
visible cross-section of the tumor was contoured on the portal venous phase. When the boundary of the
tumor was not clear, it was defined by the presence of pancreatic or common bile duct cut-off and
review of pancreatic phase images (Eilaghi et al., 2017). An example of the contour is shown in Figure
3.1 below.
Figure 3.1. A manual contour of CT scan from a representative patient in cohort 2.
PyRadiomics features were extracted using the ROI defined by the radiologist's contour. For transfer
learning feature extraction, we used the same ROI with zero-padding (140×140 pixels in grey scale).
3.3.2 Radiomics feature extraction
Radiomics features were extracted using the PyRadiomics library (van Griethuysen et al., 2017) (version
2.0.0) in Python. To ensure features were extracted from tumor regions exclusively, voxels with
Hounsfield units below -10 or above 500 were excluded so that the presence of fat and stents would not
affect the feature values. The bin width (number of gray levels per bin) was set to 25. In total, 1428
radiomic features were extracted for both cohorts (Cohort 1 and 2). Table 3.1 lists different classes of
features used in this study.
Table 3.1: List of radiomic feature classes and filters

First-order features | Histogram-based features
Second-order texture features | Features extracted from the Gray-Level Co-Occurrence Matrix (GLCM)
Morphology features | Features based on the shape of the region of interest
Filters | No filter, exponential, gradient, logarithm, square, square-root, local binary pattern
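A minimal numpy sketch of the intensity preprocessing described above. The HU window and bin width come from the text; the discretization formula mirrors PyRadiomics' fixed-bin-width scheme, and the ROI values are toy numbers:

```python
import numpy as np

def preprocess_roi(hu_values, low=-10, high=500, bin_width=25):
    """Drop voxels outside the HU window (fat, stents), then discretize the
    remaining intensities with a fixed bin width (binWidth=25 in this study)."""
    voxels = hu_values[(hu_values >= low) & (hu_values <= high)]
    # Fixed-bin-width discretization: bin index relative to the ROI minimum.
    bins = np.floor((voxels - voxels.min()) / bin_width).astype(int) + 1
    return voxels, bins

roi = np.array([-200.0, -5.0, 40.0, 90.0, 450.0, 900.0])  # toy HU values
voxels, bins = preprocess_roi(roi)
print(voxels)  # fat (-200 HU) and stent (900 HU) voxels removed
```

Fixing the bin width (rather than the number of bins) keeps the gray-level spacing comparable across patients, which matters for texture features such as those from the GLCM.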
3.3.3 Transfer learning
We used two pre-trained transfer learning models: an ImageNet pre-trained ResNet (ImgRes) and a
Lung CT pre-trained ResNet (LungRes) (He et al., 2015). Residual Neural Network (ResNet) is a
state-of-the-art deep learning architecture that achieves high classification performance using 34 layers.
The ResNet model avoids the vanishing gradient problem by adding a direct path between layers,
skipping one or more layers in between. This allows a deeper model with better performance.
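The direct path can be written in one line: the block's output is ReLU(F(x) + x), so even if the learned branch F contributes nothing, the input still passes through the shortcut (a toy fully connected version, not the convolutional original):

```python
import numpy as np

def residual_block(x, weights):
    """y = ReLU(F(x) + x): the identity shortcut adds the input back in,
    so the block only has to learn the residual F(x)."""
    fx = weights @ x                 # stand-in for the conv/batch-norm branch
    return np.maximum(0.0, fx + x)   # skip connection, then ReLU

x = np.array([1.0, -2.0, 3.0])
w = np.zeros((3, 3))                 # a branch that has learned nothing: F(x) = 0
print(residual_block(x, w))          # the input still flows through the shortcut
```

Because gradients also flow through the shortcut unchanged, very deep stacks of such blocks remain trainable, which is what allows ResNet's depth.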
Two datasets were used to pre-train the ResNet model. The first one is ImageNet, an image database
containing 14,197,122 images from 21,841 different categories (Deng et al., 2009). The other dataset is
the Lung Cancer dataset, published on Kaggle with CT images from 888 patients (Armato et al., 2011).
The ImageNet pre-trained ResNet is directly available in Keras 2.0, a Python-based deep learning
library. We trained LungRes from scratch using lung CT images.
Transfer learning can be done in multiple ways depending on the sample size and the relationship
between the pre-trained and target domains (Shie, Chuang, Chou, & Wu, n.d.; Gu Kim, Choi, &
Man Ro, n.d.). As shown in Figure 3.2, when the pre-trained and target domains are similar, the
features are usually extracted from the deeper layers.
Comparatively, when the two domains are different (natural images vs. cancer images), the features are
usually extracted from the shallower layers of the pre-trained network.
Figure 3.2. Workflow for transfer learning studies.
A. When the pre-trained and target domains are different.
B. When the pre-trained and target domains are similar.
As previously discussed, depending on the similarities between the pre-trained domain and target
domains, transfer learning can be performed in different ways. Given that our target domain data (PDAC
CT images) is small and different from ImageNet, for the transfer learning architecture using ImgRes,
features were extracted from a shallower layer (i.e., the 12th layer). For LungRes, since the domains are
similar (CT images from NSCLC and PDAC patients), all the ResNet layers were frozen, and features
were extracted from the final layer (i.e., 34th layer) (Breiman, 2001). In total, 2048 ImgRes and 64
LungRes features were generated.
3.3.4 Feature analysis
To study the feature-wise prognostic value of the different feature banks, a univariate Cox Proportional
Hazards Model was used to test the association between clinical outcomes and individual features.
Features with a Wald test p value smaller than 0.05 were considered significant.
In Cohort 1, three prognostic models were built, one per feature bank, using Random Forest classifiers,
which have a built-in feature reduction algorithm for selecting the best prognostic features
by tuning the number of trees and features at each node. The prognostic values of the three models were
evaluated in Cohort 2 using the area under the receiver operating characteristic (ROC) curve (AUC).
Sensitivity tests were applied to assess the differences between the three ROC curves.
Using these features, these three prognostic models can produce survival probabilities for new patients.
These probabilities can be treated as risk scores and tested for their prognostic power using univariate
Cox Proportional Hazards Model in Cohort 2 (test set). Training and validation datasets were collected
from two different institutions, making the validation process robust and minimizing the potential
overfitting. These analyses were done in R (version 3.5.1) using the "caret," "pROC," and "survival"
packages (Matthias Gamer & Matthias Gamer, 2015; Terry & Therneau, 2018).
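The train-on-Cohort-1, validate-on-Cohort-2 workflow can be sketched with scikit-learn in Python (the study's analysis was done in R with "caret" and "pROC"; the matrices below are synthetic stand-ins for the feature banks, with the cohort sizes from this chapter):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
# Synthetic stand-ins: 68 training / 30 test patients, 64 features each.
X_train, X_test = rng.normal(size=(68, 64)), rng.normal(size=(30, 64))
y_train = (X_train[:, 0] + rng.normal(size=68) > 0).astype(int)  # toy label
y_test = (X_test[:, 0] + rng.normal(size=30) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                       # trained on "Cohort 1"

risk_scores = model.predict_proba(X_test)[:, 1]   # per-patient risk score
auc = roc_auc_score(y_test, risk_scores)          # evaluated on "Cohort 2"
print(round(auc, 2))
# The risk_scores would then enter a univariate Cox model as a covariate.
```

Keeping the test cohort from a different institution, as done here, guards against the overfitting that plagues small-sample radiomics studies.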
3.4 Results
3.4.1 Feature-wise prognostic values
To determine the prognosis value of features from different feature extraction methods, the associations
between individual features and the overall survival were tested using the Wald test in univariate Cox
Proportional Hazards Model in Cohort 1. Among the 1,428 PyRadiomics features, 283 had significant
p values (p < 0.05). Details of these 283 features are listed in Table A-2 in the Appendix. Among the
2,048 ImgRes features, 49 had a p value smaller than 0.05. Lastly, of the 64 LungRes features, only 2
were significant.
It is interesting to observe that with respect to feature-wise performance, the PyRadiomics library has a
higher ratio of significant features than those of ImgRes and LungRes feature banks (0.20 vs. 0.024 and
0.031, respectively). However, a high number of significant features does not necessarily lead to a high-
performance prognostic model since many of these features may be correlated. Thus, testing the
performance of the feature banks on a different dataset (i.e., test) is necessary.
3.4.2 Prognostic model performance
To compare the prognostic performance of each of the feature extraction methods for overall survival
for PDAC patients, the prognostic models were trained using all features extracted from Cohort 1 and
tested in Cohort 2 using a Random Forest classifier. When using the PyRadiomics feature bank, the
Random Forest model yielded an area under the receiver operating characteristic (ROC) curve (AUC) of
0.57. Using ImgRes feature bank, the model achieved an AUC of 0.71. Finally, using LungRes feature
bank, the AUC reached 0.74.
The AUCs of both transfer learning methods are higher compared to that of PyRadiomics. Comparing
the ROC curves using the sensitivity test (DeLong, DeLong, & Clarke-Pearson, 1988), there was no
significant difference between ROCs of PyRadiomics vs. ImgRes and ImgRes vs. LungRes.
Nevertheless, LungRes feature bank had significantly higher performance than that of PyRadiomics
feature bank with a p value of 0.03. This result indicates that the transfer learning model based on lung
CT images (LungRes) significantly improves the prognostic performance of the model compared to
traditional radiomics methods (e.g., PyRadiomics). Figure 3.3 shows the ROC curves for the three models.
Figure 3.3. A: ROC curve using PyRadiomics feature bank only (AUC = 0.57), B: ROC curve with
ImgRes feature bank (AUC = 0.71), C: ROC curve for LungRes feature bank (AUC = 0.74).
3.4.3 Risk score
Risk scores were generated by the three prognostic models for patients in Cohort 2. In a univariate Cox
Proportional Hazards model, the risk scores of the PyRadiomics and ImgRes prognostic models had p
values of 0.23 and 0.253, respectively. The LungRes prognostic model was the best, yielding a p value of
0.0395 for its risk score, indicating that a transfer learning architecture pre-trained on lung cancer images
can produce a prognostic risk score for PDAC patients. The hazard ratios (HR) and confidence intervals
(CI) for the risk scores generated by the PyRadiomics, ImgRes, and LungRes prognostic models were HR =
1.41 (CI: 0.80 – 2.55), HR = 1.31 (CI: 0.81 – 2.12), and HR = 1.78 (CI: 1.34 – 2.35), respectively (Table
3.2). When patients in Cohort 2 are dichotomized into high-risk and low-risk groups using these risk scores,
the LungRes transfer learning prognostic model yields the best separation of survival patterns.
Figure 3.4 shows the Kaplan–Meier plots for the risk scores of the PyRadiomics, ImgRes, and LungRes
prognostic models.
Figure 3.4. Kaplan-Meier plots for OS in Cohort 2.
A. PyRadiomics based risk score (P=0.23)
B. ImgRes based risk score (P=0.253)
C. LungRes based risk score (P=0.0395)
Table 3.2: List of hazard ratios and p values for risk scores for prognostication of overall survival in the
validation cohort
Prognostic Model                p value       Hazard Ratio (HR) and Confidence Interval (CI)
Engineered Radiomic Features    P = 0.23      HR = 1.41 (CI: 0.80 – 2.55)
ImgRes                          P = 0.253     HR = 1.31 (CI: 0.81 – 2.12)
LungRes                         P = 0.0395    HR = 1.78 (CI: 1.34 – 2.35)
Abbreviations: CI: confidence interval; ImgRes: deep transfer learning model pre-trained on ImageNet
(natural images); LungRes: deep transfer learning model pre-trained on lung CT images.
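The risk-group analysis above (dichotomizing Cohort 2 at a risk-score threshold and comparing survival patterns) can be sketched as follows. This sketch uses synthetic survival data and a log-rank test implemented directly in NumPy as a simple stand-in for the group comparison; the thesis itself reports univariate Cox p values and hazard ratios, which this sketch does not reproduce:

```python
import numpy as np
from scipy.stats import chi2

def logrank_p(time, event, group):
    """Two-group log-rank test; returns the chi-square p value."""
    time, event, group = map(np.asarray, (time, event, group))
    O1 = E1 = V = 0.0
    for t in np.unique(time[event == 1]):        # each distinct event time
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()   # deaths at t, both groups
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O1 += d1                                  # observed deaths, group 1
        E1 += d * n1 / n                          # expected deaths, group 1
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = (O1 - E1) ** 2 / V
    return chi2.sf(stat, df=1)

rng = np.random.default_rng(1)
risk = rng.normal(size=30)                          # model risk scores (synthetic)
time = rng.exponential(scale=20, size=30) * np.exp(-0.5 * risk)  # months; higher risk, shorter survival
event = (rng.random(30) < 0.7).astype(int)          # 1 = death observed, 0 = censored
high = (risk > np.median(risk)).astype(int)         # median split into risk groups
p = logrank_p(time, event, high)
print(f"log-rank p = {p:.3f}")
```

In practice the split threshold and the choice of test (log-rank vs. Cox) affect the reported p value, so the synthetic numbers here are not comparable to the study's results.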
3.5 Discussion
In this study, we developed and compared three prognostic models for overall survival in resectable
PDAC patients using the PyRadiomics feature bank and deep radiomics feature banks pre-trained on natural
images and lung CT images. The lung CT pre-trained transfer learning model achieved significantly better
prognostic performance than the traditional radiomics approach. The PyRadiomics feature bank had
a higher proportion of significant features than the two transfer learning feature extractors
(20% vs. 2.4% and 3.1%). However, these features are correlated, and the higher number of significant
features is largely due to multicollinearity among the engineered features. Hence, the majority of
these hand-crafted features carry redundant predictive information (Toloşi & Lengauer, 2011). In
addition, due to the multiple testing problem, some significant features may be false positives and thus
fail to provide prognostic information to the model. These two shortcomings of engineered
radiomic features (multicollinearity and the multiple testing problem) become more acute when a prognostic
model is built using all features. As a result, the final risk score produced by the model is not prognostic
of the outcome (e.g., P = 0.23). The risk score generated by the transfer learning model pre-trained on
natural images is not significant either (P = 0.253). This was expected due to the substantial difference
between natural images and PDAC CT images. The best prognostic performance was achieved by the
transfer learning model pre-trained on lung CT images, with a p value of 0.0395. This indicates that a
pre-trained CNN, acting as a feature extractor, can generate informative features and provide
prognostic information. It is worth noting that the HR for the LungRes risk score is higher than that of
CA19-9 in PDAC prognosis.
This study showed the potential of transfer learning in a typical small-sample setting. If Cohort 1 (PDAC
cases alone) were used to train a CNN from scratch with no pre-training, and the model were then tested
on Cohort 2, the final output would not provide any prognostic value (AUC of ~0.50). Transfer learning,
unlike conventional deep learning methods which need large datasets, can achieve acceptable performance
using a limited number of samples, making it suitable for most medical imaging studies. As the power of
quantitative medical imaging via deep learning is recognized in the research community, imaging
data are rapidly growing. Nevertheless, the amount of data required to train a CNN from scratch and
achieve meaningful results is far beyond the capacities of most existing databases. Thus, transfer
learning can play a key role in applying deep learning to medical imaging studies.
As a powerful prognostic model, deep transfer learning is not limited to predicting binary survival;
it can also be used to predict patients' outcomes for given time intervals (e.g., 5 years). Although we
used the Cox Proportional Hazards Model on the risk score and reported hazard ratios, this was done
as a separate, post hoc step; the final prognostic model itself only provides binary prognostications. In
follow-up studies, we aim to integrate the Cox Proportional Hazards Model into the deep transfer learning
approach to enable simultaneous training of both the Cox Proportional Hazards Model and the transfer
learning model based on binary outcome and survival time data. Such a prognostic model should have
improved performance compared to the existing model, since the features it generates would be
associated not only with the binary outcome but also with the survival duration. Recent work on such
models using conventional CNNs (e.g., DeepSurv (Katzman et al., 2016)) confirms the
potential of the proposed model.
Although deep transfer learning outperforms the engineered radiomics model, one must not assume that
radiomic features should be discarded altogether. In fact, these hand-crafted features have been shown to
be prognostic of survival in different cancer sites (Aerts et al., 2014; Gillies et al., 2015; Parekh &
Jacobs, 2016). Thus, in future studies, using feature fusion techniques that combine engineered radiomic
features with deep transfer learning features has merit. Feature fusion is a technique to fuse two sets of
features while retaining their information (Mangai, Samanta, Das, & Chowdhury, 2010). It has been
shown that feature fusion can further improve the prediction accuracy in image classification tasks (Sun,
Zeng, Liu, Heng, & Xia, 2005). An optimal feature fusion method which combines engineered radiomic
features with deep transfer learning features may further improve the overall performance of the
prognostic model.
One limitation of the present study is the small dataset of the target domain (PDAC). A larger dataset
would allow us to further investigate the effectiveness of transfer learning and whether there exists a
threshold for data size to improve performance for the transfer. In future work, using a larger dataset, we
will address this research question, which will deepen our understanding of deep learning and its
applicability to medical imaging for prognostication of cancer.
3.6 Conclusion
Deep transfer learning has the potential to improve the performance of prognostication for cancers with
limited sample sizes such as PDAC. In our resectable PDAC cohorts, deep transfer learning models
outperformed conventional, engineered radiomic models.
Chapter 4: Study 2
Title: Improving Prognostic Performance through Radiomics and Deep Learning Features Fusion
in Resectable Pancreatic Ductal Adenocarcinoma
Authors:
# Name Affiliations
1 Yucheng Zhang 1,2
2 Edrise M. Lobo-Mueller 3
3 Paul Karanicolas 4
4 Steven Gallinger 2
5 Masoom A. Haider 1,2
6 Farzad Khalvati 1,2
Affiliations
1: Department of Medical Imaging, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
2: Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
3: Sunnybrook Research Institute, Toronto, ON, Canada
4: Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON,
Canada.
4.1: Abstract
Radiomics, as an analytic pipeline for quantitative imaging feature extraction and analysis, has grown
rapidly in the past few years. Recent radiomics studies aimed to investigate the relationship between
tumor imaging features and clinical outcomes. Open-source radiomics feature banks enable the
extraction and analysis of thousands of pre-defined features. On the other hand, deep learning
approaches have also shown their potential in the quantitative medical imaging field, providing even
more imaging features. However, the high dimensionality of features in medical imaging studies has become
an obstacle due to multicollinearity and multiple testing problems. In this study, CT images from
resectable Pancreatic Ductal Adenocarcinoma (PDAC) patients were used to compare the prognostic
performance of common feature reduction and fusion methods. We show that the risk score-based
feature fusion and reduction method significantly improves the prognostic performance for overall
survival in resectable PDAC cohorts, elevating the area under the ROC curve (AUC) from 0.74 to 0.83.
4.2: Introduction
Radiomics features are designed to decode the predictive information in medical images for cancer
patients. As a quantitative approach, radiomics involves the extraction and analysis of quantitative
medical imaging features and establishing correlations between these features and clinical outcomes
such as patient survival (Aerts et al., 2014; Khalvati, Zhang, Wong, et al., 2019; V. Kumar et al., 2013).
Several radiomic features have been found to be significantly associated with various clinical outcomes
in multiple cancer sites such as lung, pancreas, and kidney (Aerts et al., 2014; Eilaghi et al., 2017;
Gillies et al., 2015; Haider et al., 2017; Oikonomou, Khalvati, & et al., 2018; Parmar, Leijenaar, et al.,
2015; Yip & Aerts, 2016).
In the past few years, the pipeline for traditional radiomics analysis has been established (Parekh &
Jacobs, 2016). As discussed in Chapter 1, the traditional pipeline consists of four steps: image
acquisition, segmentation, feature extraction, and model building. The core of traditional radiomics
studies relies on the extraction of a set of engineered and hand-crafted features based on pre-defined
mathematical formulas. These engineered features, which are extracted from regions of interest
annotated by clinicians, have been designed to capture different characteristics of images. For example,
the first order features measure the distribution of pixel intensities while second-order features based on
grey-level co-occurrence matrix (GLCM) extract texture information. Efforts have been made to
standardize the feature banks by implementing open source libraries such as PyRadiomics (van
Griethuysen et al., 2017). In this feature bank, thousands of engineered features from different classes of
features can be extracted from 2D or 3D medical images. These features can be tested for associations
with clinical outcomes such as overall survival, recurrence, or genetic mutations (Mazurowski, 2015).
Several cross-cohort and multi-centre studies have shown that several PyRadiomics features are robust to
different scanners and clinician annotations (Aerts et al., 2014; Khalvati, Zhang, Baig, et al., 2019; B.
Zhao et al., 2016).
Despite the recent progress, the traditional radiomics analytics pipeline has a few drawbacks. First, the
formulas of the features are pre-defined and can be very similar, which leads to high correlations among
different features. As a result, if a feature is found to be significantly associated with a certain clinical
outcome, highly correlated features are more likely to be significant as well. Consequently, while the
high dimensionality of the features increases the complexity and computational power requirements, there is no
corresponding increase in prognostic performance. Second, testing radiomic features one by one
increases the chance of producing false positives. Several radiomics studies lack multiple testing control,
and hence some discovered significant features may be the result of type I errors (Yip & Aerts, 2016).
Third, many hand-crafted features were not specifically designed for medical images and related
tasks. For different medical imaging modalities and tumor phenotypes, hand-crafted features lack the
flexibility to adapt to various images and clinical outcomes. These shortcomings in the traditional
radiomic analytics pipeline have inspired new research which takes advantage of the recent impressive
progress in deep learning and convolutional neural networks to improve the performance of the
predictive models.
Convolutional neural networks (CNNs) are one of the most frequently used deep learning architectures
in computer vision tasks (Krizhevsky et al., 2012). CNNs apply a series of convolution operations on
input images preserving the spatial relationship between pixels and mapping these relationships on to
outputs. During the training phase, parameters of the convolution operations are tuned. Consequently,
the kernels will be updated, so that they can capture information specifically related to the classification
task (e.g., outcome prediction) at hand. In medical imaging, this allows researchers to generate
customized feature maps for specific modality or diseases, and further improves performance (R.
Yamashita et al., 2018). However, training these parameters requires a large sample size, which is
usually not available in a typical medical imaging research setting. To overcome this limitation, transfer
learning-based feature extraction has been proposed (Pan & Yang, 2010).
Transfer learning was developed based on the assumption that the structure of CNNs is similar to the
mechanism of the human visual cortex (Ravishankar et al.). The top layers of a CNN extract general
features from images, while the deeper layers are more specific to the target. Pre-training CNNs on
large image datasets such as ImageNet helps the model learn how to extract general features. Since
many image recognition tasks are similar, the top layers of the network can be transferred to another
target domain (Tan et al., 2018). On the other hand, deeper layers of CNNs extract "higher-order"
information which is associated with the target outcome. Thus, if the target domain is similar to the
pre-trained domain, deeper layers can be transferred to extract features.
In practice, depending on the level of similarity between the target domain and the source domain,
transfer learning features can be extracted from different layers. Training classification models using
those transfer learning features generally requires smaller sample sizes. As discussed above, training a
CNN from scratch requires a large sample size and long computation time. Comparatively, transfer
learning offers a solution and enables the application of CNNs in the medical imaging domain.
Deep learning and transfer learning-based feature extraction have shown promising results in cancer
assessment (Lao et al., 2017). Several radiomic features are widely recognized for their effectiveness in
cancer prognosis as well (Aerts et al., 2014; Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al., 2019;
Oikonomou et al., 2018). Furthermore, it has also been shown that, combining pre-defined features with
deep learning-based features further improved the performance (Lao et al., 2017). Hence, it is crucial to
develop a feature reduction method which can fuse the predictive power of deep radiomics with pre-
defined radiomic features to achieve optimal performance.
Traditional feature reduction methods can be classified into two groups: supervised and unsupervised
feature reduction. The main difference between the two is that, unsupervised methods reduce features
based on the characteristics of features regardless of the outcome. Comparatively, supervised methods
rely on the association between features and the outcome.
Principal Component Analysis (PCA) is a common unsupervised feature reduction method, which uses
an orthogonal transformation to convert a set of observations of possibly correlated variables into a set
of linearly uncorrelated variables, known as principal components (Abdi & Williams, 2010). These
components explain most of the variation in the original features, retaining that information while
reducing the number of features.
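The PCA reduction described above can be sketched with scikit-learn. This is a minimal sketch on synthetic data; the 95% explained-variance threshold is an illustrative assumption, not a value taken from the study:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(68, 300))                   # 68 patients x 300 features
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=68)   # inject a redundant feature

# Standardize, then keep the components explaining 95% of total variance
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
components = pca.fit_transform(Z)
print(components.shape)   # far fewer columns than the original 300
```

Note that PCA never looks at the outcome, which is exactly why it is classed as unsupervised here; the components with the largest variance are not necessarily the most prognostic ones.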
For binary outcomes, supervised feature selection methods usually compare the distributions of features
for the positive and negative groups. If the two groups differ significantly in their values, the feature is
considered predictive. As a supervised method, the Boruta algorithm, a wrapper built around the
Random Forest classification algorithm (Kursa & Rudnicki, 2010), tries to capture all the important
features with respect to an outcome. First, it duplicates the dataset and shuffles the values in each
column, generating random "shadow" features with a distribution similar to that of the original features.
Then, it tests the performance of these random features. The best-performing random feature is set as a
benchmark, and all the real features performing worse than this benchmark are eliminated
(Kursa & Rudnicki, 2010). After several iterations, a set of significant features is generated. Although
the multiple testing issue is inevitable in supervised feature reduction algorithms (Yip & Aerts, 2016),
models based on the Boruta feature selection method are less prone to this problem since it has built-in
multiple testing corrections.
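The shadow-feature idea behind Boruta can be sketched as follows. This is a deliberately simplified, single-pass version written with plain scikit-learn (the study used the actual Boruta implementation, which repeats this comparison over many iterations with statistical testing); all data are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p = 68, 20
X = rng.normal(size=(n, p))
# Only columns 0 and 1 carry signal in this toy outcome
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

# "Shadow" features: each column shuffled independently, destroying any
# association with y while preserving each feature's distribution
shadows = rng.permuted(X, axis=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

real_imp = rf.feature_importances_[:p]
best_shadow = rf.feature_importances_[p:].max()   # benchmark importance
selected = np.flatnonzero(real_imp > best_shadow)  # keep features beating it
print("selected feature indices:", selected)
```

Real Boruta repeats this shuffle-fit-compare loop and applies a binomial test with multiple-testing correction before confirming or rejecting each feature, which is what makes it more robust than this one-shot sketch.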
In this paper, first, we compare the performance of three feature reduction methods: PCA, Boruta, and
Cox Proportional Hazards Model (CPH). These are applied to the combined feature set of pre-defined
and deep radiomic features. We then propose a feature reduction and fusion method, which combines
the predictive power of pre-defined and deep radiomic features and produces a single risk score. Our
results illustrate that the proposed feature fusion and reduction method significantly improves the
performance of the model for the prognostication of overall survival of PDAC patients when compared
to traditional feature reduction models (PCA, Boruta, and CPH).
4.3 Methods
4.3.1 Dataset
Two cohorts from two different hospitals consisting of 30 and 68 patients were enrolled in this
retrospective study. All patients underwent curative-intent surgical resection for PDAC (from 2007 to
2012 and 2008 to 2013, respectively) and did not receive neo-adjuvant treatment. Contrast-enhanced CT images
were obtained pre-operatively. Overall survival data were collected as the primary outcome. To exclude
the effect of post-operative complications on the prognosis, the patients who died within 90 days after
surgery were excluded. Institutional review board approval was obtained for this study from both
institutions. An in-house developed region of interest (ROI) contouring tool (ProCanVAS) was used by
an experienced radiologist (Junjie Zhang et al., 2016). The reader contoured the ROIs blind to the
outcome. A cohort with 68 patients from one institution was used as the training cohort while another
cohort with 30 patients from a different institution was used as the test cohort.
4.3.2 Radiomics Feature Extraction
Pre-defined radiomic features were extracted using the PyRadiomics library (version 2.0.0) in
Python (van Griethuysen et al., 2017). To ensure that features were extracted exclusively from tumor
regions, voxels with Hounsfield unit (HU) values below -10 or above 500 were excluded, eliminating fat
and stents from the feature values. In total, 277 radiomic features were extracted for both cohorts. Details
of these features are listed in Table 4.1 below.
Table 4.1: Number of features extracted from different filters
Image filter    First order    GLCM
Original        18             23
Logarithm       18             23
Square root     18             23
Square          18             23
LBP-2D          56             0
Gradient        18             23
Exponential     16             0
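The HU-based exclusion described in Section 4.3.2 can be sketched in NumPy on a toy volume. In the actual pipeline this thresholding was applied as part of PyRadiomics extraction; the standalone version below only illustrates the masking logic, and the volume, ROI placement, and HU distribution are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
ct = rng.normal(60, 200, size=(5, 64, 64))   # toy CT volume in HU (slices, rows, cols)
roi = np.zeros_like(ct, dtype=bool)
roi[2, 20:40, 20:40] = True                  # clinician-drawn tumor ROI on one slice

# Keep only the soft-tissue HU range inside the ROI, dropping fat (very low
# HU) and stents (very high HU) from the voxels used for feature extraction
valid = (ct >= -10) & (ct <= 500)
tumor_mask = roi & valid
tumor_voxels = ct[tumor_mask]                # values fed to feature extraction
print(tumor_voxels.size, "voxels kept out of", roi.sum())
```

PyRadiomics exposes the same behaviour declaratively via its `resegmentRange` setting, which restricts the mask to an intensity range before features are computed.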
4.3.3 Transfer Learning Feature Extraction
We used two transfer learning models: the ImageNet pre-trained ResNet (He, Zhang, Ren, &
Sun, 2016) (ImgRes) and the lung CT pre-trained ResNet (LungRes) (He et al., 2015). ResNet (He et
al., 2016) (Keras-inception-resnet-v2) was chosen since it is a state-of-the-art deep learning architecture
with high classification performance. By adding shortcut connections, the ResNet model avoids the
vanishing gradient problem and achieves better performance.
Two datasets were used to pre-train the ResNet model. The first one is ImageNet (Deng et al., 2009),
which contains 14,197,122 natural images from 21,841 different categories. The second dataset is Non-
Small Cell Lung Cancer (NSCLC) dataset, which was published on Kaggle with CT images from 888
patients (Aerts et al., 2014). ImageNet pre-trained ResNet was directly available in Keras 2.0 which is a
python- based deep learning library (Chollet & Others, 2015). The LungRes CNN was trained from
scratch using the lung CT images.
The process of transfer learning varies depending on the similarity of the pre-trained domain and target
domain. Since our target domain (pancreatic CT) is small and different from the pre-trained domain
(ImageNet, natural images), during the transfer learning process using ImgRes, features were extracted
from a shallower layer (the 12th layer). For LungRes, since the pre-trained and target domains are rather
similar (lung and pancreatic CT), features were extracted from the final layer before the classifier. In
total, 2,048 ImgRes features and 64 LungRes features were extracted.
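Extracting features from a chosen layer of a pre-trained network can be sketched in Keras. This sketch uses ResNet50 rather than the inception-resnet-v2 model of the study, and `weights=None` (random weights) so it runs without downloading a checkpoint; for actual transfer learning one would pass `weights="imagenet"` or load a lung-CT-trained checkpoint. The layer index 12 mirrors the "12th layer" mentioned above but is illustrative for this architecture:

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

# Backbone without the classification head; global average pooling yields a
# fixed 2048-dimensional vector per image, as in the LungRes setup above
base = ResNet50(weights=None, include_top=False,
                input_shape=(224, 224, 3), pooling="avg")

# Deep features: final layer before any classifier (similar source/target)
deep_extractor = Model(inputs=base.input, outputs=base.output)

# Shallow features: tap an early layer when source/target domains differ
shallow_extractor = Model(inputs=base.input, outputs=base.layers[12].output)

batch = np.random.rand(2, 224, 224, 3).astype("float32")  # stand-in CT slices
deep_feats = deep_extractor.predict(batch, verbose=0)
shallow_feats = shallow_extractor.predict(batch, verbose=0)
print(deep_feats.shape)   # (2, 2048)
```

The choice of tap point is the practical knob described in the text: shallower layers give generic edge/texture responses, deeper layers give representations shaped by the pre-training task.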
4.3.4 Correlation
To investigate the correlation between the features extracted using traditional radiomics pipeline
(PyRadiomics) and transfer learning approaches (ImgRes, and LungRes), Pearson correlation
coefficients were calculated for each pair of feature sets in the training cohort (n=68) (Sedgwick, 2012).
Mean correlation coefficient was calculated for each combination of the three different feature
extraction methods (PyRadiomics, ImgRes, and LungRes). The distributions of the correlation
coefficients were also calculated.
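The cross-bank correlation computation above can be sketched as follows: z-score each bank over patients, form the full cross-correlation matrix by a single matrix product, and average the absolute coefficients. The bank dimensions follow the study (277 PyRadiomics, 2,048 ImgRes features over n = 68 patients), but the values are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
pyrad = rng.normal(size=(68, 277))    # PyRadiomics bank (stand-in values)
imgres = rng.normal(size=(68, 2048))  # ImgRes bank (stand-in values)

def mean_abs_cross_corr(a, b):
    """Mean |Pearson r| over all feature pairs drawn from the two banks."""
    az = (a - a.mean(axis=0)) / a.std(axis=0)   # z-score each feature
    bz = (b - b.mean(axis=0)) / b.std(axis=0)
    r = az.T @ bz / a.shape[0]                  # (277, 2048) correlation matrix
    return np.abs(r).mean()

m = mean_abs_cross_corr(pyrad, imgres)
print(f"mean |r| = {m:.3f}")
```

Within-bank means (the diagonal blocks of Table 4.2) follow from calling the same function with the bank paired with itself, after masking out each feature's trivial self-correlation of 1.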
4.3.5 Proposed Prognosis Model
To investigate the optimal feature reduction and fusion method, we trained four prognosis models using
CT images from Cohort 1 (n=68) and validated them in Cohort 2 (n=30). Figures 4.1-A, 4.1-B, and 4.1-
C show the prognosis models using three traditional feature reduction algorithms: PCA, CPH, and
Boruta. In each model, the three feature banks (PyRadiomics, ImgRes, and LungRes) were concatenated.
Then, the feature reduction algorithm was applied to these features. The remaining features
were used to train a Random Forest classifier on the training cohort, and the derived model was
validated on the test cohort, which was collected at an independent hospital site. For the CPH method, a p
value < 0.05 was used as the feature selector.
Our proposed risk score-based method is illustrated in Figure 4.1-D. First, using the training cohort,
three different Random Forest models were trained separately using each of the three feature banks
(PyRadiomics, ImgRes, and LungRes). Each of these models was then used to produce the probability
for every patient in the training cohort through 10-fold cross-validation. We treated these probabilities as
new features, based on which, the final prognosis model was built through another Random Forest
classifier. In testing, for each patient, three probabilities were generated using the three models. Next,
these probabilities were fed into the final prognosis model, which provided the final risk score.
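The risk score-based fusion described above can be sketched with scikit-learn: one Random Forest per feature bank, out-of-fold probabilities from 10-fold cross-validation as the new "risk-score" features, and a final Random Forest stacked on top. Feature-bank dimensions follow the study, but all values, labels, and hyperparameters are synthetic or illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
y_train = np.tile([0, 1], 34)                       # 68 training outcomes
banks_train = {"PyRadiomics": rng.normal(size=(68, 277)),
               "ImgRes": rng.normal(size=(68, 2048)),
               "LungRes": rng.normal(size=(68, 64))}
banks_test = {k: rng.normal(size=(30, v.shape[1]))  # 30 test patients
              for k, v in banks_train.items()}

base = {k: RandomForestClassifier(n_estimators=100, random_state=0)
        for k in banks_train}

# Out-of-fold probabilities on the training cohort become the new features,
# avoiding the leakage of fitting and predicting on the same patients
train_scores = np.column_stack([
    cross_val_predict(base[k], banks_train[k], y_train,
                      cv=10, method="predict_proba")[:, 1]
    for k in banks_train])

final = RandomForestClassifier(n_estimators=100, random_state=0)
final.fit(train_scores, y_train)

# At test time, each base model (refit on the full training set) contributes
# one probability; the stacked model maps the three scores to a final risk
for k in banks_train:
    base[k].fit(banks_train[k], y_train)
test_scores = np.column_stack([base[k].predict_proba(banks_test[k])[:, 1]
                               for k in banks_train])
risk = final.predict_proba(test_scores)[:, 1]       # final risk per patient
print(risk.shape)
```

Using out-of-fold rather than in-sample probabilities for the stacking step is the standard precaution; otherwise the final model would be trained on overly optimistic base-model outputs.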
Figure 4.1. Pipelines for different feature fusion methods.
A. Unsupervised feature fusion using PCA. Features from the three feature banks are fused using PCA,
generating a few components. These components are then used to build a model in the training cohort,
and the performance of the model is evaluated in the validation cohort.
B. Supervised feature reduction using Boruta. Boruta identifies prognostic features, which are used to
build a prognosis model in the training dataset; its performance is validated in the testing cohort.
C. Supervised feature reduction using Cox regression. Each feature is tested using univariate Cox
regression. Significant features are used to build a prognosis model, which is validated in the
validation cohort.
D. Risk score-based feature fusion. Three prognosis models are built using features from the three
feature banks. The prediction outputs of these models are treated as risk scores, so every
patient has three risk scores. Another model is then trained on these risk scores in the
training set and validated in the testing cohort.
The area under the ROC curve (AUC) (Fawcett, 2005) was used to measure the performance of these four
approaches. The DeLong test was applied to test the difference between the AUCs of
the different models (DeLong et al., 1988). These analyses were performed with the "pROC" package in R
(Version 3.5.1).
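The study ran the AUC comparison with the DeLong test via R's pROC. A simple alternative way to gauge whether two models' AUCs differ on the same test patients is a paired bootstrap, sketched below in Python; note this is a stand-in illustration, not the DeLong test itself, and the scores are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff_p(y, p1, p2, n_boot=2000, seed=0):
    """Two-sided paired-bootstrap p value for AUC(p1) == AUC(p2)."""
    rng = np.random.default_rng(seed)
    y, p1, p2 = map(np.asarray, (y, p1, p2))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample patients with replacement
        if len(np.unique(y[idx])) < 2:          # AUC needs both outcomes present
            continue
        diffs.append(roc_auc_score(y[idx], p1[idx])
                     - roc_auc_score(y[idx], p2[idx]))
    diffs = np.asarray(diffs)
    # How far 0 sits in the bootstrap distribution of the AUC difference
    return min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))

rng = np.random.default_rng(1)
y = np.tile([0, 1], 15)                    # 30 test outcomes
strong = y + rng.normal(0, 0.6, 30)        # better-separating model score
weak = y + rng.normal(0, 2.0, 30)          # noisier model score
p = bootstrap_auc_diff_p(y, strong, weak)
print(f"p = {p:.3f}")
```

The bootstrap is paired because the same resampled patients are used for both score vectors, mirroring the correlated-ROC setting that the DeLong test handles analytically.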
4.4 Results
4.4.1 Correlation Analysis Between Pre-defined and Deep Radiomic Features
Within each feature bank, the average absolute Pearson correlation coefficient of the 277
PyRadiomics features was 0.32, while ImgRes (2,048 features) and LungRes (64 features) had mean
correlations of 0.24 and 0.27, respectively. This shows that PyRadiomics features are more highly
correlated with each other than the deep radiomic features. The cross-correlation of PyRadiomics
and ImgRes features yielded a mean absolute coefficient of 0.18, which was the same for PyRadiomics
vs. LungRes features. The two deep transfer learning-based feature banks (ImgRes and LungRes) had a
slightly higher mean correlation coefficient of 0.22. Table 4.2 below summarizes the correlation results.
Table 4.2: Absolute Pearson correlation coefficient between features from each feature extraction
method
PyRadiomics (277) ImgRes (2048) LungRes (64)
PyRadiomics (277) 0.32 0.18 0.18
ImgRes (2048) 0.18 0.24 0.22
LungRes (64) 0.18 0.22 0.27
Figure 4.2. Correlation heatmap of three different feature extraction methods.
The heatmap in Figure 4.2 shows the correlation details. Each dot represents a correlation
coefficient: white means the coefficient is 0, while red and green dots represent positive and
negative correlations, respectively. There are colour blocks in the PyRadiomics vs. PyRadiomics region,
indicating high correlation among the PyRadiomics features. The colour is lighter in the ImgRes vs.
PyRadiomics and LungRes vs. PyRadiomics regions, showing that the correlation coefficients are lower
across these feature banks.
The distributions of the correlation coefficients (in absolute value) are also displayed as histograms
in Figure 4.3 for PyRadiomics vs. ImgRes and PyRadiomics vs. LungRes. As illustrated by the skewed
distributions, most pre-defined and deep radiomic features have no or weak correlation with each
other. However, a few features are highly correlated, with coefficients above 0.7
(Mukaka, 2012). This result indicates that some deep transfer learning features (deep radiomic
features) can resemble properties of certain pre-defined radiomic features. As an example, the ImgRes
feature “v620” had a correlation coefficient of 0.86 with PyRadiomics feature
“gradient_firstorder_RootMeanSquared”, and 0.83 with “gradient_firstorder_TotalEnergy”.
Figure 4.3. Histogram of Pearson correlation coefficients.
A. Correlation coefficients from PyRadiomics and ImgRes.
B. Correlation coefficients from PyRadiomics and LungRes
4.4.2 Prognosis Performance of the Proposed Prognosis Model
The performances of the three feature reduction methods (PCA, Boruta, and CPH) were compared
to that of the proposed risk score-based prognosis model (see the pipelines in Figure 4.1).
The PCA method generated 41 components to represent the variance in the original 2,389 features of the
combined PyRadiomics, ImgRes, and LungRes feature banks. The Boruta method selected 2 features in
1,000 iterations, with a p value cut-off of 0.1. The CPH method identified 115 features associated with
overall survival in the training cohort: 55 of them belong to the PyRadiomics feature bank, 58 were
extracted using ImgRes, and LungRes contributed another two. The proposed risk score-based model
generated a single risk score using the probabilities of the three Random Forest classifiers trained
individually on the PyRadiomics, ImgRes, and LungRes feature sets. The AUC for each method was
calculated on the test cohort.
The AUCs for the PCA, Boruta, and CPH methods were 0.72, 0.56, and 0.66, respectively. The proposed risk
score-based method produced the highest AUC of 0.83. Comparing the methods using the DeLong
test, the performance of the proposed risk score-based method was significantly higher than that of the
PCA (p value = 0.049), Boruta (p value = 0.0015), and Cox regression methods (p value = 0.015). The
results suggest that a stacking model, which is based on probabilities calculated by multiple individual
small models, gives the best performance. The ROC curves for the three
traditional feature reduction methods (PCA, Boruta, and CPH) and the proposed risk score-based model
are shown in Figure 4.4, and the results are summarized in Table 4.3.
Table 4.3: Summary table for models using four feature reduction methods.
                                               PCA      Boruta     Cox-Regression    Risk-score
AUC                                            0.72     0.56       0.66              0.83
p value (vs. the ROC of the Risk-score method) 0.049    0.0015     0.015             -
Figure 4.4. ROC curves of models using four feature reduction methods.
A. ROC curve for PCA based fusion method, AUC = 0.72.
B. ROC curve for Boruta based feature reduction method, AUC = 0.56.
C. ROC curve for CPH based feature reduction method, AUC = 0.66.
D. ROC curve for risk-score based feature fusion method, AUC = 0.83.
4.5 Discussion
In this study, we proposed a novel risk score-based feature reduction and fusion method for a prognosis
model and compared it to three different feature reduction methods in a PDAC CT setting, using pre-defined
radiomics and deep transfer learning feature banks. We found that the proposed risk score-based
method (a stacked model) had better prognostic performance than the traditional supervised and
unsupervised methods. This result is consistent with previous studies showing that ensemble methods can
outperform traditional machine learning models (Breiman & Leo, 1996; Dietterich, 2000; Rokach,
2005). Although each individual model based on PyRadiomics, LungRes, and ImgRes is not strong, the
final model achieved better performance.
As transfer learning increasingly plays a vital role in medical image analysis, the curse of dimensionality
is becoming more acute in radiomics-based prognosis models (Lao et al., 2017). Supervised feature
reduction methods such as univariate CPH and Boruta have difficulty balancing the false positive rate
and statistical power. When testing 277 features using univariate CPH, the probability of obtaining at
least one false positive is higher than 99%. Hence, supervised feature reduction methods lose their
effectiveness as feature banks continue to grow. In addition, unsupervised methods, including PCA and
Independent Component Analysis (ICA), are not able to boost the prognosis performance due to the
inherent noise in image features. On the other hand, ensemble methods, which use multiple models to
generate risk scores, may overcome these limitations of traditional feature reduction methods.
Additionally, since the risk scores were generated using the non-linear classifier Random Forest, they
are non-linear mappings of the original feature space, which provide a better fit for patients' survival
patterns. In our study, using PDAC CT images, the proposed stacked method achieved significantly
higher AUC compared to other feature fusion and reduction methods, including PCA (p value = 0.049),
Boruta (p value = 0.0015), and Cox regression (p value = 0.015).
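The stacking idea can be sketched as follows: each feature bank gets its own Random Forest, whose out-of-fold predicted probabilities serve as risk scores, and a simple meta-model fuses them. This is a minimal illustration on synthetic data; the cohort size, bank dimensions, and the logistic-regression stacker are assumptions for the sketch, not the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n = 98                                   # illustrative cohort size
y = rng.integers(0, 2, size=n)           # synthetic binary outcome
# Three synthetic feature banks standing in for PyRadiomics, LungRes, ImgRes.
banks = [rng.normal(size=(n, d)) + y[:, None] * 0.5 for d in (40, 64, 64)]

# Level 0: one Random Forest per bank; out-of-fold probabilities act as
# non-linear risk scores and avoid leaking labels into the stacker.
risk_scores = np.column_stack([
    cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                      cv=5, method="predict_proba")[:, 1]
    for X in banks
])

# Level 1: a simple meta-model fuses the three risk scores.
stacker = LogisticRegression().fit(risk_scores, y)
fused_risk = stacker.predict_proba(risk_scores)[:, 1]
```

The out-of-fold step is the key design choice: if the level-0 models scored the same patients they were trained on, the stacker would learn from overfitted probabilities.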
It is worth noting that although most deep radiomics features are independent of engineered
PyRadiomics features, significant Pearson correlation coefficients exist between certain deep
radiomics and PyRadiomics features. This result suggests that the relationship between deep radiomics
and PyRadiomics is complementary. Since most deep radiomics features do not have a linear
relationship with engineered radiomics features, fusing these two feature banks would provide more
information to the prognosis model. On the other hand, the existing correlations between first-order
radiomics and deep radiomics features suggest that, through backpropagation, pre-trained CNNs were
also able to capture associations between first-order features and patients' outcomes.
Although the proposed ensemble method outperforms traditional approaches, it has limitations.
Compared to supervised methods, where specific biomarkers can be identified during the process,
ensemble methods are hard to interpret since the stacked model is based on the outputs (probabilities)
of other models. Although the final prognosis probability (risk score) could, in principle, be expressed
in terms of the original features by using a more interpretable algorithm such as logistic regression
instead of Random Forests, deriving such a formulation would be a complicated task. In addition, the
current models provide binary outcomes as final outputs, ignoring the time to the event. Including the
time duration would further improve the prognosis model.
4.6 Conclusion
We compared the proposed risk score-based prognosis model to three traditional feature reduction
methods and found that the proposed ensemble method has the best performance in prognostication
tasks for resectable PDAC patients, elevating the AUC from 0.74 to 0.83. The proposed model exploits
state-of-the-art deep transfer learning methods and combines them with pre-defined radiomic
features to significantly improve prognostic performance.
Chapter 5: Study 3
Title: CNN-based Survival Model for Pancreatic Ductal Adenocarcinoma in Medical Imaging
Authors:
# Name Affiliations
1 Yucheng Zhang 1,2
2 Edrise M. Lobo-Mueller 3
3 Paul Karanicolas 4
4 Steven Gallinger 2
5 Masoom A. Haider 1,2
6 Farzad Khalvati 1,2
Affiliations
1: Department of Medical Imaging, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
2: Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
3: Sunnybrook Research Institute, Toronto, ON, Canada
4: Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON,
Canada.
5.1 Abstract
Cox proportional hazard model (CPH) is commonly used in clinical research for patient survival
analysis. However, the underlying linear assumption of CPH model limits its performance. In
medical imaging, the radiomics pipeline, which is based on imaging feature extraction and
analysis, is used in combination with the CPH model for survival analysis. Nevertheless, the
multicollinearity of radiomic features and the multiple testing problem further impede the
performance of such models. In this work, a convolutional neural network (CNN)-based survival
model was built and tested in a typical small-dataset setting in resectable PDAC cohorts (n=98).
The CNN-based survival model outperforms the traditional CPH-based radiomics approach in
terms of concordance index by 42%, providing a better fit for patients' survival patterns.
5.2 Introduction
As a statistical method, survival analysis is used in clinical research to identify potential risk factors
or biomarkers for a variety of clinical outcomes including patient survival for different diseases
such as cancer. Cox proportional hazard model (CPH) is one of the most commonly used survival
analysis tools (Fox & Weisberg, 2011; B. George, Seals, & Aban, 2014). CPH is a semiparametric
model that calculates the effects of features (independent variables) on the risk of a certain event
(e.g., death) (Cox, 1972). For example, CPH measures the effect of tumor size on the risk of death.
The CPH-based survival models can help clinicians make more customized (personalized)
treatment decisions for individual patients. However, CPH models assume that the independent
variables (features or biomarkers) make a linear contribution to the model with respect to time. In
many conditions, this assumption oversimplifies the relationship between biomarkers and
outcomes, especially in cancers with poor prognosis. With a limited sample size, violation of the
linear assumption is not obvious and may be overlooked. However, as data sizes increase, violation
of the linear assumption in CPH models becomes increasingly obvious and problematic,
diminishing the reliability of such models (Kattan, Hess, & Beck, 1998).
In most cases, non-linear risk models can provide a better fit for survival function. There are mainly
three types of non-linear survival models: (i) classification methods, (ii) time-encoded methods,
and (iii) risk-prediction methods (Gensheimer & Narasimhan, n.d.; Katzman et al., 2016).
Classification methods address the nonlinearity by using a classifier such as Random Forest
or Support Vector Machine (SVM). Although these classifiers perform well in nonlinear scenarios,
they discard the duration information in modelling, which may lead to unreliable models.
For diseases with poor prognosis, classification methods are also prone to biased predictions due to
imbalanced outcomes (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Time-encoded methods
separate a long time interval into multiple fragments and make predictions for each segment.
However, the performance of time-encoded models is usually not comparable to traditional CPH
models because they are based on multinomial classification and take the duration into account only
partially. Risk-prediction models, which are based on artificial neural networks (ANNs), learn
complex and nonlinear relationships between prognostic features and an individual's risk for a
given outcome. Therefore, ANN-based models can provide improved personalized
recommendations based on the computed risk.
Nevertheless, previous studies have demonstrated mixed results on ANN performance in survival
analysis, showing that in many cases ANNs have not outperformed standard methods for survival
analysis (Mariani et al., 1997; Sargent, 2001; Xiang, Lapuerta, Ryutov, Buckley, & Azen, 2000).
This may be due to small sample sizes and limited feature spaces leading to underfitted ANN
models. To exploit the ANN architecture and successfully apply it to complex cases, larger datasets
are required. Recent work has shown that, given a sufficient sample size, ANNs can, in fact,
outperform traditional CPH survival models (Ching, Zhu, & Garmire, 2018; Gensheimer &
Narasimhan, n.d.; Katzman et al., 2016).
In medical imaging, researchers have been working to extract diagnostic or prognostic features from
medical images in different modalities (V. Kumar et al., 2013; van Griethuysen et al., 2017; Yip
& Aerts, 2016). Efforts have been made to standardize these quantitative imaging (radiomics)
features by implementing open source libraries such as PyRadiomics (van Griethuysen et al.,
2017). These feature banks contain thousands of hand-crafted formulas designed to extract
distribution or texture information. Subsequently, these features are often tested with CPH models
to select significant features and build the final survival model (Y. Huang et al., 2016; Lao et
al., 2017). However, the high-dimensional nature of radiomic features introduces serious issues
in feature reduction and prognosis performance.
Through a standard radiomics feature bank, more than 1,000 features can be extracted from an ROI.
Given the high dimensionality of the features, multiple testing in CPH models becomes a challenge
(Yip & Aerts, 2016). In addition, the proposed feature sets are often highly correlated due to the
similarity of their formulas. Beyond the linear assumption in CPH modelling, this multicollinearity in
the feature space further impedes performance. The limitations of handcrafted radiomic features,
the fact that ANNs can outperform traditional CPH models, and recent advances in deep
learning together motivate designing a novel approach for survival modelling that combines CPH with
state-of-the-art deep learning algorithms for improved performance.
Previous work on deep learning-based survival analysis, including DeepSurv and NNET-survival,
consists of ANN-based survival models with modified loss functions to capture more accurate survival
patterns (Gensheimer & Narasimhan, n.d.; Katzman et al., 2016). These models take features (e.g.,
radiomic features) as input and return risks for patients at a given timepoint. However, as discussed
above, feeding radiomic features into these ANNs as input is not the optimal solution due to the
multicollinearity issue. In this research, we use medical images as input, replacing conventional
feature extractors with a convolutional neural network (CNN) architecture to extract disease-
specific image features associated with survival patterns. We hypothesized that CNNs
will extract more meaningful features and that, combined with a nonlinear loss function, the proposed
approach will provide a better fit for survival patterns.
As the most well-known architecture in deep learning, CNNs recognize imaging features by
applying multiple layers of convolution operations to the images (B, 2013; Litjens, Kooi, Bejnordi,
Setio, et al., 2017), where the weights of the convolution filters are finetuned during training via
the backpropagation process (Horn, Auret, McCoy, Aldrich, & Herbst, 2017). Thus, given sufficient
data, CNNs can be used to extract disease-specific imaging features for
diagnosis or prognosis purposes (Yosinski et al., 2014). Although traditional medical imaging-
based CNNs use a binary or multinomial classification loss function, the loss function can, as in
ANN-based survival models, be modified to capture survival patterns by accounting for
survival duration. By doing so, the CNN can be tuned to extract features that are associated with the
risk of the outcome within a certain duration. We hypothesized that the proposed CNN-based Survival
(CNN-Survival) model with an ANN survival loss function will outperform conventional radiomics
and CPH-based prognosis models.
5.3 Methods
5.3.1 Data
In order to gather sufficient data to train the proposed CNN-Survival model, CT scans along with
patient outcome (survival and time to death) from three cohorts were extracted. Cohort 1 consists
of 422 Non-small cell lung cancer (NSCLC) patients (Ganeshan et al., 2012). We used this data to
pre-train the CNNs since it has the largest sample size. Cohort 2 consists of 68 pancreatic ductal
adenocarcinoma (PDAC) patients and was used to finetune the final layers of the proposed CNN-Survival
architecture. Cohort 3, which is the test data, consists of 30 PDAC patients enrolled at another
independent hospital site (Eilaghi et al., 2017). For all patients in these three cohorts, CT scans,
tumor annotations (contours) performed by radiologists, and survival data were available.
institutions’ Research Ethics Boards approved these retrospective studies and waived the
requirement for informed consent. All methods were carried out in accordance with relevant
guidelines and regulations.
5.3.2 Architecture of the proposed CNN-Survival
A CNN architecture with six convolutional layers (CNN-Survival) was trained using images from
Cohort 1 as shown in Figure 5.1. Input images have dimensions of 140×140×1 (grey scale) and
contain the CT image within the manual contour of the tumor (example shown in Figure 5.2).
The first two convolutional layers have kernel size 3×3 with 32 filters. After a max pooling layer
of size 2×2, features pass through another two convolutional layers with the same kernel size but 64
filters. After another max pooling layer, features go through the final two
convolutional layers, which have 128 filters. To avoid overfitting with this small sample size, dropout
layers were added after every two convolutional layers. Finally, after the flatten layer, each
image was converted into 25,088 features, from which survival probabilities for a given time t were
calculated.
After training with Cohort 1, all the Conv-2D layers of the pre-trained model were frozen as feature
extraction layers. During the transfer learning process, the last dense layer was finetuned with
PDAC images from Cohort 2.
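As a sanity check on the stated flattened feature count, the tensor shapes can be traced through the network. The sketch below assumes unpadded (valid) 3×3 convolutions and a 2×2 max pool after each pair of convolutional layers (three pools in total; this pooling layout is an assumption based on Figure 5.1, since the text names only two pools). Under these assumptions the flattened size works out to the 25,088 features quoted above.

```python
def conv3x3_valid(size: int) -> int:
    # A 3x3 convolution without padding shrinks each spatial dim by 2.
    return size - 2

def maxpool2x2(size: int) -> int:
    # A 2x2 max pool halves each spatial dimension.
    return size // 2

size = 140  # input is 140x140x1 (grey scale)
for n_filters in (32, 64, 128):
    size = conv3x3_valid(size)   # first conv of the pair
    size = conv3x3_valid(size)   # second conv of the pair
    size = maxpool2x2(size)      # assumed pool after every conv pair

# 140 -> 68 -> 32 -> 14 spatially; the last block has 128 filters.
flattened = size * size * 128
# flattened == 25088, matching the feature count stated in the text
```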
5.3.3 Loss Function
To better fit the distribution of survival data, a modified loss function, proposed in (Gensheimer
& Narasimhan, n.d.), was applied to the CNNs architecture (Equation 1).
\[
\text{loss} \;=\; -\sum_{i=1}^{d_j} \ln\!\left(h_j^{\,i}\right) \;-\; \sum_{i=d_j+1}^{r_j} \ln\!\left(1 - h_j^{\,i}\right) \tag{1}
\]
In the formula above, h_j^i is the hazard probability for individual i during time interval j, r_j is
the number of individuals "in view" during interval j (i.e., those who survived to the start of this
period), and d_j is the number of patients who suffered a failure (e.g., death) during this interval.
The overall loss function is the sum of the losses for all time intervals (Gensheimer & Narasimhan, n.d.).
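Equation 1 can be implemented directly for a single time interval; the sketch below is a minimal numpy version (the hazard values are illustrative, not from the study).

```python
import numpy as np

def interval_loss(hazards_died, hazards_survived):
    """Negative log-likelihood for one time interval j (Equation 1).

    hazards_died: predicted hazard h_j^i for the d_j patients who
        failed during interval j.
    hazards_survived: predicted hazard for the remaining patients
        "in view" during the interval (indices d_j+1 .. r_j).
    """
    hazards_died = np.asarray(hazards_died, dtype=float)
    hazards_survived = np.asarray(hazards_survived, dtype=float)
    return (-np.sum(np.log(hazards_died))
            - np.sum(np.log(1.0 - hazards_survived)))

# Illustrative interval: one death predicted with hazard 0.8,
# one survivor predicted with hazard 0.1.
loss_j = interval_loss([0.8], [0.1])   # -ln(0.8) - ln(0.9)
```

The total training loss is simply the sum of `interval_loss` over all time intervals; confident predictions in the right direction (high hazard for deaths, low hazard for survivors) drive the loss toward zero.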
5.3.4 Training process and Transfer Learning
Training a CNNs-based survival model needs to finetune a large number of features. Given this
simple CNNs architecture, there were 1,091,699 trainable parameters. As such, the larger dataset,
cohort 1, was used to pre-train the network. In the cohort, 422 patients had 5,479 slices containing
manually contoured tumor regions. However, the region of interest (ROI) on some of the slices
were so small as shown in Figure 5.3. To solve this, we rank those slices using their ROI size and
picked the top 3,000 slices.
These 3,000 slices were fed into the CNN model. After training the initial model, all the weights
in the pre-trained model were frozen except for the dense layers. Then, 68 patients from Cohort 2
were used to finetune these two dense layers, which contain 627 parameters. Although the PDAC and
NSCLC images are both from CT, the two diseases have different survival patterns; hence, we
hypothesized that transfer learning with finetuning of the final layers was necessary to optimize the
performance of the network for modelling PDAC survival.
5.3.5 Traditional Radiomics analytic pipeline
In Cohort 2 and Cohort 3, 2D radiomic features were extracted from the manually contoured
regions using the PyRadiomics library (version 2.0), generating 1,676 features in total (van
Griethuysen et al., 2017). These features were selected using lasso-CPH (Tibshirani, 1997) in
Cohort 2. The significant features were then tested in Cohort 3. Performance was measured by the
concordance index, which was compared with that of CNN-Survival.
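Lasso-CPH couples an L1 penalty with the Cox partial likelihood and is typically fit with a dedicated survival library. As a dependency-light illustration of the selection step alone, the sketch below applies scikit-learn's plain `Lasso` to synthetic data (all sizes, features, and the outcome are invented) and keeps the features with non-zero coefficients; it is a stand-in for the survival version, not the study's actual fit.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_patients, n_features = 68, 50          # invented sizes for illustration
X = rng.normal(size=(n_patients, n_features))
# Synthetic outcome driven by only two of the fifty features.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=n_patients)

# The L1 penalty drives most coefficients exactly to zero,
# performing feature selection as a side effect of fitting.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)   # indices of surviving features
```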
Figure 5.1 The proposed CNN-Survival architecture: a six-layer CNN with batch normalization (BN)
and max pooling layers. Three dropout layers control potential overfitting.
Figure 5.2 Example of the input CT images
Left: NSCLC tumor from Cohort 1. Right: PDAC tumor from Cohort 2.
Figure 5.3 Example of the small ROI in Cohort 1
5.4 Results
When pre-training the proposed CNN-Survival with Cohort 1 at a learning rate of 0.0001, the loss
decreased significantly within the first ten epochs, where the losses of the training and validation
sets converged as shown in Figure 5.4. In the transfer learning process, the training and testing
losses also converged very quickly, reaching the same level as those of the pre-trained model.
Figure 5.4 Loss changes during pre-training
The concordance index (CI) was used to measure the fit of the survival function. In Cohort 1, CNN-
Survival achieved a CI of 0.684. In testing (Cohort 3), the proposed model achieved a CI of 0.628,
whereas the traditional radiomics approach yielded a CI of 0.442 using lasso-Cox regression. This
result indicates that the CNN-Survival model provided a better fit for survival patterns compared to
the conventional radiomics analytic pipeline.
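The concordance index counts, over all usable patient pairs, how often the predicted risks are ordered consistently with the observed survival times. A minimal implementation for right-censored data (the example values are synthetic) might look like:

```python
import itertools

def concordance_index(times, events, risks):
    """Concordance index for right-censored data.

    times: observed time per patient; events: 1 = death observed,
    0 = censored; risks: predicted risk (higher = worse prognosis).
    A pair is usable when the patient with the earlier time has an
    observed event; it is concordant when that patient also has the
    higher predicted risk. Tied risks count as 0.5.
    """
    concordant, usable = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        # Order the pair so patient a has the earlier time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or events[a] == 0:
            continue  # tied times or earlier patient censored: skip
        usable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return concordant / usable

# Risks perfectly anti-ordered with survival (shorter time, higher risk).
ci = concordance_index([2, 5, 9], [1, 1, 1], [0.9, 0.5, 0.1])
# ci == 1.0; a random risk score would score about 0.5
```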
Table 5.1: Concordance index results of the two approaches

                          Cohort 1    Cohort 3
CPH                       -           0.442
Proposed CNN-Survival     0.684       0.628
As discussed above, CNN-Survival can depict the survival probability of a patient at a given time.
We plotted the survival probability curves of two patients (one who survived versus one who died)
in the testing cohort in Figure 5.5 and Figure 5.6.
Figure 5.5: Survival probability curve generated by the CNN-Survival for a patient who died 511
days after CT using the ROI from CT scans
Figure 5.6: Survival probability curve generated by the CNN-Survival for a patient who survived
2415 days from CT using ROI from CT scans
5.5 Discussion
Using the proposed CNN-Survival model, prognosis performance is further improved. Deep
learning networks provide flexibility in modifying the dimension of the feature space and the loss
function, enabling us to extract disease-specific features and build more precise models. Using a
CNN-based survival model, we showed that, with the help of transfer learning, deep learning
architectures can outperform the traditional pipeline in a typical small-sample-size setting when
modelling survival for PDAC patients. The proposed transfer learning-based CNN-Survival model
has great potential. For example, researchers could pre-train a model using images from common
cancers with larger datasets and transfer this model to target rare cancers. A transfer learning-based
CNN-Survival model mitigates the need for a large sample size, allowing the model to be applied
to a wide range of cancer sites.
The proposed CNN-Survival model provides better performance than the traditional
radiomics analytic pipeline. With the modified loss function, CNN-Survival does not rely on the
linear assumption, making it suitable for more real-world scenarios. In the testing cohort, the
proposed CNN-Survival achieved a concordance index of 0.628. Although there was no prior work
in the PDAC field, the concordance index of our proposed CNN-Survival is comparable to
typical CIs for biomedical applications (Schmid, Wright, & Ziegler, 2016).
In this research, due to the small sample size of the PDAC cohorts, the proposed CNN was not optimal.
We used CT images from 68 patients to finetune the pre-trained CNN-Survival and tested it on another
30 patients. Although, through transfer learning, most of the parameters were trained using the pre-
training cohort, a large number of parameters still needed to be adjusted through finetuning.
Consequently, the small sample size may hamper this process; if a larger dataset were available,
performance might be further improved. Additionally, the pre-training domain is CT images from
NSCLC patients. Although it is the largest open source dataset we could find, non-small cell lung
cancer has a different biological background and survival patterns compared to PDAC. In future
research, using a more similar pre-training domain and a larger finetuning cohort, further improvement
can be expected.
5.6 Conclusion
The proposed CNN-based survival model outperforms the traditional radiomics pipeline in PDAC
prognosis. This approach offers a better fit for survival patterns based on CT images and overcomes
the limitations of conventional survival models.
Chapter 6: General Discussion
6.1: Study 1
6.1.1 Discussion
In the past few years, a large number of prognostic radiomics markers have been identified for
different types of cancers (Cozzi et al., 2019; Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al.,
2019; D. Kumar et al., 2015; V. Kumar et al., 2013; van Griethuysen et al., 2017; Yucheng
Zhang et al., 2017). However, radiomics-based prognosis models have often displayed limited
performance (Aerts et al., 2014; Yucheng Zhang et al., 2017). Although a large number of
features have been identified, most of these features are highly correlated. In Pancreatic Ductal
Adenocarcinoma (PDAC) prognostication, it has been found that “dissimilarity” and “inverse
difference normalized” are significantly associated with clinical outcomes (Eilaghi et al., 2017).
However, these two features are reciprocal to each other. Under this condition, a model built
using "dissimilarity" alone would have similar performance to a model with both features, since
the additional feature fails to add any further information. In practice, the multicollinearity of
radiomic features harms the performance of prognosis models.
Transfer learning methods have shown potential in image recognition tasks, especially in
studies with small sample sizes (Chuen-Kai Shie, Chung-Hisang Chuang, Chun-Nan Chou,
Meng-Hsi Wu, n.d.; Ravishankar et al.). In the medical imaging domain, it has been shown that
prognosis models using transfer learning methods achieve better performance (Lao et al., 2017).
However, most transfer learning models used in medical imaging are ImageNet pre-trained.
As a natural image database, ImageNet contains color images with R, G, B channels. In order
to apply ImageNet pre-trained models, researchers often copy grey-scale medical images into
the R, G, B channels. This is not optimal, since the color information in these channels is an
important feature in ImageNet pre-trained CNNs. Compared to natural images, medical images
from CT or MR have different signal-to-noise profiles. A kernel that can extract texture
information from natural images may lose its effectiveness on medical images. Thus, directly
adopting ImageNet pre-trained models may not be the optimal method.
In this study, we trained three prognosis models for PDAC, using the PyRadiomics based model,
the ImageNet pre-trained ResNet, and the Lung CT image pre-trained ResNet models. The first
two feature banks are commonly used in previous medical imaging studies. We have shown that,
for PDAC prognosis tasks, Lung CT image pre-trained ResNet provides the most informative
features and significantly higher prognosis performance compared to the other two feature banks.
This result suggests that medical imaging-based pre-trained CNNs may serve as high-
performance feature banks for future studies in cancer prognosis. Additionally, building an open-
source medical imaging pre-trained CNN would potentially benefit further medical imaging
studies for prognostication of cancer.
6.1.1.1 Feature analysis
It has been shown that radiomics features have significant associations with the overall survival
in resectable PDAC patients (Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al., 2019). In this
study, we found that, among 1,428 radiomics features, 283 features had significant associations
with OS. However, due to the large number of tests, the multiple comparison problem becomes
unavoidable: after FDR or Bonferroni control, none of those features remained significant. To
reduce the number of comparisons, Khalvati et al. implemented an intraclass correlation (ICC)
filter before the feature analysis (Khalvati, Zhang, Baig, et al., 2019). However, the ICC filter
method requires at least two readers, which is not feasible for many studies. To address this issue,
multi-center studies are needed to eliminate unstable features and fundamentally reduce the
number of tests.
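The scale of the problem is easy to quantify: with m independent tests at α = 0.05, the probability of at least one false positive is 1 − 0.95^m. The sketch below computes this family-wise error rate for the 1,428 features tested here and applies Bonferroni and Benjamini-Hochberg (FDR) corrections to a handful of illustrative p-values (pure numpy; the p-values are invented for the example).

```python
import numpy as np

m_tests = 1428                        # number of radiomic features tested
fwer = 1 - 0.95 ** m_tests            # P(at least one false positive)
# fwer is essentially 1 for m this large

def bonferroni(pvals, alpha=0.05):
    # Reject only p-values below alpha divided by the number of tests.
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of discoveries under the BH (FDR) procedure."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    n = len(pvals)
    # Largest k with p_(k) <= (k/n) * alpha; reject hypotheses 1..k.
    below = ranked <= (np.arange(1, n + 1) / n) * alpha
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.015, 0.04, 0.2, 0.9]   # illustrative p-values
bonf = bonferroni(pvals)                  # only 0.001 survives alpha/m
bh = benjamini_hochberg(pvals)            # BH, less conservative, also keeps 0.015
```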
For transfer learning features, the ImageNet pre-trained feature extractor generated 49
significant features, while the Lung CT based feature extractor produced two significant features.
Although the number of significant features was below that of PyRadiomics, as discussed above,
comparing the number or ratio of significant features is not appropriate. It has been shown that a
high number of features does not necessarily lead to a high-performance prognosis model
(Parmar, Grossmann, et al., 2015; Yucheng Zhang et al., 2017). Hence, comparing prognosis
performance on the same, independent validation cohort should be the gold standard for
comparing the performance of different feature banks.
6.1.1.2 Prognostic model performance
This study is one of the first to test the performance of prognosis models in
resectable PDAC cohorts (Khalvati, Zhang, Baig, et al., 2019). In other types of cancers,
radiomics-based prognosis models achieve AUC ranging from 0.55 to 0.9 (Hawkins et al., 2016;
Huynh et al., 2016; Parmar, Grossmann, et al., 2015; van Griethuysen et al., 2017; Yucheng
Zhang et al., 2017). Using the traditional radiomics analytic pipeline proposed by Zhang et al.,
the PyRadiomics prognosis model achieved an AUC of 0.57 in the validation cohort (Yucheng Zhang et
al., 2017). This AUC is lower than that of other radiomics studies, and the sample size limitation may
contribute to the lower performance. Given that our training set had only 68 patients, significantly
fewer than most radiomics studies, the prognosis model may be undertrained (V.
Kumar et al., 2013; van Griethuysen et al., 2017).
For the transfer learning models, ImgRes achieved an AUC of 0.71 and LungRes an AUC of 0.74.
Although the transfer learning feature extraction produced a smaller number of significant
features, their prognosis performance was significantly higher than that of the PyRadiomics
model (AUC = 0.57). Hence, future research on radiomics should not only report the
significance of image features but also the amount of variation they explain. We have
shown that the LungRes-based prognosis model performed significantly better than the
PyRadiomics model. However, due to sample size limitations, we did not have enough
statistical power to test whether a significant difference exists between the ImgRes and LungRes
models. Further research should concentrate on this issue.
6.1.1.3 Risk score
Many radiomics studies proposed image feature based risk scores for different types of cancers
(Cozzi et al., 2019; Khalvati, Zhang, Baig, et al., 2019; Lao et al., 2017). Risk scores can be
derived from logistic regression or other parametric or semi-parametric methods. Patients can
then be divided into low-risk and high-risk groups using the median of the scores. Finally, the Cox
Proportional Hazard model is often used to test whether a significant difference in survival
patterns exists between the two groups. Compared to a binary prognosis model, risk score
analysis takes the duration into account and presents it in Kaplan-Meier curves, which are more
informative and interpretable (Cozzi et al., 2019; Lao et al., 2017).
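The Kaplan-Meier curves behind this analysis come from the product-limit estimate: at each event time t_i, the running survival is multiplied by (1 − d_i/n_i), where d_i is the number of deaths at t_i and n_i the number still at risk. A minimal sketch for one risk group (the times and censoring flags are synthetic):

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times: observed time per patient; events: 1 = death, 0 = censored.
    Returns (event_times, survival_probabilities).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, probs = 1.0, []
    for t in event_times:
        at_risk = np.sum(times >= t)                    # n_i: still observed
        deaths = np.sum((times == t) & (events == 1))   # d_i: deaths at t
        surv *= 1.0 - deaths / at_risk
        probs.append(surv)
    return event_times, np.array(probs)

# Synthetic group: deaths at t = 1, 2, 4; one patient censored at t = 3.
t, s = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
# s == [0.75, 0.5, 0.0]: the censored patient leaves the risk set
# after t = 3 without registering a death
```

Computing this curve separately for the low-risk and high-risk groups, and comparing them with a log-rank test, reproduces the standard risk-score analysis described above.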
In this study, we produced three sets of risk scores using models trained from three different
feature banks. We discovered that the PyRadiomics- and ImgRes-based risk scores are not
significantly associated with overall survival in the independent validation cohort. In
contrast, the LungRes-based risk score had a significant p value. This suggests that medical
image pre-trained CNNs are not only able to provide binary survival predictions but are also
capable of offering precise prognosis with respect to time.
6.1.2 Strengths and limitations
6.1.2.1 Strengths
Although previous studies have identified a few radiomics features for PDAC prognosis, the
performance of those features had not been validated (Eilaghi et al., 2017; Khalvati, Zhang, Baig,
et al., 2019). In this study, we tested the performance of the PyRadiomics feature bank using
two independent cohorts and demonstrated the limitations of the current radiomics pipeline.
Lack of validation is a common problem in radiomics studies. We are one of the first groups to
provide cross-center validation for radiomics features in the context of resectable PDAC. Cross-
center validation produces more reliable results and should become a standard protocol in future
radiomics studies.
In addition to using independent validation cohorts, in this study we highlighted that transfer learning
methods provide prognostic features for OS in PDAC patients. Additionally, we confirmed that
a transfer learning model can achieve comparable prognosis performance with a small sample
size (n < 100). Transfer learning methods have enormous potential for rare diseases or studies
using limited datasets. Our study confirmed that, in prognostication, the medical image pre-trained
model has comparable or higher performance than the ImageNet pre-trained model, even though
the medical image pre-trained model was tuned with fewer than 1,000 images. Hence, medical
image pre-trained CNNs are remarkably valuable for medical imaging studies, providing
high-performance feature banks and working well in small-sample settings.
6.1.2.2 Limitations
As discussed above, one of the limitations of this study was its sample size. Due to the small
validation cohort (n=30), we were unable to compare the prognosis performance of the ImgRes
and LungRes models in terms of AUC. Although we found a significant difference in the risk
score analysis, the ROC curve comparison between LungRes and ImgRes was not significant.
Further research with a larger dataset is required to investigate the performance of these transfer
learning models.
Another limitation of this study is the lack of feature fusion. It has been shown that radiomics
and transfer learning features can be fused to achieve better performance (Lao et al.,
2017). Further investigation is needed to find an optimal way of fusing radiomics and transfer
learning features.
6.1.3 Implications
Compared to radiomics features, whose formulas are manually defined, transfer learning features
are target-specific, which leads to high performance in prognosis models. In
image-related tasks, the ImageNet pre-trained model has been very popular. However, the unique
nature of medical imaging requires a separately well-trained transferable model. Hence, follow-up
research should focus on developing a medical image pre-trained model, which will benefit future
prognosis studies.
Additionally, as more studies adopt transfer learning methods, feature dimensionality has
expanded at a very fast pace. Investigating the relationship between radiomics and
transfer learning features and finding the optimal way to reduce features are urgent issues for
subsequent studies.
6.2: Study 2
6.2.1 Discussion
Radiomics is a rapidly evolving field of study. In the past decade, feature size has expanded from
less than one hundred to a few thousand (van Griethuysen et al., 2017). With the addition of
transfer learning features, feature dimensions will continue to grow (D. George et al., 2017). As
the number of features increases, the new features are expected to contain extra information that was
not available in previous feature banks. However, rapidly growing feature banks worsen the multiple
comparison and multicollinearity problems that already exist in the current radiomics pipeline.
Further studies are required to investigate the relationship between engineered PyRadiomics
features and transfer learning features. Will transfer learning feature extractors produce features
similar to PyRadiomics, or will the two feature extractors capture entirely different information?
Answers to these questions will contribute to this evolving field in its transition period. Furthermore,
it was hypothesized that feature fusion would improve prognosis performance since it adds
more information. However, identifying the optimal way of fusing features is still required.
In this study, we extracted features using the PyRadiomics, the ImgRes and the LungRes feature
banks, investigated their correlations, proposed a risk score-based fusion method, and compared
their performances with those of other feature fusion methods. We found that the risk score-
based fusion method provides the best prognosis performance. It has been shown that building
multiple models improves classification accuracy, and our results confirmed this in the
context of medical imaging (P. Yang et al., n.d.).
6.2.1.1 Correlation between radiomics and transfer learning features
Multicollinearity is a major limitation in radiomics studies (Parmar, Grossmann, et al., 2015;
Yucheng Zhang et al., 2017). In this study, we highlighted that, on average, PyRadiomics features
are highly correlated with one another. Several feature pairs have correlation coefficients
higher than 0.7, indicating strong correlations. For example, "gradient_glcm_ClusterTendency"
and "gradient_firstorder_variance" have a correlation coefficient of 0.998. These high
correlations may explain the weak performance of the PyRadiomics feature bank in prognosis tasks
without proper preprocessing. Comparatively, the ImgRes- and LungRes-generated features have
lower average correlation coefficients.
In addition, we calculated the correlations between these three feature banks. We showed that
most transfer learning features and PyRadiomics features had no or weak associations with each
other, suggesting that feature fusion may provide additional information to the prognosis model
and improve its performance. However, a few feature pairs did show strong associations. We
noticed that first-order radiomics features generally have higher correlations with transfer
learning features. This result suggests that some transfer learning features are able to capture
information similar to that of first-order PyRadiomics features: through backpropagation, the
CNN learned the benefit of capturing first-order information and updated its kernels
accordingly.
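The cross-bank analysis described above reduces to computing a correlation coefficient for every pair of features drawn from two banks. A minimal sketch with NumPy, using random matrices in place of the actual feature banks (the patient and feature counts here are illustrative):

```python
import numpy as np

def cross_bank_correlation(bank_a: np.ndarray, bank_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every feature in bank_a (n x p)
    and every feature in bank_b (n x q); returns a p x q matrix."""
    a = (bank_a - bank_a.mean(axis=0)) / bank_a.std(axis=0)
    b = (bank_b - bank_b.mean(axis=0)) / bank_b.std(axis=0)
    return a.T @ b / a.shape[0]

rng = np.random.default_rng(0)
pyrad = rng.normal(size=(98, 5))      # e.g. 98 patients, 5 PyRadiomics features
transfer = rng.normal(size=(98, 3))   # 3 transfer learning features
corr = cross_bank_correlation(pyrad, transfer)
print(corr.shape)  # (5, 3), entries in [-1, 1]
```

Strong off-diagonal entries in this matrix would flag transfer learning features that duplicate engineered ones, such as the first-order features noted above.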
By investigating the correlations among the three feature banks, we now have a better
understanding of their associations. PyRadiomics features and transfer learning features should
be considered complementary rather than replacements for each other. Although the application of
transfer learning will become increasingly common in medical imaging studies, engineered
radiomics features remain valuable and should not be discarded.
6.2.1.2 Prognosis performance for different fusion methods
In previous radiomics studies, it has been shown that feature fusion can improve a model's
performance (Parmar, Grossmann, et al., 2015; Yucheng Zhang et al., 2017). Inspired by
multi-stream CNNs, we proposed a new risk score-based fusion method and compared its performance
in PDAC prognostication to that of other feature fusion methods. This model achieved the highest
AUC in the prognosis task (AUC = 0.83).
It should be noted that, given the fast-growing nature of feature banks, supervised feature
selection methods may fail to provide informative guidance due to the multiple comparison
problem. In our study, the three feature banks generated more than two thousand features.
Testing these features against the clinical outcome individually, there is a greater than 99%
chance of obtaining at least one false positive. Additionally, the multicollinearity issue is
worsened after supervised feature selection: if one feature is significant, then similar
features are more likely to be significant as well. Consequently, supervised feature selection
yields a selected feature space with a high number of correlated and false positive features,
hampering the model's performance.
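The >99% figure follows directly from the family-wise error rate under the (simplifying) assumption of independent tests: with m tests at significance level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m. A quick check for the feature counts in this study:

```python
alpha = 0.05
m = 2000  # roughly the number of features tested individually

# Family-wise error rate assuming independent tests:
fwer = 1 - (1 - alpha) ** m
print(f"P(at least one false positive) = {fwer:.6f}")  # effectively 1.0

# A Bonferroni correction keeps the family-wise rate near alpha:
alpha_bonf = alpha / m
fwer_corrected = 1 - (1 - alpha_bonf) ** m
print(f"After Bonferroni: {fwer_corrected:.4f}")  # ~0.0488
```

With m = 2000 the uncorrected rate is indistinguishable from 1, which is why univariate screening of a large feature bank is so prone to false discoveries.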
As an alternative feature selection method, Boruta has performed well in other domains. However,
in this study, we found that the Boruta feature selection method delivered the worst result. One
reason is the sample size: because the sample size was small, the Boruta algorithm lacked the
statistical power to differentiate between random and meaningful features. As a result, only two
features were selected by Boruta even when the cut-off was varied from 0.05 to 0.1. Since these
two features failed to explain much variance in patients' outcomes, the Boruta-guided model
yielded the worst performance among the four fusion methods. In a large-sample setting,
Boruta-guided feature reduction may achieve acceptable results.
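Boruta's core mechanism is to compare each real feature's importance against shuffled "shadow" copies of the features. A single-iteration sketch of that idea using scikit-learn (the real algorithm iterates this comparison and applies statistical tests; the toy data below is illustrative, with signal planted in the first two features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 68 training samples (as in this study), 20 features,
# only the first 2 actually carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(68, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=68) > 0).astype(int)

# Boruta's trick: append shuffled "shadow" copies of every feature,
# then keep only real features whose importance beats the best shadow.
shadow = rng.permuted(X, axis=0)      # shuffle each column independently
X_aug = np.hstack([X, shadow])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[:20]
shadow_max = rf.feature_importances_[20:].max()
selected = np.where(real_imp > shadow_max)[0]
print(selected)  # with small n, often only a few features survive
```

With so few samples, importance estimates are noisy and the shadow threshold becomes hard to beat, which is consistent with Boruta retaining only two features in this study.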
It has been shown that, given a large feature size, unsupervised feature fusion methods provide
the best performance (Yucheng Zhang et al., 2017). Our research confirmed these previous
findings. The PCA-based feature fusion method achieved performance similar to the models
discussed in Study 1. It is interesting to note that, when testing the prognosis performance of
each feature bank, LungRes had an AUC of 0.74, while the PCA-based feature fusion model had a
similar AUC (AUC = 0.72). This result indicates that the PCA-based feature fusion model was able
to collect and fuse useful information into a small number of components, reducing the
computation time and the overall complexity of the model.
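PCA-based fusion as described above can be sketched in a few lines with scikit-learn. The bank sizes below are placeholders (random matrices stand in for the real feature banks), and the component count is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical feature banks for 98 patients (sizes are illustrative).
pyradiomics = rng.normal(size=(98, 1300))
imgres = rng.normal(size=(98, 512))
lungres = rng.normal(size=(98, 512))

# Unsupervised fusion: concatenate the banks, standardize each feature,
# then project onto a small number of principal components.
fused = np.hstack([pyradiomics, imgres, lungres])   # (98, 2324)
scaled = StandardScaler().fit_transform(fused)
components = PCA(n_components=10, random_state=0).fit_transform(scaled)
print(components.shape)  # (98, 10): a compact fused representation
```

Because PCA never looks at the outcome, this fusion step sidesteps the multiple comparison problem entirely; a classifier is then trained on the handful of components rather than on thousands of raw features.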
In the end, the risk score-based fusion method provided the best overall performance in PDAC
prognosis. Since the risk scores were generated by random forests, they are non-linear mappings
of the original feature space and hence provide a better fit for complex patterns. In future
medical imaging studies, as more transfer learning-based feature banks are established, risk
score-based fusion will play an important role.
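The risk-score fusion idea resembles stacking: one model per feature bank produces a per-patient score, and a small final model is fit on the scores. A minimal sketch, assuming random forests as the bank-level models and out-of-fold probabilities as risk scores to avoid optimistic in-sample scores (the data, bank sizes, and labels below are toy placeholders, not the thesis pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n = 98
banks = {name: rng.normal(size=(n, p))
         for name, p in [("pyradiomics", 100), ("imgres", 64), ("lungres", 64)]}
y = rng.integers(0, 2, size=n)  # binary outcome (toy labels)

# One random forest per feature bank; out-of-fold predicted probabilities
# serve as that bank's risk score for each patient.
risk_scores = np.column_stack([
    cross_val_predict(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, method="predict_proba")[:, 1]
    for X in banks.values()
])

# The final prognosis model is fit on the three risk scores only.
final_model = LogisticRegression().fit(risk_scores, y)
print(risk_scores.shape)  # (98, 3)
```

Collapsing each bank to a single score before fusion keeps the final model's input dimension tiny regardless of how large the individual feature banks grow.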
6.2.2 Strengths and limitations
6.2.2.1 Strengths
We were one of the first groups to investigate the relationships between transfer learning and
radiomics features for PDAC in CT images. The correlation mapping is valuable for future medical
imaging-based transfer learning studies. We identified that transfer learning features can
resemble certain first-order radiomics features, which depict the shape and distribution of
pixel intensities. These correlations may become the foundation for fusing radiomics and
transfer learning features.
Furthermore, we compared the prognosis performance of the proposed fusion method with three
other existing feature reduction methods for PDAC prognosis. The best-performing model achieved
an AUC of 0.83, currently the highest reported for this task, outperforming other prognostic
biomarkers including image markers and CA19-9. Moreover, this AUC was achieved in an independent
validation cohort, avoiding the common circular reasoning problem.
Finally, the high performance of the risk score-based fusion method demonstrated its potential
for cancer prognosis. As more researchers recognize the importance of transfer learning, more
transfer learning feature banks will be developed in the medical imaging field. Instead of
selecting features from those feature banks through supervised feature reduction, fusing
features using model-generated risk scores may provide better performance.
6.2.2.2 Limitations
The main limitation of this study is the sample size. Given the small sample size (total: 98;
training data: 68), the Boruta method could not distinguish meaningful features, resulting in a
lower-performing model. Testing these methods on a larger dataset would provide stronger
evidence against the null hypothesis.
Secondly, although the risk score-based method achieved the best performance, its interpretation
is challenging. Compared to supervised feature selection methods, where the contributing
formulas can be identified, the risk score comes from a non-linear combination of features;
building a model on top of other models can be considered a black box. Further investigation is
required to address this issue.
Finally, in this study, we used a binary outcome (survival vs. death). For cancers with poor
prognosis, a binary prognosis may not be meaningful. A model that provides survival probability
at any given time may be more practical and translational.
6.2.3 Implications
We have shown that transfer learning features and radiomics features have a complementary
relationship. Although transfer learning features may achieve higher performance in specific
tasks, radiomics studies are still valuable: compared to the results using transfer learning
features only, the fusion method had better performance, confirming the additional value of
radiomics features.
As more studies in the medical imaging field adopt transfer learning methods, an increasing
number of feature banks will become available. Our results suggest that the proposed risk
score-based feature fusion method may become a standard protocol in the deep radiomics analytic
pipeline. Additionally, given the typical large-p, small-n datasets in radiomics studies,
caution should be taken when applying supervised feature selection approaches.
6.3: Study 3
6.3.1 Discussion
As a statistical method, survival analysis is commonly used in clinical research to depict
patients' survival patterns and identify potential risk factors (Lao et al., 2017). The Cox
Proportional Hazards (CPH) model, a semiparametric model, is often used in translational
research. However, the CPH model assumes that features make a linear contribution to the risk,
oversimplifying the relationship between biomarkers and outcomes. In this study, we used a
modified loss function in a CNN and achieved a higher concordance index compared to that of
radiomics markers.
It has been shown that a modified loss function in deep learning architectures yields a stronger
model than traditional CPH (Gensheimer & Narasimhan, n.d.; Katzman et al., 2016). Non-linear
activation layers in current deep learning architectures provide a non-linear mapping from input
to output, enabling the CNN-Survival architecture to model non-linear survival patterns.
However, in previous studies, the proposed deep learning-based survival models took vectors as
inputs, such as age, gender, and other clinical factors. In this study, through transfer
learning, we were able to train a deep learning-based survival model (CNN-Survival) directly on
CT images. Kernels in this prognosis model were optimized for extracting features related to
overall survival. Hence, the CNN-Survival model outperforms the traditional radiomics model in
terms of concordance index.
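The modified loss used by Cox-style deep survival models (e.g., Katzman et al., 2016) is the negative log partial likelihood evaluated on the network's risk outputs. A minimal NumPy sketch of that loss, with a sanity check on toy data (in practice this would be wrapped as a custom Keras loss; ties in event times are ignored here for simplicity):

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood for predicted risk scores.
    risk:  (n,) model outputs (log hazard ratios)
    time:  (n,) follow-up times
    event: (n,) 1 = death observed, 0 = censored
    """
    order = np.argsort(-time)                 # sort by descending follow-up time
    risk, event = risk[order], event[order].astype(bool)
    # log(sum of exp(risk)) over the risk set at each event time:
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.mean(risk[event] - log_risk_set[event])

# Sanity check: a risk ordering that matches the deaths (earlier death =
# higher risk) should give a lower loss than the reversed ordering.
time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 1])
good = neg_log_partial_likelihood(np.array([3.0, 2.0, 1.0, 0.0]), time, event)
bad = neg_log_partial_likelihood(np.array([0.0, 1.0, 2.0, 3.0]), time, event)
print(good < bad)  # True
```

Because the loss only compares risk scores within risk sets, it uses both event and duration information, which is exactly why survival data (not just a binary label) is needed to train such a model.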
Training a six-layer CNN-Survival model requires tuning more than one million parameters, which
is not feasible in most small-sample studies. However, using transfer learning, the top
convolution layers can be trained on another dataset with a larger sample size. Compared to
other transfer learning studies, which require medical images and a binary outcome, pre-training
a CNN-Survival requires not only the binary outcome but also duration information, which is
costly to collect. In this study, we used a popular Non-Small Cell Lung Cancer dataset
containing CT images and survival data for 422 patients. In the future, if a larger dataset
(sample size > 1,000) becomes available, the CNN architecture can be made deeper than the
six-layer CNN used in this study. The pre-trained model would then be able to extract more
informative features, further improving prognosis performance.
Future radiomics studies will evolve from manually defined features to a fusion of pre-defined
features and transfer learning features. Additionally, prognosis models will progress from
binary classifiers to hazard probability models. Consequently, new survival modelling methods
are required to handle these complex tasks, and the CNN-Survival model is a step toward this
goal. With larger pre-training datasets and validation cohorts, the proposed model has the
potential to become a standardized component of future deep radiomics analytics pipelines.
6.3.2 Strengths and limitations
6.3.2.1 Strengths
We were the first group to use a modified loss function and transfer learning to train a
CNN-based survival model on PDAC CT images. In an independent validation cohort, our
CNN-Survival achieved a better concordance index than traditional radiomics models. Our results
suggest that, instead of building binary prognosis models, training a CNN-based survival model
is also feasible in a small cohort with the help of transfer learning.
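The concordance index used to compare these models is Harrell's C: among all comparable patient pairs, the fraction where the higher-risk patient experienced the event earlier. A small, self-contained sketch (tied event times are ignored here; the toy inputs are illustrative):

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's C: fraction of comparable pairs where the higher-risk
    patient had the event earlier. Tied risks count as half-concordant."""
    concordant = 0.0
    permissible = 0
    for i in range(len(time)):
        if not event[i]:
            continue                 # pairs are anchored on observed events
        for j in range(len(time)):
            if time[j] > time[i]:    # patient j outlived patient i
                permissible += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / permissible

time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 0])       # last patient censored
print(concordance_index(np.array([3.0, 2.0, 1.0, 0.0]), time, event))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect risk ordering, which is why C, unlike AUC, is the natural metric for censored survival outcomes.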
In addition, the LungRes pre-trained CNN used in Study 1 and Study 2 was tuned using a binary
prognosis outcome (survival versus death). In this study, the pre-trained model was instead
optimized for survival probability at a given time. By doing so, deep radiomics features
extracted from this pre-trained CNN would be theoretically associated with patients' survival
patterns, offering more precise prognostic information for healthcare professionals.
6.3.2.2 Limitations
As discussed above, training a CNN-Survival requires not only the binary outcome but also
duration information, which is difficult to collect. The largest related open-source dataset we
could find contained 422 patients, collected by Aerts et al. (Aerts et al., 2014). To take full
advantage of these data, instead of taking only the largest ROI from each patient, we extracted
an ROI from every slice with expert annotations, gathering 5,000 ROIs from those 422 patients.
However, these ROIs are not independent, since many of them are adjacent to each other. In
addition, some of the ROIs are extremely small, making it difficult to extract useful features.
Although a large dropout rate was applied to control overfitting, the model started to overfit
after 20 epochs. A larger dataset (sample size > 1,000) would resolve this issue.
Chapter 7: Conclusions
Our studies showed that, for resectable Pancreatic Ductal Adenocarcinoma, transfer
learning-based deep radiomics features have the potential to provide more accurate
prognostication than conventional manually defined radiomics features.
Study 1 suggested that the performance of transfer learning features is associated with the
pre-training domain: LungRes, which was pre-trained on lung CT images, outperformed ImgRes,
which was pre-trained on ImageNet. This result indicates that, in future studies, feature
extractors pre-trained on medical images will play a more important role. In Study 2, by fusing
features from pre-trained feature extractors with the conventional PyRadiomics feature bank,
prognostication performance was further improved from 0.74 to 0.83 in terms of AUC, indicating
that transfer learning and conventional radiomic features carry different information from
medical images.
In the final study, by modifying the loss function in the CNN architecture, a CNN-Survival model
was trained. Taking CT scans as inputs and returning survival probabilities at given times,
CNN-Survival has the potential to provide more practical information for healthcare providers in
designing personalized treatment plans for resectable PDAC patients.
These three studies provide evidence that transfer learning approaches hold substantial
potential for the medical imaging field. Through transfer learning, more information can be
extracted from medical images, contributing to improved prognosis performance in our resectable
PDAC cohorts.
Chapter 8: Future directions
Our research provided evidence that CNNs pre-trained on medical images have the potential to
become standardized feature extractors. We have shown that CNNs pre-trained on medical images
may be more suitable for medical imaging tasks than CNNs pre-trained on natural images such as
ImageNet. In this work, we adopted two open-source lung CT databases with 888 and 422 patients.
If future studies adopt larger pre-training datasets with CT images from more than 1,000
patients, transfer learning approaches may achieve further improved performance. We anticipate
that, in the near future, such high-performance pre-trained CNNs will become standardized deep
radiomics feature extractors.
Secondly, current deep radiomics research mainly focuses on overall survival as the outcome.
However, diagnosis and prediction of treatment response for PDAC and other types of cancer are
also extremely valuable to patients and healthcare professionals. Further deep radiomics
research should not only focus on prognostication but also aim to solve other relevant research
questions, including early detection and the design of personalized treatment plans.
Finally, future deep radiomics studies should focus on visualization and interpretation. It is
critical to understand how deep radiomics features capture information, and which types of deep
radiomics features play an essential role in the context of resectable PDAC prognostication.
This will assist researchers and healthcare professionals in translating deep radiomics studies
into clinical practice.
References
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary
Reviews: Computational Statistics, 2(4), 433–459. https://doi.org/10.1002/wics.101
Abdullah, S. L. S., Hambali, H., & Jamil, N. (2012). An accurate thresholding-based
segmentation technique for natural images. In 2012 IEEE Symposium on Humanities,
Science and Engineering Research (pp. 919–922). IEEE.
https://doi.org/10.1109/SHUSER.2012.6269007
Adamska, A., Domenichini, A., & Falasca, M. (2017a). Pancreatic Ductal Adenocarcinoma:
Current and Evolving Therapies. International Journal of Molecular Sciences, 18(7).
https://doi.org/10.3390/ijms18071338
Adamska, A., Domenichini, A., & Falasca, M. (2017b). Pancreatic Ductal Adenocarcinoma:
Current and Evolving Therapies. International Journal of Molecular Sciences, 18(7).
https://doi.org/10.3390/ijms18071338
Aerts, H. J., Velazquez, E. R., Leijenaar, R. T., Parmar, C., Grossmann, P., Carvalho, S., …
Lambin, P. (2014). Decoding tumour phenotype by noninvasive imaging using a
quantitative radiomics approach. Nat Commun, 5, 4006.
https://doi.org/10.1038/ncomms5006
Afshar, P., Mohammadi, A., Plataniotis, K. N., Oikonomou, A., & Benali, H. (n.d.). From Hand-
Crafted to Deep Learning-based Cancer Radiomics: Challenges and Opportunities.
Retrieved from https://arxiv.org/pdf/1808.07954.pdf
Ahmad, N. A., Lewis, J. D., Ginsberg, G. G., Haller, D. G., Morris, J. B., Williams, N. N., …
Kochman, M. L. (2001). Long term survival after pancreatic resection for pancreatic
adenocarcinoma. The American Journal of Gastroenterology, 96(9), 2609–2615.
https://doi.org/10.1111/j.1572-0241.2001.04123.x
Alom, M. Z., Hasan, M., Yakopcic, C., Taha, T. M., & Asari, V. K. (2018). Recurrent Residual
Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image
Segmentation. Retrieved from http://arxiv.org/abs/1802.06955
André, T., de Gramont, A., Vernerey, D., Chibaudel, B., Bonnetain, F., Tijeras-Raballand, A., …
de Gramont, A. (2015). Adjuvant Fluorouracil, Leucovorin, and Oxaliplatin in Stage II to
III Colon Cancer: Updated 10-Year Survival and Outcomes According to BRAF Mutation
and Mismatch Repair Status of the MOSAIC Study. Journal of Clinical Oncology, 33(35),
4176–4187. https://doi.org/10.1200/JCO.2015.63.4238
Antony, J., McGuinness, K., Connor, N. E. O., & Moran, K. (2016). Quantifying radiographic
knee osteoarthritis severity using deep convolutional neural networks. Quantifying
Radiographic Knee Osteoarthritis Severity Using Deep Convolutional Neural Networks.
Retrieved from http://arxiv.org/abs/1609.02469
Anwar, S. M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., & Khan, M. K. (2018).
Medical Image Analysis using Convolutional Neural Networks: A Review. Journal of
Medical Systems, 42(11), 226. https://doi.org/10.1007/s10916-018-1088-1
Armato, S. G., McLennan, G., Bidaut, L., McNitt-Gray, M. F., Meyer, C. R., Reeves, A. P., …
Clarke, L. P. (2011). The Lung Image Database Consortium (LIDC) and Image Database
Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans.
Medical Physics, 38(2), 915–931. https://doi.org/10.1118/1.3528204
Arnold, L. D., Patel, A. V., Yan, Y., Jacobs, E. J., Thun, M. J., Calle, E. E., & Colditz, G. A.
(2009). Are Racial Disparities in Pancreatic Cancer Explained by Smoking and
Overweight/Obesity? Cancer Epidemiology Biomarkers & Prevention, 18(9), 2397–2405.
https://doi.org/10.1158/1055-9965.EPI-09-0080
B, R. W. (2013). Advances in Neural Networks – ISNN 2013, 7951, 12–20.
https://doi.org/10.1007/978-3-642-39065-4
Bai, Y., Lin, Y., Tian, J., Shi, D., Cheng, J., Haacke, E. M., … Wang, M. (2016). Grading of
Gliomas by Using Monoexponential, Biexponential, and Stretched Exponential Diffusion-
weighted MR Imaging and Diffusion Kurtosis MR Imaging. Radiology, 278(2), 496–504.
https://doi.org/10.1148/radiol.2015142173
Ballehaninna, U. K., & Chamberlain, R. S. (2012). The clinical utility of serum CA 19-9 in the
diagnosis, prognosis and management of pancreatic adenocarcinoma: An evidence based
appraisal. Journal of Gastrointestinal Oncology, 3(2), 105–119.
https://doi.org/10.3978/j.issn.2078-6891.2011.021
Bartel, D. P. (2009). MicroRNAs: Target Recognition and Regulatory Functions. Cell, 136(2),
215–233. https://doi.org/10.1016/j.cell.2009.01.002
Becker, A. E., Hernandez, Y. G., Frucht, H., & Lucas, A. L. (2014). Pancreatic ductal
adenocarcinoma: Risk factors, screening, and early detection. World Journal of
Gastroenterology, 20(32), 11182–11198. https://doi.org/10.3748/wjg.v20.i32.11182
Benson, A. B., Venook, A. P., Al-Hawary, M. M., Cederquist, L., Chen, Y.-J., Ciombor, K.
K., … Freedman-Cass, D. A. (2018). NCCN Guidelines Insights: Colon Cancer, Version
2.2018. Journal of the National Comprehensive Cancer Network, 16(4), 359–369.
https://doi.org/10.6004/jnccn.2018.0021
Blagus, R., Lusa, L., Bishop, C., He, H., Garcia, E., Daskalaki, S., … Klaar, S. (2013). SMOTE
for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1), 106.
https://doi.org/10.1186/1471-2105-14-106
Bloomston, M., Frankel, W. L., Petrocca, F., Volinia, S., Alder, H., Hagan, J. P., … Croce, C. M.
(2007). MicroRNA Expression Patterns to Differentiate Pancreatic Adenocarcinoma From
Normal Pancreas and Chronic Pancreatitis. JAMA, 297(17), 1901.
https://doi.org/10.1001/jama.297.17.1901
Bootcov, M. R., Bauskin, A. R., Valenzuela, S. M., Moore, A. G., Bansal, M., He, X. Y., …
Breit, S. N. (1997). MIC-1, a novel macrophage inhibitory cytokine, is a divergent member
of the TGF- superfamily. Proceedings of the National Academy of Sciences, 94(21),
11514–11519. https://doi.org/10.1073/pnas.94.21.11514
Bosetti, C., Lucenteforte, E., Silverman, D. T., Petersen, G., Bracci, P. M., Ji, B. T., … La
Vecchia, C. (2012). Cigarette smoking and pancreatic cancer: an analysis from the
International Pancreatic Cancer Case-Control Consortium (Panc4). Annals of Oncology,
23(7), 1880–1888. https://doi.org/10.1093/annonc/mdr541
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1), 49–64.
https://doi.org/10.1023/A:1018046112532
Buckhaults, P., Rago, C., Vogelstein, B., St. Croix, B., Romans, K. E., Saha, S., … Kinzler, K.
W. (2001). Secreted and cell surface genes expressed in benign and malignant colorectal
tumors. Cancer Research, 61(19), 6996–7001.
Bünger, S., Laubert, T., Roblick, U. J., & Habermann, J. K. (2011). Serum biomarkers for
improved diagnostic of pancreatic cancer: a current overview. Journal of Cancer Research
and Clinical Oncology, 137(3), 375–389. https://doi.org/10.1007/s00432-010-0965-x
Capello, M., Lee, M., Wang, H., Babel, I., Katz, M. H., Fleming, J. B., … Hanash, S. M. (2015).
Carboxylesterase 2 as a Determinant of Response to Irinotecan and Neoadjuvant
FOLFIRINOX Therapy in Pancreatic Ductal Adenocarcinoma. JNCI: Journal of the
National Cancer Institute, 107(8). https://doi.org/10.1093/jnci/djv132
Caponi, S., Funel, N., Frampton, A. E., Mosca, F., Santarpia, L., Van der Velde, A. G., …
Giovannetti, E. (2013). The good, the bad and the ugly: a tale of miR-101, miR-21 and
miR-155 in pancreatic intraductal papillary mucinous neoplasms. Annals of Oncology,
24(3), 734–741. https://doi.org/10.1093/annonc/mds513
Chandrakumar, T., & Kathirvel, R. (n.d.). Classifying Diabetic Retinopathy using Deep Learning
Architecture. Retrieved from www.ijert.org
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
https://doi.org/10.1613/jair.953
Chen, S.-Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for multiple
comparisons. Journal of Thoracic Disease, 9(6), 1725–1729.
https://doi.org/10.21037/jtd.2017.05.34
Chen, X., Oshima, K., Schott, D., Wu, H., Hall, W., Song, Y., … Li, X. A. (2017). Assessment
of treatment response during chemoradiation therapy for pancreatic cancer based on
quantitative radiomic analysis of daily CTs : An exploratory study. PLoS ONE, 12(6), 1–14.
https://doi.org/10.1371/journal.pone.0178961
Chen, Y.-Z., Liu, D., Zhao, Y.-X., Wang, H.-T., Gao, Y., & Chen, Y. (2014). Diagnostic
Performance of Serum Macrophage Inhibitory Cytokine-1 in Pancreatic Cancer: A Meta-
Analysis and Meta-Regression Analysis. DNA and Cell Biology, 33(6), 370–377.
https://doi.org/10.1089/dna.2013.2237
Ching, T., Zhu, X., & Garmire, L. X. (2018). Cox-nnet: An artificial neural network method for
prognosis prediction of high-throughput omics data. PLoS Computational Biology, 14(4),
e1006076. https://doi.org/10.1371/journal.pcbi.1006076
Cho, J., Lee, K., Shin, E., Choy, G., & Do, S. (2016). How much data is needed to train a
medical image deep learning system to achieve necessary high accuracy? Retrieved from
https://arxiv.org/pdf/1511.06348.pdf
Chollet, F., & Others, A. (2015). Keras. Retrieved from https://keras.io
Christ, P. F., Elshaer, M. E. A., Ettlinger, F., Tatavarty, S., Bickel, M., Bilic, P., … Menze, B. H.
(2016). Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully
Convolutional Neural Networks and 3D Conditional Random Fields (pp. 415–423).
Springer, Cham. https://doi.org/10.1007/978-3-319-46723-8_48
Christ, P. F., Ettlinger, F., Grün, F., Ezzeldin, M., Elshaer, A., Lipková, J., … Menze, B. (n.d.).
Automatic Liver and Tumor Segmentation of CT and MRI Volumes Using Cascaded Fully
Convolutional Neural Networks. Retrieved from https://arxiv.org/pdf/1702.05970.pdf
Shie, C.-K., Chuang, C.-H., Chou, C.-N., Wu, M.-H., & Chang, E. Y. (n.d.). Transfer
Representation Learning for Medical Image Analysis. Retrieved from
http://infolab.stanford.edu/~echang/HTC_OM_Final.pdf
Cireşan, D. C., Giusti, A., Gambardella, L. M., & Schmidhuber, J. (n.d.). Deep Neural Networks
Segment Neuronal Membranes in Electron Microscopy Images. Retrieved from http://www.idsia.ch/
Clark, T., Zhang, J., Baig, S., Wong, A., Haider, M. A., & Khalvati, F. (2017). Fully automated
segmentation of prostate whole gland and transition zone in diffusion-weighted MRI using
convolutional neural networks. Journal of Medical Imaging, 4(04), 1.
https://doi.org/10.1117/1.JMI.4.4.041307
Conroy, T., Desseigne, F., Ychou, M., Bouché, O., Guimbaud, R., Bécouarn, Y., … Ducreux, M.
(2011). FOLFIRINOX versus Gemcitabine for Metastatic Pancreatic Cancer. New England
Journal of Medicine, 364(19), 1817–1825. https://doi.org/10.1056/NEJMoa1011923
Coroller, T. P., Grossmann, P., Hou, Y., Rios Velazquez, E., Leijenaar, R. T. H., Hermann,
G., … Aerts, H. J. W. L. (2015a). CT-based radiomic signature predicts distant metastasis in
lung adenocarcinoma. Radiotherapy and Oncology, 114(3), 345–350.
https://doi.org/10.1016/j.radonc.2015.02.015
Coroller, T. P., Grossmann, P., Hou, Y., Rios Velazquez, E., Leijenaar, R. T. H., Hermann,
G., … Aerts, H. J. W. L. (2015b). CT-based radiomic signature predicts distant metastasis
in lung adenocarcinoma. Radiotherapy and Oncology, 114(3), 345–350.
https://doi.org/10.1016/j.radonc.2015.02.015
Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society.
Series B (Methodological). WileyRoyal Statistical Society. https://doi.org/10.2307/2985181
Cozzi, L., Comito, T., Fogliata, A., Franzese, C., Franceschini, D., Bonifacio, C., … Scorsetti,
M. (2019). Computed tomography based radiomic signature as predictive of survival and
local control after stereotactic body radiation therapy in pancreatic carcinoma. PLOS ONE,
14(1), e0210758. https://doi.org/10.1371/journal.pone.0210758
Cui, Y., Song, J., Pollom, E., Alagappan, M., Shirato, H., Chang, D. T., … Li, R. (2016).
Quantitative Analysis of 18F-Fluorodeoxyglucose Positron Emission Tomography
Identifies Novel Prognostic Imaging Biomarkers in Locally Advanced Pancreatic Cancer
Patients Treated With Stereotactic Body Radiation Therapy. International Journal of
Radiation Oncology*Biology*Physics, 96(1), 102–109.
https://doi.org/10.1016/j.ijrobp.2016.04.034
Cunliffe, A., Armato, S. G., Castillo, R., Pham, N., Guerrero, T., & Al-Hallaq, H. A. (2015).
Lung Texture in Serial Thoracic Computed Tomography Scans: Correlation of Radiomics-
based Features With Radiation Therapy Dose and Radiation Pneumonitis Development.
International Journal of Radiation Oncology*Biology*Physics, 91(5), 1048–1056.
https://doi.org/10.1016/j.ijrobp.2014.11.030
De Fauw, J., Ledsam, J. R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., …
Ronneberger, O. (2018). Clinically applicable deep learning for diagnosis and referral in
retinal disease. Nature Medicine, 24(9), 1342–1350. https://doi.org/10.1038/s41591-018-
0107-6
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two
or more correlated receiver operating characteristic curves: a nonparametric approach.
Biometrics, 44(3), 837–845. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/3203132
Deng, J., Dong, W., Socher, R., Li, L.-J., Kai Li, & Li Fei-Fei. (2009). ImageNet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition (pp. 248–255). IEEE. https://doi.org/10.1109/CVPR.2009.5206848
Dietterich, T. G. (2000). Ensemble Methods in Machine Learning (pp. 1–15). Springer, Berlin,
Heidelberg. https://doi.org/10.1007/3-540-45014-9_1
Dillhoff, M., Liu, J., Frankel, W., Croce, C., & Bloomston, M. (2008). MicroRNA-21 is
Overexpressed in Pancreatic Cancer and a Potential Predictor of Survival. Journal of
Gastrointestinal Surgery, 12(12), 2171–2176. https://doi.org/10.1007/s11605-008-0584-x
Du, S. S., Wang, Y., Zhai, X., Balakrishnan, S., Salakhutdinov, R., & Singh, A. (n.d.). How
Many Samples are Needed to Estimate a Convolutional Neural Network? Retrieved from
https://papers.nips.cc/paper/7320-how-many-samples-are-needed-to-estimate-a-
convolutional-neural-network.pdf
Eibl, A. S. and G. (2015). Pancreatic Ductal Adenocarcinoma. Pancreapedia: The Exocrine
Pancreas Knowledge Base. https://doi.org/10.3998/PANC.2015.14
Eilaghi, A., Baig, S., Zhang, Y., Zhang, J., Karanicolas, P., Gallinger, S., … Haider, M. A.
(2017). CT texture features are associated with overall survival in pancreatic ductal
adenocarcinoma – a quantitative analysis, 1–7. https://doi.org/10.1186/s12880-017-0209-5
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017).
Dermatologist-level classification of skin cancer with deep neural networks. Nature,
542(7639), 115–118. https://doi.org/10.1038/nature21056
Etymologia: Bonferroni correction. (2015). Emerging Infectious Diseases, 21(2), 289.
https://doi.org/10.3201/EID2102.ET2102
Farrell, J. J., Elsaleh, H., Garcia, M., Lai, R., Ammar, A., Regine, W. F., … Mackey, J. R.
(2009). Human Equilibrative Nucleoside Transporter 1 Levels Predict Response to
Gemcitabine in Patients With Pancreatic Cancer. Gastroenterology, 136(1), 187–195.
https://doi.org/10.1053/j.gastro.2008.09.067
Fawcett, T. (2005). An introduction to ROC analysis. Pattern Recognition Letters.
https://doi.org/10.1016/j.patrec.2005.10.010
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., & Amorim Fernández-Delgado,
D. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification
Problems? Journal of Machine Learning Research, 15, 3133–3181.
Ferrone, C. R., Pieretti-Vanmarcke, R., Bloom, J. P., Zheng, H., Szymonifka, J., Wargo, J. A., …
Warshaw, A. L. (2012). Pancreatic ductal adenocarcinoma: long-term survival does not
equal cure. Surgery, 152(3 Suppl 1), S43-9. https://doi.org/10.1016/j.surg.2012.05.020
FOLFIRINOX versus Gemcitabine for Metastatic Pancreatic Cancer. (2011). New England
Journal of Medicine, 365(8), 768–769. https://doi.org/10.1056/NEJMc1107627
Foucher, E. D., Ghigo, C., Chouaib, S., Galon, J., Iovanna, J., & Olive, D. (2018). Pancreatic
Ductal Adenocarcinoma: A Strong Imbalance of Good and Bad Immunological Cops in the
Tumor Microenvironment. Frontiers in Immunology, 9, 1044.
https://doi.org/10.3389/fimmu.2018.01044
Fox, J., & Weisberg, S. (2011). Cox Proportional-Hazards Regression for Survival Data in R.
Retrieved from
https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Cox-Regression.pdf
Fujita, H., Ohuchida, K., Mizumoto, K., Itaba, S., Ito, T., Nakata, K., … Tanaka, M. (2010).
Gene Expression Levels as Predictive Markers of Outcome in Pancreatic Cancer after
Gemcitabine-Based Adjuvant Chemotherapy. Neoplasia, 12(10), 807–817.
https://doi.org/10.1593/neo.10458
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–
202. https://doi.org/10.1007/BF00344251
Gabryś, H. S., Buettner, F., Sterzing, F., Hauswald, H., & Bangert, M. (2018). Design and
Selection of Machine Learning Methods Using Radiomics and Dosiomics for Normal
Tissue Complication Probability Modeling of Xerostomia. Frontiers in Oncology, 8.
https://doi.org/10.3389/fonc.2018.00035
Ganeshan, B., Abaleke, S., Young, R. C. D., Chatwin, C. R., & Miles, K. A. (2010). Texture
analysis of non-small cell lung cancer on unenhanced computed tomography: Initial
evidence for a relationship with tumour glucose metabolism and stage. Cancer Imaging,
10(1), 137–143. https://doi.org/10.1102/1470-7330.2010.0021
Ganeshan, B., Panayiotou, E., Burnand, K., Dizdarevic, S., & Miles, K. (2012). Tumour
heterogeneity in non-small cell lung carcinoma assessed by CT texture analysis: A potential
marker of survival. European Radiology, 22(4), 796–802. https://doi.org/10.1007/s00330-
011-2319-8
Gao, X., Lin, S., & Wong, T. Y. (2015). Automatic Feature Learning to Grade Nuclear Cataracts
Based on Deep Learning. IEEE Transactions on Biomedical Engineering, 62(11), 2693–
2701. https://doi.org/10.1109/TBME.2015.2444389
Gensheimer, M. F., & Narasimhan, B. (n.d.). A Scalable Discrete-Time Survival Model for
Neural Networks. Retrieved from http://github.com/MGensheimer/nnet-survival
George, B., Seals, S., & Aban, I. (2014). Survival analysis and regression models. Journal of
Nuclear Cardiology : Official Publication of the American Society of Nuclear Cardiology,
21(4), 686–694. https://doi.org/10.1007/s12350-014-9908-2
George, D., Shen, H., & Huerta, E. A. (2017). Deep Transfer Learning: A new deep learning
glitch classification method for advanced LIGO. Retrieved from
http://arxiv.org/abs/1706.07446
Gillies, R. J., Kinahan, P. E., & Hricak, H. (2015). Radiomics: Images Are More than Pictures,
They Are Data. Radiology, 278(2), 563–577. https://doi.org/10.1148/radiol.2015151169
Gold, D. V., Karanjawala, Z., Modrak, D. E., Goldenberg, D. M., & Hruban, R. H. (2007).
PAM4-Reactive MUC1 Is a Biomarker for Early Pancreatic Adenocarcinoma. Clinical
Cancer Research, 13(24), 7380–7387. https://doi.org/10.1158/1078-0432.CCR-07-1488
Gold, David V., Gaedcke, J., Ghadimi, B. M., Goggins, M., Hruban, R. H., Liu, M., …
Goldenberg, D. M. (2013). PAM4 enzyme immunoassay alone and in combination with CA
19-9 for the detection of pancreatic adenocarcinoma. Cancer, 119(3), 522–528.
https://doi.org/10.1002/cncr.27762
Gold, David V., Lew, K., Maliniak, R., Hernandez, M., & Cardillo, T. (1994). Characterization
of monoclonal antibody PAM4 reactive with a pancreatic cancer mucin. International
Journal of Cancer, 57(2), 204–210. https://doi.org/10.1002/ijc.2910570213
Goonetilleke, K. S., & Siriwardena, A. K. (2007). Systematic review of carbohydrate antigen
(CA 19-9) as a biochemical marker in the diagnosis of pancreatic cancer. European Journal
of Surgical Oncology (EJSO), 33(3), 266–270. https://doi.org/10.1016/j.ejso.2006.10.004
Gourgou-Bourgade, S., Bascoul-Mollevi, C., Desseigne, F., Ychou, M., Bouché, O., Guimbaud,
R., … Conroy, T. (2013). Impact of FOLFIRINOX Compared With Gemcitabine on
Quality of Life in Patients With Metastatic Pancreatic Cancer: Results From the PRODIGE
4/ACCORD 11 Randomized Trial. Journal of Clinical Oncology, 31(1), 23–29.
https://doi.org/10.1200/JCO.2012.44.4869
Greenhalf, W., Ghaneh, P., Neoptolemos, J. P., Palmer, D. H., Cox, T. F., Lamb, R. F., …
Büchler, M. W. (2014). Pancreatic Cancer hENT1 Expression and Survival From
Gemcitabine in Patients From the ESPAC-3 Trial. JNCI: Journal of the National Cancer
Institute, 106(1). https://doi.org/10.1093/jnci/djt347
Kim, H. G., Choi, Y., & Ro, Y. M. (n.d.). Modality-bridge Transfer Learning for Medical
Image Classification. Retrieved from https://arxiv.org/ftp/arxiv/papers/1708/1708.03111.pdf
Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., … Webster, D.
R. (2016). Development and Validation of a Deep Learning Algorithm for Detection of
Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22), 2402.
https://doi.org/10.1001/jama.2016.17216
Haider, M. A., Vosough, A., Khalvati, F., Kiss, A., Ganeshan, B., & Bjarnason, G. A. (2017). CT
texture analysis: a potential tool for prediction of survival in patients with metastatic clear
cell carcinoma treated with sunitinib. Cancer Imaging, 17(1).
https://doi.org/10.1186/s40644-017-0106-8
Hambarde, P., Talbar, S. N., Sable, N., Mahajan, A., Chavan, S. S., & Thakur, M. (2019).
Radiomics for peripheral zone and intra-prostatic urethra segmentation in MR imaging.
Biomedical Signal Processing and Control, 51, 19–29.
https://doi.org/10.1016/J.BSPC.2019.01.024
Hawkins, S., Wang, H., Liu, Y., Garcia, A., Stringfield, O., Krewer, H., … Gillies, R. J. (2016).
Predicting Malignant Nodules from Screening CT Scans. Journal of Thoracic Oncology,
11(12), 2120–2128. https://doi.org/10.1016/j.jtho.2016.07.002
He, K., Girshick, R., & Dollár, P. (2018). Rethinking ImageNet Pre-training. Retrieved from
https://arxiv.org/abs/1811.08883
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
Retrieved from http://arxiv.org/abs/1512.03385
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
https://doi.org/10.1109/CVPR.2016.90
Hertel, L., Barth, E., Käster, T., & Martinetz, T. (2017). Deep Convolutional Neural Networks as
Generic Feature Extractors. Retrieved from http://arxiv.org/abs/1710.02286
Hong, T. H., & Park, I. Y. (2014). MicroRNA expression profiling of diagnostic needle aspirates
from surgical pancreatic cancer specimens. Annals of Surgical Treatment and Research,
87(6), 290. https://doi.org/10.4174/astr.2014.87.6.290
Horn, Z. C., Auret, L., McCoy, J. T., Aldrich, C., & Herbst, B. M. (2017). Performance of
Convolutional Neural Networks for Feature Extraction in Froth Flotation Sensing. IFAC-
PapersOnLine, 50(2), 13–18. https://doi.org/10.1016/J.IFACOL.2017.12.003
Horvat, N., Veeraraghavan, H., Khan, M., Blazic, I., Zheng, J., Capanu, M., … Petkovska, I.
(2018). MR Imaging of Rectal Cancer: Radiomics Analysis to Assess Treatment Response
after Neoadjuvant Therapy. Radiology, 287(3), 833–843.
https://doi.org/10.1148/radiol.2018172300
Hruban, R. H., Canto, M. I., Goggins, M., Schulick, R., & Klein, A. P. (2010). Update on
familial pancreatic cancer. Advances in Surgery, 44.
https://doi.org/10.1016/j.yasu.2010.05.011
Huang, Y.-Q., Liang, C.-H., He, L., Tian, J., Liang, C.-S., Chen, X., … Liu, Z.-Y. (2016).
Development and Validation of a Radiomics Nomogram for Preoperative Prediction of
Lymph Node Metastasis in Colorectal Cancer. Journal of Clinical Oncology : Official
Journal of the American Society of Clinical Oncology, 34(18), 2157–2164.
https://doi.org/10.1200/JCO.2015.65.9128
Huang, Y., Liu, Z., He, L., Chen, X., Pan, D., Ma, Z., … Liang, C. (2016). Radiomics Signature:
A Potential Biomarker for the Prediction of Disease-Free Survival in Early-Stage (I or II)
Non-Small Cell Lung Cancer. Radiology, 281(3), 947–957.
https://doi.org/10.1148/radiol.2016152234
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey
striate cortex. The Journal of Physiology, 195(1), 215–243.
https://doi.org/10.1113/jphysiol.1968.sp008455
Huxley, R., Ansary-Moghaddam, A., Berrington de González, A., Barzi, F., & Woodward, M.
(2005). Type-II diabetes and pancreatic cancer: A meta-analysis of 36 studies. British
Journal of Cancer, 92(11), 2076–2083. https://doi.org/10.1038/sj.bjc.6602619
Huynh, E., Coroller, T. P., Narayan, V., Agrawal, V., Hou, Y., Romano, J., … Aerts, H. J. W. L.
(2016). CT-based radiomic analysis of stereotactic body radiation therapy patients with lung
cancer. Radiotherapy and Oncology, 120(2), 258–266.
https://doi.org/10.1016/j.radonc.2016.05.024
Ilic, M., & Ilic, I. (2016). Epidemiology of pancreatic cancer. World Journal of
Gastroenterology, 22(44), 9694–9705. https://doi.org/10.3748/wjg.v22.i44.9694
Infante, J. R., Matsubayashi, H., Sato, N., Tonascia, J., Klein, A. P., Riall, T. A., … Goggins, M.
(2007). Peritumoral Fibroblast SPARC Expression and Patient Outcome With Resectable
Pancreatic Adenocarcinoma. Journal of Clinical Oncology, 25(3), 319–325.
https://doi.org/10.1200/JCO.2006.07.8824
Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., … Ng, A. Y. (n.d.).
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert
Comparison. Retrieved from www.aaai.org
Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., & Maier-Hein, K. H. (n.d.). Brain
Tumor Segmentation and Radiomics Survival Prediction: Contribution to the BRATS 2017
Challenge. Retrieved from https://arxiv.org/pdf/1802.10508.pdf
Itakura, H., Achrol, A. S., Mitchell, L. A., Loya, J. J., Liu, T., Westbroek, E. M., … Gevaert, O.
(2015). Magnetic resonance image features identify glioblastoma phenotypic subtypes with
distinct molecular pathway activities. Science Translational Medicine, 7(303), 303ra138-
303ra138. https://doi.org/10.1126/scitranslmed.aaa7582
Jiang, X.-T., Tao, H.-Q., & Zou, S.-C. (2004). Detection of serum tumor markers in the
diagnosis and treatment of patients with pancreatic cancer. Hepatobiliary and Pancreatic
Diseases International, 3(3), 464–468.
Junfeng, D., & Yunyang, Y. (2012). The Fast Medical Image Segmentation of Target Region
Based on Improved FM Algorithm. Procedia Engineering, 29, 48–52.
https://doi.org/10.1016/J.PROENG.2011.12.666
Kamisawa, T., Wood, L. D., Itoi, T., & Takaori, K. (2016). Pancreatic cancer. The Lancet,
388(10039), 73–85. https://doi.org/10.1016/S0140-6736(16)00141-0
Kattan, M. W., Hess, K. R., & Beck, J. R. (1998). Experiments to Determine Whether Recursive
Partitioning (CART) or an Artificial Neural Network Overcomes Theoretical Limitations of
Cox Proportional Hazards Regression. Computers and Biomedical Research, 31(5), 363–
373. https://doi.org/10.1006/CBMR.1998.1488
Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T., & Kluger, Y. (2016). DeepSurv:
Personalized Treatment Recommender System Using A Cox Proportional Hazards Deep
Neural Network. https://doi.org/10.1186/s12874-018-0482-1
Kaur, R., & Kaur, J. (2014). Current Methods in Medical Image Segmentation: A Review.
International Conference on Computer Communication and Systems - ICCCS 2014.
Keek, S. A., Leijenaar, R. T., Jochems, A., & Woodruff, H. C. (2018). A review on radiomics
and the future of theranostics for patient selection in precision medicine. The British
Journal of Radiology, 91(1091), 20170926. https://doi.org/10.1259/bjr.20170926
Khalvati, F., Zhang, Y., Baig, S., Lobo-Mueller, E. M., Karanicolas, P., Gallinger, S., & Haider,
M. A. (2019). Prognostic Value of CT Radiomic Features in Resectable Pancreatic Ductal
Adenocarcinoma. Scientific Reports, 9(1), 5449. https://doi.org/10.1038/s41598-019-41728-
7
Khalvati, F., Zhang, Y., Wong, A., & Haider, M. A. (2019). Radiomics. In Encyclopedia of
Biomedical Engineering (Vol. 2, pp. 597–603).
https://doi.org/10.1016/B978-0-12-801238-3.99964-1
Kickingereder, P., Neuberger, U., Bonekamp, D., Piechotta, P. L., Götz, M., Wick, A., …
Bendszus, M. (2018). Radiomic subtyping improves disease stratification beyond key
molecular, clinical, and standard imaging characteristics in patients with glioblastoma.
Neuro-Oncology, 20(6), 848–857. https://doi.org/10.1093/neuonc/nox188
Kim, E., Corte-Real, M., & Baloch, Z. (2016). A deep semantic mobile application for thyroid
cytopathology. In J. Zhang & T. S. Cook (Eds.), Proceedings of SPIE (Vol. 9789, p. 97890A).
International Society for Optics and Photonics. https://doi.org/10.1117/12.2216468
Kishikawa, T. (2015). Circulating RNAs as new biomarkers for detecting pancreatic cancer.
World Journal of Gastroenterology, 21(28), 8527. https://doi.org/10.3748/wjg.v21.i28.8527
Klawikowski, S., Christian, J., Schott, D., Zhang, M., & Li, X. (2016). Development of a CT-
Radiomics Based Early Response Prediction Model During Delivery of Chemoradiation
Therapy for Pancreatic Cancer. Medical Physics, 43(6), 3350–3350.
https://doi.org/10.1118/1.4955675
Kooi, T., Litjens, G., van Ginneken, B., Gubern-Mérida, A., Sánchez, C. I., Mann, R., …
Karssemeijer, N. (2017). Large scale deep learning for computer aided detection of
mammographic lesions. Medical Image Analysis, 35, 303–312. Retrieved from
https://linkinghub.elsevier.com/retrieve/pii/S1361841516301244
Koopmann, J. (2006). Serum Markers in Patients with Resectable Pancreatic Adenocarcinoma:
Macrophage Inhibitory Cytokine 1 versus CA19-9. Clinical Cancer Research, 12(2), 442–
446. https://doi.org/10.1158/1078-0432.CCR-05-0564
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep
Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25
(pp. 1097–1105). Retrieved from http://code.google.com/p/cuda-convnet/
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of
Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05
Kumar, D., Shafiee, M. J., Chung, A. G., Khalvati, F., Haider, M. A., & Wong, A. (2015).
Discovery Radiomics for Pathologically-Proven Computed Tomography Lung Cancer
Prediction. Retrieved from http://arxiv.org/abs/1509.00117
Kumar, V., Gu, Y., Basu, S., Berglund, A., Eschrich, S. A., Schabath, M. B., … Gillies, R. J.
(2013). Radiomics: The Process and the Challenges. Magnetic Resonance Imaging, 30(9),
1234–1248. https://doi.org/10.1016/j.mri.2012.06.010
Kung, J. T. Y., Colognori, D., & Lee, J. T. (2013). Long Noncoding RNAs: Past, Present, and
Future. Genetics, 193(3), 651–669. https://doi.org/10.1534/genetics.112.146704
Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of
Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11
Lagos-Quintana, M. (2001). Identification of Novel Genes Coding for Small Expressed RNAs.
Science, 294(5543), 853–858. https://doi.org/10.1126/science.1064921
Lai, Z., & Deng, H. (2018). Medical Image Classification Based on Deep Features Extracted by
Deep Model and Statistic Feature Fusion with Multilayer Perceptron. Computational
Intelligence and Neuroscience, 2018, 2061516. https://doi.org/10.1155/2018/2061516
Lakhani, P., & Sundaram, B. (2017). Deep Learning at Chest Radiography: Automated
Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks.
Radiology, 284(2), 574–582. https://doi.org/10.1148/radiol.2017162326
Lambin, P., Leijenaar, R. T. H., Deist, T. M., Peerlings, J., de Jong, E. E. C., van Timmeren,
J., … Walsh, S. (2017a). Radiomics: the bridge between medical imaging and personalized
medicine. Nature Reviews Clinical Oncology, 14(12), 749–762.
https://doi.org/10.1038/nrclinonc.2017.141
Lambin, P., Leijenaar, R. T. H., Deist, T. M., Peerlings, J., de Jong, E. E. C., van Timmeren,
J., … Walsh, S. (2017b). Radiomics: the bridge between medical imaging and personalized
medicine. Nature Reviews Clinical Oncology, 14(12), 749–762.
https://doi.org/10.1038/nrclinonc.2017.141
Lambin, P., Rios-Velazquez, E., Leijenaar, R., Carvalho, S., van Stiphout, R. G. P. M., Granton,
P., … Aerts, H. J. W. L. (2012). Radiomics: Extracting more information from medical
images using advanced feature analysis. European Journal of Cancer, 48(4), 441–446.
https://doi.org/10.1016/j.ejca.2011.11.036
Lao, J., Chen, Y., Li, Z.-C., Li, Q., Zhang, J., Liu, J., & Zhai, G. (2017). A Deep Learning-Based
Radiomics Model for Prediction of Survival in Glioblastoma Multiforme. Scientific Reports,
7(1), 10353. https://doi.org/10.1038/s41598-017-10649-8
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
https://doi.org/10.1038/nature14539
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel,
L. D. (1990). Handwritten Digit Recognition with a Back-Propagation Network. Retrieved
from https://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
https://doi.org/10.1109/5.726791
Leo, C. S., Lim, C. C. T., & Suneetha, V. (2009). An Automated Segmentation Algorithm for
Medical Images. In 13th International Conference on Biomedical Engineering (pp. 109–
111). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-
92841-6_27
Li, H., Zhu, Y., Burnside, E. S., Drukker, K., Hoadley, K. A., Fan, C., … Giger, M. L. (2016).
MR Imaging Radiomics Signatures for Predicting the Risk of Breast Cancer Recurrence as
Given by Research Versions of MammaPrint, Oncotype DX, and PAM50 Gene Assays.
Radiology, 281(2), 382–391. https://doi.org/10.1148/radiol.2016152110
Li, Yiming, Liu, X., Qian, Z., Sun, Z., Xu, K., Wang, K., … Jiang, T. (2018). Genotype
prediction of ATRX mutation in lower-grade gliomas using an MRI radiomics signature.
European Radiology, 28(7), 2960–2968. https://doi.org/10.1007/s00330-017-5267-0
Li, Yiming, Qian, Z., Xu, K., Wang, K., Fan, X., Li, S., … Wang, Y. (2018). MRI features
predict p53 status in lower-grade gliomas via a machine-learning approach. NeuroImage:
Clinical, 17, 306–311. https://doi.org/10.1016/j.nicl.2017.10.030
Li, Yuexiang, & Shen, L. (2018). Skin Lesion Analysis towards Melanoma Detection Using
Deep Learning Network. Sensors, 18(2), 556. https://doi.org/10.3390/s18020556
Liao, Q., Zhao, Y.-P., Yang, Y.-C., Li, L.-J., Long, X., & Han, S.-M. (2007). Combined
detection of serum tumor markers for differential diagnosis of solid lesions located at the
pancreatic head. Hepatobiliary and Pancreatic Diseases International, 6(6), 641–645.
Link, A., Becker, V., Goel, A., Wex, T., & Malfertheiner, P. (2012). Feasibility of Fecal
MicroRNAs as Novel Biomarkers for Pancreatic Cancer. PLoS ONE, 7(8), e42933.
https://doi.org/10.1371/journal.pone.0042933
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., … Sánchez,
C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis,
42, 60–88. https://doi.org/10.1016/j.media.2017.07.005
Liu, B., Wei, Y., Zhang, Y., Yang, Q., & Kong, H. (2017). Deep Neural Networks for High
Dimension, Low Sample Size Data. Retrieved from
https://www.ijcai.org/proceedings/2017/0318.pdf
Liu, D., Chang, C.-H., Gold, D. V., & Goldenberg, D. M. (2015). Identification of PAM4
(clivatuzumab)-reactive epitope on MUC5AC: A promising biomarker and therapeutic
target for pancreatic cancer. Oncotarget, 6(6). https://doi.org/10.18632/oncotarget.2760
Liu, H., Li, B., Lv, X., & Huang, Y. (2017). Image Retrieval Using Fused Deep Convolutional
Features. Procedia Computer Science, 107, 749–754.
https://doi.org/10.1016/j.procs.2017.03.159
Liu, Y., Balagurunathan, Y., Atwater, T., Antic, S., Li, Q., Walker, R. C., … Gillies, R. J.
(2017). Radiological Image Traits Predictive of Cancer Status in Pulmonary Nodules.
Clinical Cancer Research, 23(6), 1442–1449. https://doi.org/10.1158/1078-0432.CCR-15-
3102
Lo, S.-C. B., Lou, S.-L. A., Lin, J.-S., Freedman, M. T., Chien, M. V., & Mun, S. K.
(1995). Artificial convolution neural network techniques and applications for lung nodule
detection. IEEE Transactions on Medical Imaging, 14(4), 711–718.
https://doi.org/10.1109/42.476112
Loosen, S. H., Neumann, U. P., Trautwein, C., Roderburg, C., & Luedde, T. (2017). Current and
future biomarkers for pancreatic adenocarcinoma. Tumor Biology, 39(6),
101042831769223. https://doi.org/10.1177/1010428317692231
Louvet, C., Labianca, R., Hammel, P., Lledo, G., Zampino, M. G., André, T., … de Gramont, A.
(2005). Gemcitabine in Combination With Oxaliplatin Compared With Gemcitabine Alone
in Locally Advanced or Metastatic Pancreatic Cancer: Results of a GERCOR and GISCAD
Phase III Trial. Journal of Clinical Oncology, 23(15), 3509–3516.
https://doi.org/10.1200/JCO.2005.06.023
Luo, G., Jin, K., Guo, M., Cheng, H., Liu, Z., Xiao, Z., … Yu, X. (2017). Patients with normal-
range CA19-9 levels represent a distinct subgroup of pancreatic cancer patients. Oncology
Letters, 13(2), 881. https://doi.org/10.3892/OL.2016.5501
Luo, J., Xiao, L., Wu, C., Zheng, Y., & Zhao, N. (2013). The Incidence and Survival Rate of
Population-Based Pancreatic Cancer Patients: Shanghai Cancer Registry 2004-2009. PLoS
ONE, 8(10), e76052. https://doi.org/10.1371/journal.pone.0076052
Lynch, S. M., Vrieling, A., Lubin, J. H., Kraft, P., Mendelsohn, J. B., Hartge, P., … Stolzenberg-
Solomon, R. Z. (2009). Cigarette Smoking and Pancreatic Cancer: A Pooled Analysis From
the Pancreatic Cancer Cohort Consortium. American Journal of Epidemiology, 170(4), 403–
413. https://doi.org/10.1093/aje/kwp134
Maas, M., Nelemans, P. J., Valentini, V., Das, P., Rödel, C., Kuo, L.-J., … Beets, G. L. (2010).
Long-term outcome in patients with a pathological complete response after chemoradiation
for rectal cancer: a pooled analysis of individual patient data. The Lancet Oncology, 11(9),
835–844. https://doi.org/10.1016/S1470-2045(10)70172-8
Mangai, U., Samanta, S., Das, S., & Chowdhury, P. (2010). A Survey of Decision Fusion and
Feature Fusion Strategies for Pattern Classification. IETE Technical Review, 27(4), 293.
https://doi.org/10.4103/0256-4602.64604
Marechal, R., Mackey, J. R., Lai, R., Demetter, P., Peeters, M., Polus, M., … Van Laethem, J.-L.
(2009). Human Equilibrative Nucleoside Transporter 1 and Human Concentrative
Nucleoside Transporter 3 Predict Survival after Adjuvant Gemcitabine Therapy in Resected
Pancreatic Adenocarcinoma. Clinical Cancer Research, 15(8), 2913–2919.
https://doi.org/10.1158/1078-0432.CCR-08-2080
Mariani, L., Coradini, D., Biganzoli, E., Boracchi, P., Marubini, E., Pilotti, S., … Rilke, F.
(1997). Prognostic factors for metachronous contralateral breast cancer: a comparison of the
linear Cox regression model and its artificial neural network extension. Breast Cancer
Research and Treatment, 44(2), 167–178. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/9232275
Gamer, M., Lemon, J., & Fellows, I. (2015). Package "irr." Retrieved from
http://www.r-project.org
Mazurowski, M. A. (2015). Radiogenomics: What It Is and Why It Is Important. Journal of the
American College of Radiology, 12(8), 862–866. https://doi.org/10.1016/j.jacr.2015.04.019
McGuigan, A., Kelly, P., Turkington, R. C., Jones, C., Coleman, H. G., & McCain, R. S. (2018).
Pancreatic cancer: A review of clinical diagnosis, epidemiology, treatment and outcomes.
World Journal of Gastroenterology, 24(43), 4846–4861.
https://doi.org/10.3748/wjg.v24.i43.4846
Memba, R., Duggan, S. N., Ni Chonchubhair, H. M., Griffin, O. M., Bashir, Y., O’Connor, D.
B., … Conlon, K. C. (2017). The potential role of gut microbiota in pancreatic disease: A
systematic review. Pancreatology, 17(6), 867–874.
https://doi.org/10.1016/j.pan.2017.09.002
Men, K., Dai, J., & Li, Y. (2017). Automatic segmentation of the clinical target volume and
organs at risk in the planning CT for rectal cancer using deep dilated convolutional neural
networks. Medical Physics, 44(12), 6377–6389. https://doi.org/10.1002/mp.12602
Menegola, A., Fornaciali, M., Pires, R., Avila, S., & Valle, E. (2016). Towards Automated
Melanoma Screening: Exploring Transfer Learning Schemes. Retrieved from
https://lasagne.readthedocs.io/en/latest/
Meng, Y., Zhang, Y., Dong, D., Li, C., Liang, X., Zhang, C., … Zhang, H. (2018). Novel
radiomic signature as a prognostic biomarker for locally advanced rectal cancer. Journal of
Magnetic Resonance Imaging, 48(3), 605–614. https://doi.org/10.1002/jmri.25968
Midha, S., Chawla, S., & Garg, P. K. (2016). Modifiable and non-modifiable risk factors for
pancreatic cancer: A review. Cancer Letters, 381(1), 269–277.
https://doi.org/10.1016/j.canlet.2016.07.022
Morin, O. (2018). A Deep Look Into the Future of Quantitative Imaging in Oncology: A
Statement of Working Principles and Proposal for Change. International Journal of
Radiation Oncology*Biology*Physics, 102(4), 1074–1082.
https://doi.org/10.1016/J.IJROBP.2018.08.032
Mukaka, M. M. (2012). Statistics corner: A guide to appropriate use of correlation coefficient in
medical research. Malawi Medical Journal : The Journal of Medical Association of Malawi,
24(3), 69–71. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/23638278
Nguyen, K., Haytmyradov, M., Mostafavi, H., Patel, R., Surucu, M., Block, A., … Roeske, J. C.
(2018). Evaluation of Radiomics to Predict the Accuracy of Markerless Motion Tracking of
Lung Tumors: A Preliminary Study. Frontiers in Oncology, 8, 292.
https://doi.org/10.3389/fonc.2018.00292
Nie, K., Shi, L., Chen, Q., Hu, X., Jabbour, S. K., Yue, N., … Sun, X. (2016). Rectal Cancer:
Assessment of Neoadjuvant Chemoradiation Outcome based on Radiomics of
Multiparametric MRI. Clinical Cancer Research, 22(21), 5256–5264.
https://doi.org/10.1158/1078-0432.CCR-15-2997
Nikolov, S., Blackwell, S., Mendes, R., De Fauw, J., Meyer, C., Hughes, C., … Ronneberger, O.
(2018). Deep learning to achieve clinically applicable segmentation of head and neck
anatomy for radiotherapy. Retrieved from http://arxiv.org/abs/1809.04430
Oda, M., Shimizu, N., Oda, H., Hayashi, Y., Kitasaka, T., Fujiwara, M., … Roth, H. R. (2018).
Towards dense volumetric pancreas segmentation in CT using 3D fully convolutional
networks. In E. D. Angelini & B. A. Landman (Eds.), Medical Imaging 2018: Image
Processing (Vol. 10574, p. 10). SPIE. https://doi.org/10.1117/12.2293499
Oda, M., Shimizu, N., Roth, H. R., Karasawa, K., Kitasaka, T., Misawa, K., … Mori, K.
(2018). 3D FCN Feature Driven Regression Forest-Based Pancreas Localization and
Segmentation. Retrieved from https://arxiv.org/pdf/1806.03019.pdf
Oikonomou, A., Khalvati, F., et al. (2018). Radiomics analysis at PET/CT contributes to
prognosis of recurrence and survival in lung cancer treated with stereotactic body
radiotherapy. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-22357-y
Oken, M. M., Creech, R. H., Tormey, D. C., Horton, J., Davis, T. E., McFadden, E. T., &
Carbone, P. P. (1982). Toxicity and response criteria of the Eastern Cooperative Oncology
Group. American Journal of Clinical Oncology, 5(6), 649–655. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/7165009
Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., … Rueckert, D.
(n.d.). Attention U-Net: Learning Where to Look for the Pancreas. Retrieved from
https://arxiv.org/pdf/1804.03999.pdf
Owens, C. A., Peterson, C. B., Tang, C., Koay, E. J., Yu, W., Mackin, D. S., … Yang, J. (2018).
Lung tumor segmentation methods: Impact on the uncertainty of radiomics features for non-
small cell lung cancer. PloS One, 13(10), e0205003.
https://doi.org/10.1371/journal.pone.0205003
Pan, S. J., & Yang, Q. (2009). A Survey on Transfer Learning.
https://doi.org/10.1109/TKDE.2009.191
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22(10), 1345–1359. https://doi.org/10.1109/TKDE.2009.191
Papp, L., Pötsch, N., Grahovac, M., Schmidbauer, V., Woehrer, A., Preusser, M., …
Traub-Weidinger, T. (2018). Glioma Survival Prediction with Combined Analysis of In Vivo
11C-MET PET Features, Ex Vivo Features, and Patient Features by Supervised Machine
Learning. Journal of Nuclear Medicine, 59(6), 892–899.
https://doi.org/10.2967/jnumed.117.202267
Parekh, V., & Jacobs, M. A. (2016). Radiomics: a new application from established techniques.
Expert Review of Precision Medicine and Drug Development, 1(2), 207–226.
https://doi.org/10.1080/23808993.2016.1164013
Parmar, C., Grossmann, P., Bussink, J., Lambin, P., & Aerts, H. J. W. L. (2015). Machine
Learning methods for Quantitative Radiomic Biomarkers. Scientific Reports, 5, 13087.
https://doi.org/10.1038/srep13087
Parmar, C., Leijenaar, R. T. H., Grossmann, P., Rios Velazquez, E., Bussink, J., Rietveld, D., …
Aerts, H. J. W. L. (2015). Radiomic feature clusters and Prognostic Signatures specific for
Lung and Head & Neck cancer. Scientific Reports, 5(1), 1–10.
https://doi.org/10.1038/srep11044
Peixoto, R. D., Speers, C., McGahan, C. E., Renouf, D. J., Schaeffer, D. F., & Kennecke, H. F.
(2015). Prognostic factors and sites of metastasis in unresectable locally advanced
pancreatic cancer. Cancer Medicine, 4(8), 1171–1177. https://doi.org/10.1002/cam4.459
Pérez-Beteta, J., Molina-García, D., Ortiz-Alhambra, J. A., Fernández-Romero, A., Luque, B.,
Arregui, E., … Pérez-García, V. M. (2018). Tumor Surface Regularity at MR Imaging
Predicts Survival and Response to Surgery in Patients with Glioblastoma. Radiology,
288(1), 218–225. https://doi.org/10.1148/radiol.2018171051
Perkins, G. L., Slater, E. D., Sanders, G. K., & Prichard, J. G. (2003). Serum tumor markers.
American Family Physician, 68(6), 1075–1082.
Permuth-Wey, J., & Egan, K. M. (2009). Family history is a significant risk factor for pancreatic
cancer: Results from a systematic review and meta-analysis. Familial Cancer, 8(2), 109–
117. https://doi.org/10.1007/s10689-008-9214-8
Pernick, N. L., Sarkar, F. H., Philip, P. A., Arlauskas, P., Shields, A. F., Vaitkevicius, V. K., …
Adsay, N. V. (2003). Clinicopathologic Analysis of Pancreatic Adenocarcinoma in African
Americans and Caucasians. Pancreas, 26(1), 28–32. https://doi.org/10.1097/00006676-
200301000-00006
Pratt, H., Coenen, F., Broadbent, D. M., Harding, S. P., & Zheng, Y. (2016). Convolutional
Neural Networks for Diabetic Retinopathy. Procedia Computer Science, 90, 200–205.
https://doi.org/10.1016/J.PROCS.2016.07.014
Ravishankar, H., Sudhakar, P., Venkataramani, R., Thiruvenkadam, S., Annangi, P., Babu, N., &
Vaidya, V. (2016). Understanding the Mechanisms of Deep Transfer Learning for Medical
Images. https://doi.org/10.1007/978-3-319-46976-8_20
Razzak, M. I., Naz, S., & Zaib, A. (n.d.). Deep Learning for Medical Image Processing:
Overview, Challenges and Future. Retrieved from https://arxiv.org/pdf/1704.06825.pdf
Ren, J., Tian, J., Yuan, Y., Dong, D., Li, X., Shi, Y., & Tao, X. (2018). Magnetic resonance
imaging based radiomics signature for the preoperative discrimination of stage I-II and III-
IV head and neck squamous cell carcinoma. European Journal of Radiology, 106, 1–6.
https://doi.org/10.1016/j.ejrad.2018.07.002
Rohrmann, S., Linseisen, J., Vrieling, A., Boffetta, P., Stolzenberg-Solomon, R. Z., Lowenfels,
A. B., … Bueno-de-Mesquita, H. B. (2009). Ethanol intake and the risk of pancreatic cancer
in the European prospective investigation into cancer and nutrition (EPIC). Cancer Causes
& Control, 20(5), 785–794. https://doi.org/10.1007/s10552-008-9293-8
Rokach, L. (2005). Ensemble Methods for Classifiers. In Data Mining and Knowledge Discovery
Handbook (pp. 957–980). New York: Springer-Verlag. https://doi.org/10.1007/0-387-
25465-X_45
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical
Image Segmentation. https://doi.org/10.1007/978-3-319-24574-4_28
Rosenfeld, N., Aharonov, R., Meiri, E., Rosenwald, S., Spector, Y., Zepeniuk, M., … Barshack,
I. (2008). MicroRNAs accurately identify cancer tissue origin. Nature Biotechnology, 26(4),
462–469. https://doi.org/10.1038/nbt1392
Roth, H. R., Oda, H., Hayashi, Y., Oda, M., Shimizu, N., Fujiwara, M., … (2017). Hierarchical
3D fully convolutional networks for multi-organ segmentation. Retrieved from
http://lmb.informatik.uni-freiburg.de/resources/opensource/unet.en.html
Sadakari, Y., Ohtsuka, T., Ohuchida, K., Tsutsumi, K., Takahata, S., Nakamura, M., … Tanaka,
M. (2010). MicroRNA expression analyses in preoperative pancreatic juice samples of
pancreatic ductal adenocarcinoma. Journal of the Pancreas, 11(6), 587–592.
Sanduleanu, S., Woodruff, H. C., de Jong, E. E. C., van Timmeren, J. E., Jochems, A., Dubois,
L., & Lambin, P. (2018, June 1). Tracking tumor biology with radiomics: A systematic
review utilizing a radiomics quality score. Radiotherapy and Oncology. Elsevier.
https://doi.org/10.1016/j.radonc.2018.03.033
Sanghera, P., Wong, D. W. Y., McConkey, C. C., Geh, J. I., & Hartley, A. (2008).
Chemoradiotherapy for Rectal Cancer: An Updated Analysis of Factors Affecting
Pathological Response. Clinical Oncology, 20(2), 176–183.
https://doi.org/10.1016/j.clon.2007.11.013
Sargent, D. J. (2001). Comparison of artificial neural networks with other statistical approaches:
results from medical data sets. Cancer, 91(8 Suppl), 1636–1642. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/11309761
Satake, K., Kanazawa, G., Kho, I., Chung, Y.-S., & Umeyama, K. (1985). Evaluation of Serum
Pancreatic Enzymes, Carbohydrate Antigen 19-9, and Carcinoembryonic Antigen in
Various Pancreatic Diseases. The American Journal of Gastroenterology, 80(8), 630–636.
https://doi.org/10.1111/j.1572-0241.1985.tb02191.x
Schmid, M., Wright, M. N., & Ziegler, A. (2016). On the use of Harrell’s C for clinical risk
prediction via random survival forests.
Schmidhuber, J. (2014). Deep Learning in Neural Networks: An Overview. Retrieved from
http://www.idsia.ch/~juergen/DeepLearning8Oct2014.tex
Schultz, N. A., Werner, J., Willenbrock, H., Roslind, A., Giese, N., Horn, T., … Johansen, J. S.
(2012). MicroRNA expression profiles associated with pancreatic adenocarcinoma and
ampullary adenocarcinoma. Modern Pathology, 25(12), 1609–1622.
https://doi.org/10.1038/modpathol.2012.122
Sebastiani, V. (2006). Immunohistochemical and Genetic Evaluation of Deoxycytidine Kinase in
Pancreatic Cancer: Relationship to Molecular Mechanisms of Gemcitabine Resistance and
Survival. Clinical Cancer Research, 12(8), 2492–2497. https://doi.org/10.1158/1078-
0432.CCR-05-2655
Sedgwick, P. (2012). Pearson’s correlation coefficient. BMJ, 345(jul04 1), e4483–e4483.
https://doi.org/10.1136/bmj.e4483
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2016). Grad-
CAM: Visual Explanations from Deep Networks via Gradient-based Localization.
Retrieved from http://arxiv.org/abs/1610.02391
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-
CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017
IEEE International Conference on Computer Vision (ICCV) (pp. 618–626). IEEE.
https://doi.org/10.1109/ICCV.2017.74
Sharma, N., Ray, A., Shukla, K., Sharma, S., Pradhan, S., Srivastva, A., & Aggarwal, L. (2010).
Automated medical image segmentation techniques. Journal of Medical Physics, 35(1), 3.
https://doi.org/10.4103/0971-6203.58777
Siegel, R. L., Miller, K. D., & Jemal, A. (2015). Cancer statistics, 2015. CA: A Cancer Journal
for Clinicians, 65(1), 5–29. https://doi.org/10.3322/caac.21254
Siegel, R. L., Miller, K. D., Jemal, A., Rahib, L., Smith, B. D., Aizenberg, R., … Smith-Warner,
S. A. (2009). Trends in pancreatic adenocarcinoma incidence and mortality in the United
States in the last four decades: A SEER-based study. Cancer Epidemiology Biomarkers and
Prevention, 18(1), 742–746. https://doi.org/10.1097/00006676-200301000-00006
Silverman, D. T., Hoover, R. N., Brown, L. M., Swanson, G. M., Schiffman, M., Greenberg, R.
S., … Fraumeni, J. F. (2003). Why Do Black Americans Have a Higher Risk of Pancreatic
Cancer than White Americans? Epidemiology, 14(1), 45–54.
https://doi.org/10.1097/00001648-200301000-00013
Sørensen, J., Klee, M., Palshof, T., & Hansen, H. (1993). Performance status assessment in
cancer patients. An inter-observer variability study. British Journal of Cancer, 67(4), 773–
775. https://doi.org/10.1038/bjc.1993.140
Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia Medica, 24(1), 12–
18. https://doi.org/10.11613/BM.2014.003
Spratlin, J. (2004). The Absence of Human Equilibrative Nucleoside Transporter 1 Is Associated
with Reduced Survival in Patients With Gemcitabine-Treated Pancreas Adenocarcinoma.
Clinical Cancer Research, 10(20), 6956–6961. https://doi.org/10.1158/1078-0432.CCR-04-
0224
Stark, A. P., Sacks, G. D., Rochefort, M. M., Donahue, T. R., Reber, H. A., Tomlinson, J. S., …
Hines, O. J. (2016). Long-term survival in patients with pancreatic ductal adenocarcinoma.
Surgery, 159(6), 1520–1527. https://doi.org/10.1016/j.surg.2015.12.024
Steinberg, W. (1990). The clinical utility of the CA 19-9 tumor-associated antigen. American
Journal of Gastroenterology, 85(4), 350–355.
Stevens, R. J., Roddam, A. W., & Beral, V. (2007). Pancreatic cancer in type 1 and young-onset
diabetes: Systematic review and meta-analysis. British Journal of Cancer, 96(3), 507–509.
https://doi.org/10.1038/sj.bjc.6603571
Sun, Q.-S., Zeng, S.-G., Liu, Y., Heng, P.-A., & Xia, D.-S. (2005). A new method of feature
fusion and its application in image recognition. Pattern Recognition, 38(12), 2437–2448.
https://doi.org/10.1016/J.PATCOG.2004.12.013
Szafranska, A. E., Davison, T. S., John, J., Cannon, T., Sipos, B., Maghnouj, A., … Hahn, S. A.
(2007). MicroRNA expression alterations are linked to tumorigenesis and non-neoplastic
processes in pancreatic ductal adenocarcinoma. Oncogene, 26(30), 4442–4452.
https://doi.org/10.1038/sj.onc.1210228
Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T., Kendall, C. B., Gotway, M. B., & Liang,
J. (2016). Convolutional Neural Networks for Medical Image Analysis: Full Training or
Fine Tuning? IEEE Transactions on Medical Imaging, 35(5), 1299–1312.
https://doi.org/10.1109/TMI.2016.2535302
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., & Liu, C. (2018). A Survey on Deep Transfer
Learning. Retrieved from http://arxiv.org/abs/1808.01974
Therneau, T. M. (2018). Package “survival.” Retrieved from
https://github.com/therneau/survival
Thomaz, R. L., Carneiro, P. C., & Patrocinio, A. C. (2017). Feature extraction using
convolutional neural network for classifying breast density in mammographic images. In S.
G. Armato & N. A. Petrick (Eds.) (Vol. 10134, p. 101342M). International Society for
Optics and Photonics. https://doi.org/10.1117/12.2254633
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in
Medicine, 16(4), 385–395. https://doi.org/10.1002/(SICI)1097-
0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature
ranking and solutions. Bioinformatics, 27(14), 1986–1994.
https://doi.org/10.1093/bioinformatics/btr300
Torrey, L., & Shavlik, J. (n.d.). Transfer Learning. Retrieved from
http://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf
Traverso, A., Wee, L., Dekker, A., & Gillies, R. (2018). Repeatability and Reproducibility of
Radiomic Features: A Systematic Review. International Journal of Radiation
Oncology*Biology*Physics, 102(4), 1143–1158.
https://doi.org/10.1016/J.IJROBP.2018.05.053
Urruticoechea, A., Alemany, R., Balart, J., Villanueva, A., Viñals, F., & Capellá, G. (2010).
Recent advances in cancer therapy: an overview. Current Pharmaceutical Design, 16(1), 3–
10. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/20214614
van Griethuysen, J. J. M., Fedorov, A., Parmar, C., Hosny, A., Aucoin, N., Narayan, V., …
Aerts, H. J. W. L. (2017). Computational Radiomics System to Decode the Radiographic
Phenotype. Cancer Research, 77(21), e104–e107. https://doi.org/10.1158/0008-5472.CAN-
17-0339
van Rossum, P. S. N., Fried, D. V., Zhang, L., Hofstetter, W. L., van Vulpen, M., Meijer, G.
J., … Lin, S. H. (2016). The Incremental Value of Subjective and Quantitative Assessment
of 18F-FDG PET for the Prediction of Pathologic Complete Response to Preoperative
Chemoradiotherapy in Esophageal Cancer. Journal of Nuclear Medicine, 57(5), 691–700.
https://doi.org/10.2967/jnumed.115.163766
Von Rosen, A., Linder, S., Harmenberg, U., & Pegert, S. (1993). Serum levels of CA 19-9 and
CA 50 in relation to lewis blood cell status in patients with malignant and benign pancreatic
disease. Pancreas, 8(2), 160–165.
Waddell, N., Pajic, M., Patch, A.-M., Chang, D. K., Kassahn, K. S., Bailey, P., … Grimmond, S.
M. (2015). Whole genomes redefine the mutational landscape of pancreatic cancer. Nature,
518(7540), 495–501. https://doi.org/10.1038/nature14169
Wahi, M. M., Shah, N., Schrock, C. E., Rosemurgy, A. S., & Goldin, S. B. (2009). Reproductive
Factors and Risk of Pancreatic Cancer in Women: A Review of the Literature. Annals of
Epidemiology, 19(2), 103–111. https://doi.org/10.1016/j.annepidem.2008.11.003
Wang, C.-S., Lin, K.-H., Chen, S.-L., Chan, Y.-F., & Hsueh, S. (2004). Overexpression of
SPARC gene in human gastric carcinoma and its clinic–pathologic significance. British
Journal of Cancer, 91(11), 1924–1930. https://doi.org/10.1038/sj.bjc.6602213
Wang, H., Guo, X.-H., Jia, Z.-W., Li, H.-K., Liang, Z.-G., Li, K.-C., & He, Q. (2010). Multilevel
binomial logistic prediction model for malignant pulmonary nodules based on texture
features of CT image. European Journal of Radiology, 74(1), 124–129.
https://doi.org/10.1016/j.ejrad.2009.01.024
Wang, X., Li, Y., Tian, H., Qi, J., Li, M., Fu, C., … Zhang, W. (2014). Macrophage inhibitory
cytokine 1 (MIC-1/GDF15) as a novel diagnostic serum biomarker in pancreatic ductal
adenocarcinoma. BMC Cancer, 14(1), 578. https://doi.org/10.1186/1471-2407-14-578
Watkins, G., Douglas-Jones, A., Bryce, R., E Mansel, R., & Jiang, W. G. (2005). Increased
levels of SPARC (osteonectin) in human breast cancer tissues and its association with
clinical outcomes. Prostaglandins, Leukotrienes and Essential Fatty Acids, 72(4), 267–272.
https://doi.org/10.1016/j.plefa.2004.12.003
Wolpin, B. M., Chan, A. T., Hartge, P., Chanock, S. J., Kraft, P., Hunter, D. J., … Fuchs, C. S.
(2009). ABO Blood Group and the Risk of Pancreatic Cancer. JNCI Journal of the National
Cancer Institute, 101(6), 424–431. https://doi.org/10.1093/jnci/djp020
Wolpin, Brian M., Kraft, P., Gross, M., Helzlsouer, K., Bueno-de-Mesquita, H. B., Steplowski,
E., … Fuchs, C. S. (2010). Pancreatic Cancer Risk and ABO Blood Group Alleles: Results
from the Pancreatic Cancer Cohort Consortium. Cancer Research, 70(3), 1015–1023.
https://doi.org/10.1158/0008-5472.CAN-09-2993
Wong, M. C. S., Jiang, J. Y., Liang, M., Fang, Y., Yeung, M. S., & Sung, J. J. Y. (2017). Global
temporal patterns of pancreatic cancer and association with socioeconomic development.
Scientific Reports, 7(1), 3165. https://doi.org/10.1038/s41598-017-02997-2
Wood, H. E., Gupta, S., Kang, J. Y., Quinn, M. J., Maxwell, J. D., Mudan, S., & Majeed, A.
(2006). Pancreatic cancer in England and Wales 1975–2000: patterns and trends in
incidence, survival and mortality. Alimentary Pharmacology and Therapeutics,
23(8), 1205–1214. https://doi.org/10.1111/j.1365-2036.2006.02860.x
Wu, J., Aguilera, T., Shultz, D., Gudur, M., Rubin, D. L., Loo, B. W., … Li, R. (2016). Early-
Stage Non–Small Cell Lung Cancer: Quantitative Imaging Characteristics of 18F
Fluorodeoxyglucose PET/CT Allow Prediction of Distant Metastasis. Radiology, 281(1),
270–278. https://doi.org/10.1148/radiol.2016151829
Wu, X., Lu, X. H., Xu, T., Qian, J. M., Zhao, P., Guo, X. Z., … Jiang, W. J. (2006).
Evaluation of the diagnostic value of serum tumor markers, and fecal k-ras and p53 gene
mutations for pancreatic cancer. Chinese Journal of Digestive Diseases, 7(3), 170–174.
https://doi.org/10.1111/j.1443-9573.2006.00263.x
Xi, Y., Guo, F., Xu, Z., Li, C., Wei, W., Tian, P., … Yin, H. (2018). Radiomics signature: A
potential biomarker for the prediction of MGMT promoter methylation in glioblastoma.
Journal of Magnetic Resonance Imaging, 47(5), 1380–1387.
https://doi.org/10.1002/jmri.25860
Xiang, A., Lapuerta, P., Ryutov, A., Buckley, J., & Azen, S. (2000). Comparison of the
performance of neural network methods and Cox regression for censored survival data.
Computational Statistics & Data Analysis, 34(2), 243–257. https://doi.org/10.1016/S0167-
9473(99)00098-5
Yamada, R., Mizuno, S., Uchida, K., Yoneda, M., Kanayama, K., Inoue, H., … Isaji, S. (2016).
Human Equilibrative Nucleoside Transporter 1 Expression in Endoscopic Ultrasonography-
Guided Fine-Needle Aspiration Biopsy Samples Is a Strong Predictor of Clinical Response
and Survival in the Patients With Pancreatic Ductal Adenocarcinoma Undergoing
Gemcitabine-Based Chemoradiotherapy. Pancreas, 45(5), 761–771.
https://doi.org/10.1097/MPA.0000000000000597
Yamashita, K., Upadhay, S., Mimori, K., Inoue, H., & Mori, M. (2003). Clinical significance of
secreted protein acidic and rich in cystein in esophageal carcinoma and its relation to
carcinoma progression. Cancer, 97(10), 2412–2419. https://doi.org/10.1002/cncr.11368
Yamashita, R., Nishio, M., Do, R. K. G., & Togashi, K. (2018). Convolutional neural networks:
an overview and application in radiology. Insights into Imaging, 9(4), 611–629.
https://doi.org/10.1007/s13244-018-0639-9
Yang, J.-Y., Sun, Y.-W., Liu, D.-J., Zhang, J.-F., Li, J., & Hua, R. (2014). MicroRNAs in stool
samples as potential screening biomarkers for pancreatic ductal adenocarcinoma cancer.
American Journal of Cancer Research, 4(6), 663–673.
Yang, L., Dong, D., Fang, M., Zhu, Y., Zang, Y., Liu, Z., … Tian, J. (2018). Can CT-based
radiomics signature predict KRAS/NRAS/BRAF mutations in colorectal cancer? European
Radiology, 28(5), 2058–2067. https://doi.org/10.1007/s00330-017-5146-8
Yang, P., Yang, Y. H., Zhou, B. B., & Zomaya, A. Y. (n.d.). A review of ensemble methods in
bioinformatics: Including stability of feature selection and ensemble feature selection
methods (updated on 28 Sep. 2016). Retrieved from
http://www.maths.usyd.edu.au/u/pengyi/publication/EnsembleBioinformatics-v6.pdf
Yasaka, K., Akai, H., Abe, O., & Kiryu, S. (2018). Deep Learning with Convolutional Neural
Network for Differentiation of Liver Masses at Dynamic Contrast-enhanced CT: A
Preliminary Study. Radiology, 286(3), 887–896. https://doi.org/10.1148/radiol.2017170706
Yip, S. S. F., & Aerts, H. J. W. L. (2016). Applications and limitations of radiomics. Physics in
Medicine and Biology, 61(13), R150-66. https://doi.org/10.1088/0031-9155/61/13/R150
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep
neural networks? Retrieved from http://arxiv.org/abs/1411.1792
Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks.
https://doi.org/10.1007/978-3-319-10590-1_53
Zhang, B., Tian, J., Dong, D., Gu, D., Dong, Y., Zhang, L., … Zhang, S. (2017a). Radiomics
Features of Multiparametric MRI as Novel Prognostic Factors in Advanced Nasopharyngeal
Carcinoma. Clinical Cancer Research, 23(15), 4259–4269. https://doi.org/10.1158/1078-
0432.CCR-16-2910
Zhang, B., Tian, J., Dong, D., Gu, D., Dong, Y., Zhang, L., … Zhang, S. (2017b). Radiomics
Features of Multiparametric MRI as Novel Prognostic Factors in Advanced Nasopharyngeal
Carcinoma. Clinical Cancer Research, 23(15), 4259–4269. https://doi.org/10.1158/1078-
0432.CCR-16-2910
Zhang, Junjie, Baig, S., Wong, A., Haider, M. A., & Khalvati, F. (2016). A Local ROI-specific
Atlas-based Segmentation of Prostate Gland and Transitional Zone in Diffusion MRI.
Journal of Computational Vision and Imaging Systems.
Zhang, Y, Yang, J., Li, H., Wu, Y., Zhang, H., & Chen, W. (2015). Tumor markers CA19-9,
CA242 and CEA in the diagnosis of pancreatic cancer: A meta-analysis. International
Journal of Clinical and Experimental Medicine, 8(7), 11683–11691.
Zhang, Yucheng, Oikonomou, A., Wong, A., Haider, M. A., & Khalvati, F. (2017). Radiomics-
based Prognosis Analysis for Non-Small Cell Lung Cancer. Scientific Reports,
7, 46349. https://doi.org/10.1038/srep46349
Zhao, B., Tan, Y., Tsai, W.-Y., Qi, J., Xie, C., Lu, L., & Schwartz, L. H. (2016). Reproducibility
of radiomics for deciphering tumor phenotype with imaging. Scientific Reports, 6, 23428.
https://doi.org/10.1038/srep23428
Zhao, X., Wu, Y., Song, G., Li, Z., Zhang, Y., & Fan, Y. (2018). A deep learning model
integrating FCNNs and CRFs for brain tumor segmentation. Medical Image Analysis, 43,
98–111. https://doi.org/10.1016/j.media.2017.10.002
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning Deep Features
for Discriminative Localization. Retrieved from http://cnnlocalization.csail.mit.edu
Zhou, H., Dong, D., Chen, B., Fang, M., Cheng, Y., Gan, Y., … Tian, J. (2018). Diagnosis of
Distant Metastasis of Lung Cancer: Based on Clinical and Radiomic Features. Translational
Oncology, 11(1), 31–36. https://doi.org/10.1016/j.tranon.2017.10.010
Zhou, Z., Chen, L., Sher, D., Zhang, Q., Shah, J., Pham, N.-L., … Wang, J. (2018). Predicting
Lymph Node Metastasis in Head and Neck Cancer by Combining Many-objective
Radiomics and 3-dimensional Convolutional Neural Network through Evidential Reasoning.
Retrieved from http://arxiv.org/abs/1805.07021
Zwanenburg, A., Leger, S., Vallières, M., & Löck, S. (2016). Image biomarker standardisation
initiative. Retrieved from http://arxiv.org/abs/1612.07003
Appendix
Table A1: List of significant PyRadiomics features for PDAC prognosis
Feature HR 95% CI p value
wavelet.LHL_glcm_Contrast 1.607 1.214 ~ 2.128 0.001
wavelet.HLH_glszm_HighGrayLevelZoneEmphasis 0.619 0.465 ~ 0.826 0.001
wavelet.HLH_glszm_LowGrayLevelZoneEmphasis 1.614 1.211 ~ 2.152 0.001
wavelet.LHL_glcm_DifferenceVariance 1.506 1.176 ~ 1.929 0.001
wavelet.LHH_firstorder_Variance 1.55 1.179 ~ 2.038 0.002
gradient_gldm_SmallDependenceEmphasis 1.536 1.172 ~ 2.012 0.002
wavelet.LHL_glcm_SumSquares 1.505 1.161 ~ 1.951 0.002
gradient_glszm_ZonePercentage 1.482 1.153 ~ 1.905 0.002
wavelet.LHH_firstorder_Minimum 0.669 0.517 ~ 0.865 0.002
wavelet.LHL_firstorder_Variance 1.493 1.154 ~ 1.932 0.002
wavelet.LHL_gldm_GrayLevelVariance 1.492 1.153 ~ 1.932 0.002
wavelet.LHH_firstorder_RootMeanSquared 1.563 1.168 ~ 2.091 0.003
wavelet.LHL_glszm_SizeZoneNonUniformityNormalized 1.598 1.171 ~ 2.181 0.003
wavelet.LHH_firstorder_MeanAbsoluteDeviation 1.56 1.161 ~ 2.097 0.003
wavelet.LHL_firstorder_MeanAbsoluteDeviation 1.563 1.16 ~ 2.106 0.003
wavelet.LHL_glrlm_GrayLevelVariance 1.456 1.132 ~ 1.873 0.003
wavelet.LHL_firstorder_RootMeanSquared 1.53 1.148 ~ 2.039 0.004
wavelet.LHH_firstorder_90Percentile 1.515 1.142 ~ 2.008 0.004
gradient_glcm_DifferenceAverage 1.502 1.138 ~ 1.981 0.004
wavelet.LHL_firstorder_90Percentile 1.527 1.144 ~ 2.037 0.004
wavelet.LHL_glcm_DifferenceAverage 1.523 1.137 ~ 2.038 0.005
wavelet.LHL_glszm_SmallAreaEmphasis 1.587 1.149 ~ 2.192 0.005
gradient_ngtdm_Contrast 1.413 1.11 ~ 1.8 0.005
wavelet.LHL_gldm_SmallDependenceEmphasis 1.527 1.135 ~ 2.054 0.005
gradient_gldm_SmallDependenceLowGrayLevelEmphasis 1.447 1.114 ~ 1.879 0.006
wavelet.LLL_glcm_DifferenceAverage 1.495 1.122 ~ 1.992 0.006
wavelet.LHL_ngtdm_Complexity 1.461 1.115 ~ 1.915 0.006
gradient_glcm_JointAverage 1.499 1.123 ~ 2.003 0.006
gradient_glcm_SumAverage 1.499 1.123 ~ 2.003 0.006
wavelet.HLH_glszm_SmallAreaHighGrayLevelEmphasis 0.639 0.464 ~ 0.88 0.006
wavelet.LHL_glcm_ClusterTendency 1.411 1.101 ~ 1.809 0.007
gradient_firstorder_Mean 1.484 1.116 ~ 1.975 0.007
wavelet.LHL_firstorder_RobustMeanAbsoluteDeviation 1.484 1.115 ~ 1.975 0.007
wavelet.LHL_firstorder_10Percentile 0.672 0.504 ~ 0.897 0.007
wavelet.LHH_firstorder_RobustMeanAbsoluteDeviation 1.476 1.111 ~ 1.961 0.007
wavelet.LHL_glcm_DifferenceEntropy 1.514 1.119 ~ 2.048 0.007
wavelet.LHL_glszm_ZonePercentage 1.5 1.116 ~ 2.017 0.007
gradient_glcm_DifferenceEntropy 1.526 1.12 ~ 2.078 0.007
wavelet.HLH_firstorder_MeanAbsoluteDeviation 1.444 1.104 ~ 1.889 0.007
original_glcm_Contrast 1.46 1.106 ~ 1.926 0.007
gradient_glrlm_RunLengthNonUniformityNormalized 1.504 1.114 ~ 2.031 0.008
wavelet.LHL_firstorder_Entropy 1.515 1.116 ~ 2.058 0.008
wavelet.LHL_glcm_SumEntropy 1.51 1.114 ~ 2.047 0.008
original_glcm_DifferenceAverage 1.465 1.104 ~ 1.946 0.008
wavelet.LHL_firstorder_InterquartileRange 1.459 1.101 ~ 1.934 0.009
wavelet.HLL_firstorder_MeanAbsoluteDeviation 1.432 1.094 ~ 1.876 0.009
wavelet.LHH_firstorder_InterquartileRange 1.453 1.097 ~ 1.926 0.009
wavelet.LHL_glcm_JointEntropy 1.488 1.101 ~ 2.011 0.01
wavelet.HHL_glcm_Contrast 1.405 1.086 ~ 1.817 0.01
gradient_glrlm_ShortRunEmphasis 1.533 1.108 ~ 2.121 0.01
wavelet.LHL_glszm_GrayLevelNonUniformityNormalized 0.658 0.478 ~ 0.905 0.01
wavelet.HHL_glcm_SumSquares 1.39 1.08 ~ 1.79 0.011
gradient_glcm_Idm 0.682 0.508 ~ 0.915 0.011
squareroot_glcm_Contrast 1.474 1.094 ~ 1.988 0.011
original_glcm_SumSquares 1.445 1.088 ~ 1.919 0.011
original_glcm_DifferenceEntropy 1.458 1.09 ~ 1.949 0.011
wavelet.HHL_firstorder_RootMeanSquared 1.42 1.082 ~ 1.862 0.011
gradient_glcm_Id 0.683 0.508 ~ 0.918 0.011
wavelet.LHL_glrlm_RunEntropy 1.475 1.091 ~ 1.994 0.011
wavelet.HHL_firstorder_Variance 1.367 1.072 ~ 1.743 0.012
original_glcm_Idm 0.689 0.516 ~ 0.921 0.012
wavelet.LHL_glcm_Id 0.68 0.504 ~ 0.919 0.012
wavelet.HLL_glcm_Imc2 1.433 1.081 ~ 1.898 0.012
wavelet.HHL_glcm_ClusterTendency 1.37 1.071 ~ 1.752 0.012
wavelet.HHL_glcm_DifferenceVariance 1.348 1.067 ~ 1.703 0.012
original_glcm_Id 0.69 0.516 ~ 0.924 0.013
wavelet.HHL_gldm_GrayLevelVariance 1.364 1.069 ~ 1.741 0.013
wavelet.HLL_glcm_DifferenceAverage 1.43 1.079 ~ 1.895 0.013
wavelet.LHH_firstorder_10Percentile 0.7 0.528 ~ 0.927 0.013
wavelet.LHL_glcm_Idm 0.684 0.507 ~ 0.923 0.013
squareroot_glcm_SumSquares 1.459 1.082 ~ 1.966 0.013
original_glcm_JointEntropy 1.459 1.082 ~ 1.967 0.013
original_glrlm_RunLengthNonUniformityNormalized 1.447 1.08 ~ 1.938 0.013
gradient_glszm_SmallAreaEmphasis 1.408 1.072 ~ 1.849 0.014
wavelet.HHL_gldm_SmallDependenceEmphasis 1.425 1.075 ~ 1.89 0.014
wavelet.HHL_glrlm_GrayLevelVariance 1.345 1.061 ~ 1.706 0.014
gradient_glrlm_RunPercentage 1.448 1.077 ~ 1.946 0.014
wavelet.LHL_glrlm_RunLengthNonUniformityNormalized 1.468 1.08 ~ 1.997 0.014
gradient_firstorder_Entropy 1.458 1.078 ~ 1.973 0.014
gradient_firstorder_90Percentile 1.429 1.073 ~ 1.902 0.014
wavelet.LLL_gldm_DependenceNonUniformityNormalized 1.455 1.077 ~ 1.967 0.015
wavelet.LHH_firstorder_Range 1.406 1.069 ~ 1.848 0.015
gradient_glcm_JointEntropy 1.445 1.075 ~ 1.943 0.015
original_firstorder_MeanAbsoluteDeviation 1.469 1.078 ~ 2.001 0.015
wavelet.HHL_firstorder_MeanAbsoluteDeviation 1.398 1.067 ~ 1.831 0.015
wavelet.HHH_firstorder_RootMeanSquared 1.388 1.065 ~ 1.808 0.015
gradient_firstorder_MeanAbsoluteDeviation 1.349 1.06 ~ 1.718 0.015
wavelet.HHH_firstorder_MeanAbsoluteDeviation 1.398 1.066 ~ 1.832 0.015
wavelet.LHL_glrlm_GrayLevelNonUniformityNormalized 0.663 0.476 ~ 0.925 0.015
original_glrlm_ShortRunEmphasis 1.459 1.075 ~ 1.98 0.015
wavelet.LHL_gldm_DependenceEntropy 1.485 1.078 ~ 2.045 0.016
original_glcm_DifferenceVariance 1.372 1.062 ~ 1.773 0.016
wavelet.HHL_glszm_ZonePercentage 1.411 1.067 ~ 1.865 0.016
gradient_glcm_Autocorrelation 1.321 1.053 ~ 1.657 0.016
gradient_gldm_LowGrayLevelEmphasis 0.701 0.524 ~ 0.937 0.016
original_gldm_GrayLevelVariance 1.395 1.063 ~ 1.83 0.016
squareroot_glcm_DifferenceAverage 1.442 1.07 ~ 1.945 0.016
wavelet.HHH_firstorder_10Percentile 0.723 0.554 ~ 0.942 0.016
wavelet.LLL_gldm_SmallDependenceEmphasis 1.405 1.064 ~ 1.854 0.016
wavelet.HLL_firstorder_90Percentile 1.427 1.066 ~ 1.908 0.017
wavelet.LLL_glrlm_RunLengthNonUniformityNormalized 1.445 1.069 ~ 1.953 0.017
wavelet.LHL_firstorder_Uniformity 0.677 0.492 ~ 0.932 0.017
wavelet.LHL_glcm_Imc2 1.414 1.064 ~ 1.879 0.017
squareroot_ngtdm_Complexity 1.405 1.062 ~ 1.857 0.017
wavelet.LHL_glrlm_RunPercentage 1.461 1.069 ~ 1.996 0.017
original_glszm_ZonePercentage 1.405 1.062 ~ 1.86 0.017
original_firstorder_Variance 1.39 1.059 ~ 1.822 0.017
wavelet.HHL_glcm_DifferenceAverage 1.382 1.058 ~ 1.805 0.017
gradient_gldm_LargeDependenceEmphasis 0.697 0.518 ~ 0.939 0.018
wavelet.HLH_firstorder_RootMeanSquared 1.306 1.047 ~ 1.628 0.018
gradient_gldm_LargeDependenceLowGrayLevelEmphasis 0.699 0.52 ~ 0.94 0.018
wavelet.LLL_glcm_Id 0.697 0.516 ~ 0.94 0.018
gradient_glcm_SumEntropy 1.439 1.064 ~ 1.945 0.018
wavelet.HLH_firstorder_90Percentile 1.419 1.061 ~ 1.896 0.018
wavelet.LLL_glszm_ZonePercentage 1.415 1.061 ~ 1.887 0.018
squareroot_glcm_ClusterProminence 1.379 1.056 ~ 1.8 0.018
wavelet.HLL_firstorder_RootMeanSquared 1.31 1.047 ~ 1.64 0.018
wavelet.LHL_glrlm_ShortRunEmphasis 1.488 1.07 ~ 2.071 0.018
original_glrlm_RunPercentage 1.427 1.062 ~ 1.917 0.018
squareroot_glcm_DifferenceVariance 1.427 1.061 ~ 1.917 0.018
wavelet.LLL_glcm_DifferenceEntropy 1.432 1.062 ~ 1.932 0.019
gradient_firstorder_Median 1.411 1.059 ~ 1.88 0.019
wavelet.HHL_glrlm_RunVariance 0.692 0.509 ~ 0.941 0.019
wavelet.LLL_glcm_Idm 0.7 0.519 ~ 0.943 0.019
original_gldm_SmallDependenceEmphasis 1.395 1.056 ~ 1.844 0.019
wavelet.HHL_firstorder_RobustMeanAbsoluteDeviation 1.368 1.052 ~ 1.778 0.019
squareroot_gldm_SmallDependenceHighGrayLevelEmphasis 1.448 1.062 ~ 1.975 0.019
original_firstorder_Entropy 1.444 1.061 ~ 1.964 0.019
squareroot_gldm_GrayLevelVariance 1.428 1.059 ~ 1.927 0.02
squareroot_firstorder_Variance 1.425 1.058 ~ 1.92 0.02
wavelet.HHL_glcm_Id 0.718 0.544 ~ 0.949 0.02
wavelet.HHL_glcm_Idm 0.719 0.544 ~ 0.949 0.02
squareroot_glszm_SizeZoneNonUniformityNormalized 1.44 1.059 ~ 1.957 0.02
wavelet.HHH_firstorder_Variance 1.318 1.044 ~ 1.665 0.02
wavelet.HHL_glrlm_RunPercentage 1.404 1.054 ~ 1.872 0.02
wavelet.HLH_firstorder_RobustMeanAbsoluteDeviation 1.399 1.053 ~ 1.859 0.021
wavelet.LLL_glrlm_ShortRunEmphasis 1.437 1.057 ~ 1.954 0.021
original_glszm_LargeAreaLowGrayLevelEmphasis 0.612 0.404 ~ 0.928 0.021
wavelet.HHH_firstorder_90Percentile 1.367 1.049 ~ 1.781 0.021
wavelet.LHL_ngtdm_Contrast 1.358 1.047 ~ 1.762 0.021
wavelet.HHL_firstorder_InterquartileRange 1.357 1.047 ~ 1.761 0.021
wavelet.HHL_glcm_SumEntropy 1.391 1.05 ~ 1.844 0.022
wavelet.HHL_firstorder_90Percentile 1.362 1.046 ~ 1.773 0.022
wavelet.HHL_glcm_DifferenceEntropy 1.392 1.049 ~ 1.847 0.022
gradient_firstorder_10Percentile 1.392 1.048 ~ 1.848 0.022
original_gldm_LargeDependenceEmphasis 0.704 0.52 ~ 0.951 0.022
wavelet.LHL_glcm_InverseVariance 0.715 0.536 ~ 0.954 0.022
gradient_glszm_LargeAreaLowGrayLevelEmphasis 0.635 0.43 ~ 0.938 0.023
wavelet.LLL_glszm_SizeZoneNonUniformityNormalized 1.364 1.045 ~ 1.78 0.023
wavelet.LLL_glszm_SmallAreaEmphasis 1.387 1.047 ~ 1.837 0.023
wavelet.HLH_firstorder_InterquartileRange 1.391 1.047 ~ 1.848 0.023
wavelet.HHL_firstorder_10Percentile 0.734 0.562 ~ 0.958 0.023
squareroot_glszm_SmallAreaEmphasis 1.452 1.053 ~ 2.001 0.023
gradient_glcm_MaximumProbability 0.715 0.536 ~ 0.954 0.023
wavelet.HHH_firstorder_RobustMeanAbsoluteDeviation 1.356 1.043 ~ 1.764 0.023
wavelet.HHL_glcm_MaximumProbability 0.723 0.547 ~ 0.956 0.023
wavelet.HHL_glcm_JointEntropy 1.385 1.046 ~ 1.835 0.023
gradient_glszm_LargeAreaEmphasis 0.638 0.432 ~ 0.94 0.023
squareroot_glrlm_GrayLevelVariance 1.411 1.048 ~ 1.9 0.023
wavelet.HHL_gldm_LargeDependenceEmphasis 0.716 0.536 ~ 0.956 0.023
wavelet.LHL_gldm_DependenceNonUniformityNormalized 1.37 1.043 ~ 1.8 0.024
gradient_glcm_InverseVariance 1.408 1.047 ~ 1.893 0.024
gradient_glszm_ZoneVariance 0.642 0.438 ~ 0.943 0.024
wavelet.HHL_glrlm_LongRunEmphasis 0.705 0.52 ~ 0.954 0.024
wavelet.HHL_glrlm_RunLengthNonUniformityNormalized 1.379 1.044 ~ 1.823 0.024
wavelet.HHL_firstorder_Entropy 1.384 1.044 ~ 1.835 0.024
wavelet.HHL_firstorder_Uniformity 0.72 0.542 ~ 0.957 0.024
gradient_glszm_SizeZoneNonUniformityNormalized 1.346 1.04 ~ 1.741 0.024
squareroot_gldm_SmallDependenceEmphasis 1.419 1.047 ~ 1.923 0.024
gradient_firstorder_Uniformity 0.707 0.523 ~ 0.955 0.024
squareroot_glcm_ClusterTendency 1.392 1.044 ~ 1.855 0.024
squareroot_gldm_DependenceNonUniformityNormalized 1.398 1.045 ~ 1.87 0.024
original_glcm_ClusterTendency 1.389 1.044 ~ 1.849 0.024
wavelet.LHL_glszm_GrayLevelVariance 1.338 1.039 ~ 1.723 0.024
gradient_glcm_JointEnergy 0.705 0.52 ~ 0.956 0.024
gradient_firstorder_RootMeanSquared 1.33 1.038 ~ 1.706 0.024
wavelet.HHH_firstorder_InterquartileRange 1.349 1.039 ~ 1.751 0.025
wavelet.HLL_firstorder_Entropy 1.401 1.044 ~ 1.88 0.025
wavelet.HHL_glcm_JointEnergy 0.715 0.533 ~ 0.959 0.025
wavelet.LLL_glcm_InverseVariance 0.721 0.542 ~ 0.96 0.025
gradient_gldm_DependenceEntropy 1.437 1.046 ~ 1.975 0.025
wavelet.HHL_glrlm_ShortRunEmphasis 1.394 1.042 ~ 1.866 0.025
wavelet.LHL_gldm_SmallDependenceHighGrayLevelEmphasis 1.374 1.04 ~ 1.815 0.026
wavelet.HHL_gldm_DependenceEntropy 1.387 1.04 ~ 1.848 0.026
wavelet.LHL_gldm_LargeDependenceEmphasis 0.693 0.502 ~ 0.957 0.026
wavelet.LLL_glrlm_RunPercentage 1.418 1.042 ~ 1.928 0.026
gradient_glszm_LargeAreaHighGrayLevelEmphasis 0.653 0.448 ~ 0.951 0.026
wavelet.HLL_firstorder_RobustMeanAbsoluteDeviation 1.382 1.039 ~ 1.839 0.026
wavelet.HLL_gldm_SmallDependenceEmphasis 1.376 1.036 ~ 1.828 0.028
original_firstorder_RobustMeanAbsoluteDeviation 1.399 1.038 ~ 1.886 0.028
wavelet.HLL_glcm_SumEntropy 1.394 1.037 ~ 1.873 0.028
wavelet.HHL_glrlm_GrayLevelNonUniformityNormalized 0.718 0.534 ~ 0.964 0.028
gradient_firstorder_RobustMeanAbsoluteDeviation 1.37 1.035 ~ 1.814 0.028
squareroot_glszm_ZonePercentage 1.412 1.038 ~ 1.922 0.028
original_glcm_SumEntropy 1.41 1.037 ~ 1.916 0.028
wavelet.HLL_glcm_DifferenceEntropy 1.386 1.035 ~ 1.856 0.028
original_glcm_JointEnergy 0.703 0.513 ~ 0.964 0.028
gradient_glrlm_ShortRunLowGrayLevelEmphasis 1.411 1.036 ~ 1.921 0.029
wavelet.LLL_ngtdm_Contrast 1.257 1.024 ~ 1.543 0.029
original_glrlm_LongRunLowGrayLevelEmphasis 0.713 0.525 ~ 0.967 0.03
wavelet.HLL_ngtdm_Busyness 0.685 0.487 ~ 0.964 0.03
squareroot_glcm_InverseVariance 0.714 0.526 ~ 0.968 0.03
original_gldm_DependenceNonUniformityNormalized 1.346 1.029 ~ 1.762 0.03
original_ngtdm_Contrast 1.331 1.027 ~ 1.725 0.031
squareroot_glcm_Id 0.708 0.517 ~ 0.97 0.031
wavelet.HLL_firstorder_10Percentile 0.741 0.564 ~ 0.974 0.032
original_firstorder_Uniformity 0.711 0.521 ~ 0.971 0.032
wavelet.HLH_firstorder_10Percentile 0.738 0.559 ~ 0.974 0.032
wavelet.LHL_glcm_JointEnergy 0.683 0.481 ~ 0.969 0.033
gradient_gldm_HighGrayLevelEmphasis 1.278 1.02 ~ 1.601 0.033
original_gldm_LargeDependenceLowGrayLevelEmphasis 0.715 0.526 ~ 0.973 0.033
original_gldm_SmallDependenceHighGrayLevelEmphasis 1.335 1.023 ~ 1.742 0.033
wavelet.HLL_glcm_JointEntropy 1.37 1.024 ~ 1.832 0.034
wavelet.HLL_glszm_GrayLevelNonUniformityNormalized 0.727 0.541 ~ 0.977 0.035
squareroot_glcm_Idm 0.71 0.517 ~ 0.976 0.035
wavelet.HLL_glcm_Id 0.732 0.547 ~ 0.978 0.035
original_glrlm_GrayLevelNonUniformityNormalized 0.714 0.521 ~ 0.977 0.035
wavelet.LLL_glcm_JointEntropy 1.385 1.023 ~ 1.874 0.035
wavelet.HLL_firstorder_InterquartileRange 1.355 1.021 ~ 1.799 0.036
wavelet.HLL_glcm_Idm 0.734 0.55 ~ 0.98 0.036
gradient_glrlm_GrayLevelNonUniformityNormalized 0.731 0.545 ~ 0.98 0.036
gradient_gldm_DependenceNonUniformity 0.711 0.517 ~ 0.978 0.036
gradient_firstorder_InterquartileRange 1.349 1.019 ~ 1.785 0.036
squareroot_glcm_DifferenceEntropy 1.386 1.02 ~ 1.885 0.037
original_firstorder_InterquartileRange 1.37 1.019 ~ 1.843 0.037
squareroot_firstorder_MeanAbsoluteDeviation 1.38 1.019 ~ 1.869 0.038
wavelet.LLL_glcm_SumSquares 1.28 1.014 ~ 1.616 0.038
original_glszm_LargeAreaEmphasis 0.669 0.458 ~ 0.978 0.038
wavelet.HLL_glszm_SmallAreaEmphasis 1.387 1.018 ~ 1.89 0.038
wavelet.HLL_glszm_ZonePercentage 1.35 1.016 ~ 1.793 0.038
original_glrlm_LongRunEmphasis 0.707 0.509 ~ 0.982 0.038
wavelet.HLL_glrlm_GrayLevelNonUniformityNormalized 0.725 0.535 ~ 0.983 0.038
wavelet.LLL_gldm_LargeDependenceEmphasis 0.714 0.519 ~ 0.982 0.039
original_glszm_ZoneVariance 0.67 0.459 ~ 0.979 0.039
wavelet.HHL_ngtdm_Complexity 1.302 1.013 ~ 1.674 0.039
wavelet.LHL_glszm_SmallAreaHighGrayLevelEmphasis 1.341 1.014 ~ 1.773 0.04
logarithm_ngtdm_Complexity 1.374 1.015 ~ 1.862 0.04
wavelet.HLL_glrlm_RunPercentage 1.36 1.014 ~ 1.824 0.04
wavelet.HLL_firstorder_Uniformity 0.734 0.546 ~ 0.987 0.041
logarithm_glcm_ClusterProminence 1.376 1.013 ~ 1.87 0.041
wavelet.LHL_glrlm_LongRunEmphasis 0.681 0.471 ~ 0.985 0.041
wavelet.LHL_glrlm_ShortRunHighGrayLevelEmphasis 1.344 1.011 ~ 1.786 0.042
wavelet.LLL_glszm_LargeAreaLowGrayLevelEmphasis 0.689 0.481 ~ 0.987 0.042
wavelet.HLL_glrlm_RunLengthNonUniformityNormalized 1.354 1.011 ~ 1.815 0.042
squareroot_glszm_SmallAreaHighGrayLevelEmphasis 1.379 1.011 ~ 1.882 0.042
gradient_glrlm_LowGrayLevelRunEmphasis 0.733 0.542 ~ 0.99 0.043
wavelet.LLL_glcm_ClusterTendency 1.288 1.008 ~ 1.647 0.043
wavelet.LLL_glcm_Contrast 1.254 1.007 ~ 1.562 0.043
wavelet.LHL_gldm_GrayLevelNonUniformity 0.736 0.546 ~ 0.991 0.043
wavelet.LLL_firstorder_Variance 1.274 1.007 ~ 1.612 0.044
wavelet.LLL_gldm_GrayLevelVariance 1.274 1.007 ~ 1.611 0.044
squareroot_glrlm_RunLengthNonUniformityNormalized 1.394 1.009 ~ 1.925 0.044
wavelet.LHL_glcm_MaximumProbability 0.718 0.519 ~ 0.991 0.044
logarithm_glszm_SizeZoneNonUniformityNormalized 1.355 1.008 ~ 1.823 0.044
wavelet.LHL_glrlm_RunVariance 0.693 0.485 ~ 0.991 0.045
wavelet.HLL_gldm_LargeDependenceEmphasis 0.738 0.549 ~ 0.993 0.045
squareroot_glszm_GrayLevelVariance 1.339 1.006 ~ 1.782 0.045
original_glcm_MaximumProbability 0.725 0.529 ~ 0.993 0.045
logarithm_glszm_SmallAreaEmphasis 1.371 1.006 ~ 1.869 0.046
squareroot_gldm_GrayLevelNonUniformity 0.739 0.549 ~ 0.994 0.046
original_glrlm_GrayLevelVariance 1.295 1.005 ~ 1.669 0.046
wavelet.HLL_glszm_SizeZoneNonUniformityNormalized 1.337 1.005 ~ 1.777 0.046
wavelet.LLL_firstorder_MeanAbsoluteDeviation 1.35 1.006 ~ 1.813 0.046
wavelet.HLL_glrlm_ShortRunEmphasis 1.358 1.004 ~ 1.837 0.047
squareroot_firstorder_RobustMeanAbsoluteDeviation 1.346 1.004 ~ 1.806 0.047
wavelet.LHH_firstorder_TotalEnergy 1.313 1.003 ~ 1.719 0.048
squareroot_firstorder_90Percentile 1.334 1.003 ~ 1.775 0.048
wavelet.LHL_firstorder_TotalEnergy 1.317 1.002 ~ 1.731 0.048
gradient_glcm_Contrast 1.251 1.002 ~ 1.563 0.048
wavelet.LLL_gldm_SmallDependenceHighGrayLevelEmphasis 1.332 1.002 ~ 1.77 0.048
wavelet.HHL_glcm_InverseVariance 0.773 0.598 ~ 0.999 0.049
squareroot_glrlm_RunPercentage 1.399 1.001 ~ 1.954 0.049
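Each row above reports, for one radiomic feature, the hazard ratio (HR), its 95% confidence interval, and the p-value from a univariate Cox proportional hazards model. As a minimal sketch of how such values are obtained, the snippet below fits a one-covariate Cox model by Newton-Raphson on the partial likelihood and derives the Wald statistics; the function name `univariate_cox` is illustrative, the sketch assumes no (or negligible) tied event times, and in practice a validated library such as lifelines' `CoxPHFitter` would be used instead.

```python
import math
import numpy as np

def univariate_cox(x, time, event, n_iter=25):
    """Fit a one-covariate Cox model by Newton-Raphson and return
    (HR, 95% CI lower, 95% CI upper, Wald p-value).

    Assumes event times are untied (tie handling is omitted for brevity).
    """
    order = np.argsort(-time)            # sort by descending time so each
    x, time, event = x[order], time[order], event[order]  # risk set is a prefix
    beta = 0.0
    for _ in range(n_iter):
        w = np.exp(beta * x)             # per-subject risk weights exp(beta * x)
        s0 = np.cumsum(w)                # sums over the risk set {j : t_j >= t_i}
        s1 = np.cumsum(w * x)
        s2 = np.cumsum(w * x * x)
        d = event.astype(bool)           # contributions only at event times
        U = np.sum(x[d] - s1[d] / s0[d])             # score (first derivative)
        I = np.sum(s2[d] / s0[d] - (s1[d] / s0[d]) ** 2)  # observed information
        beta += U / I                    # Newton step toward the MLE
    se = 1.0 / math.sqrt(I)              # standard error from the information
    z = beta / se                        # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return (math.exp(beta),
            math.exp(beta - 1.96 * se),  # 95% CI on the hazard-ratio scale
            math.exp(beta + 1.96 * se),
            p)
```

On a toy cohort where subjects with the higher feature value tend to fail earlier, the fitted HR exceeds 1 and sits inside its confidence interval, mirroring the layout of the rows above (HR, CI lower ~ upper, p).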