Deep Radiomics Analytics Pipeline for Prognosis of Pancreatic Ductal Adenocarcinoma
By
Yucheng Zhang
A thesis submitted in conformity with the requirements
for the degree of Master of Science
Institute of Medical Science
University of Toronto
© Copyright by Yucheng Zhang (2019)
Deep Radiomics Analytics Pipeline for Pancreatic Ductal Adenocarcinoma
Yucheng Zhang
Master of Science
Institute of Medical Science
University of Toronto
2019
Abstract
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive cancers with
extremely poor prognosis. Radiomics has shown prognostic ability in multiple types of
cancer including PDAC. However, the prognostic value of traditional radiomics pipelines,
which are based on hand-crafted radiomic features alone, is limited due to high correlation
among features and the multiple testing problem. Deep learning architectures, such as Convolutional Neural Networks (CNNs), have been shown to outperform traditional feature-based approaches in computer vision tasks such as object detection. Nonetheless, they require
large sample sizes for training which limits their application in medical imaging. As an
alternative solution, CNN-based transfer learning has shown potential for achieving
reasonable performance using datasets with small sample sizes. In this work, we developed a
CNN-based deep radiomics pipeline based on transfer learning, which outperforms the
traditional radiomics model in resectable PDAC prognostication.
Acknowledgements
First and foremost, I would like to thank my supervisors, Dr. Farzad Khalvati and Dr. Masoom Haider. I appreciate all your contributions of time, inspiration, and effort. This project would not have been possible without your guidance. It has been a great honor to be a member of this research group for four years. As an international student, I have come to realize that this research group has become my home in Canada.
I would like to express my sincerest gratitude to my committee members: Dr. Babak Taati and Dr.
Qiang Sun. It was my pleasure to discuss the research with you and I sincerely appreciate your valuable
time, suggestions, and insightful discussions.
I would also like to thank my family: my mother, Chen Wang, and my grandparents, Liying Qian and Tiancai Wang. Thank you for your love, support, and encouragement on this journey. Although we are a few thousand miles apart, every phone call and text message from you has strengthened my resolve. I also need to thank my dear friends in Beijing, Connecticut, and Toronto. Thank you for all your support.
Finally, I want to express my deepest gratitude and respect to the patients enrolled in this study. It is your contribution that made this project possible.
Contributions
Dr. Farzad Khalvati and Dr. Masoom Haider:
Supervised and directed all aspects of this research study and thesis.
Dr. Edrise M. Lobo-Mueller:
Contoured the Region of Interest on pre-operative CT images.
Dr. Paul Karanicolas and Dr. Steven Gallinger:
Assisted in patient enrollment and provided essential data.
Table of Contents
Abstract ....................................................................................................................................................... ii
Acknowledgements .................................................................................................................................... iii
Contributions.............................................................................................................................................. iv
List of Tables ............................................................................................................................... ix
List of Figures ............................................................................................................................................. x
List of Abbreviations ................................................................................................................................ xii
Chapter 1: Literature Review ...................................................................................................................... 1
1.1 Pancreatic Ductal Adenocarcinoma .................................................................................................. 1
1.1.1 Introduction ................................................................................................................................ 1
1.1.2 Risk factors ................................................................................................................................ 1
1.1.3 Diagnostic biomarkers ............................................................................................................... 3
1.1.4 Treatment ................................................................................................................................... 6
1.1.5 Biomarker for chemotherapy response ...................................................................................... 7
1.1.6 Prognostic markers..................................................................................................................... 9
1.2 Radiomics: analysis of quantitative imaging markers .................................................................... 11
1.2.1 Introduction .............................................................................................................................. 11
1.2.2 Pipeline .................................................................................................................................... 12
1.2.3 Segmentation............................................................................................................................ 12
1.2.4 Feature extraction..................................................................................................................... 14
1.2.5 Feature analysis and model building........................................................................................ 15
1.2.6 Current progress ....................................................................................................................... 17
1.2.7 Limitations of traditional radiomics analytic pipeline ............................................................. 20
1.3 Deep learning in medical imaging .................................................................................................. 23
1.3.1 Neural Networks and CNN ...................................................................................................... 23
1.3.2 ResNet ...................................................................................................................................... 31
1.3.3 Transfer learning ...................................................................................................................... 32
1.3.4 Deep learning in medical imaging research ............................................................................. 34
1.3.5 Future direction ........................................................................................................................ 38
Chapter 2: Aim and hypothesis ................................................................................................................. 40
2.1 Study 1 ............................................................................................................................................ 40
2.1.1 Aims ......................................................................................................................................... 40
2.1.2 Hypothesis................................................................................................................................ 40
2.1.3 Rationale for hypothesis .......................................................................................................... 41
2.2 Study 2 ............................................................................................................................................ 41
2.2.1 Aims ......................................................................................................................................... 41
2.2.2 Hypothesis................................................................................................................................ 42
2.2.3 Rationale for hypothesis .......................................................................................................... 42
2.3 Study 3 ............................................................................................................................................ 43
2.3.1 Aims ......................................................................................................................................... 43
2.3.2 Hypothesis................................................................................................................................ 43
2.3.3 Rationale for hypothesis .......................................................................................................... 43
Chapter 3: Study 1 .................................................................................................................................... 45
3.1 Abstract ........................................................................................................................................... 46
3.2 Introduction ..................................................................................................................................... 46
3.3 Methods........................................................................................................................................... 49
3.3.1 Dataset...................................................................................................................................... 49
3.3.2 Radiomics feature extraction ................................................................................................... 50
3.3.3 Transfer learning ...................................................................................................................... 51
3.3.4 Feature analysis ........................................................................................................................ 52
3.4 Results ............................................................................................................................................. 53
3.4.1 Feature-wise prognostic values ................................................................................................ 53
3.4.2 Prognostic model performance ................................................................................................ 53
3.4.3 Risk score ................................................................................................................................. 54
3.5 Discussion ....................................................................................................................................... 56
3.6 Conclusion ...................................................................................................................................... 58
Chapter 4: Study 2 .................................................................................................................................... 59
4.1: Abstract .......................................................................................................................................... 60
4.2: Introduction .................................................................................................................................... 60
4.3 Methods........................................................................................................................................... 64
4.3.1 Dataset...................................................................................................................................... 64
4.3.2 Radiomics Feature Extraction .................................................................................................. 64
4.3.3 Transfer Learning Feature Extraction ...................................................................................... 65
4.3.4 Correlation ............................................................................................................................... 66
4.3.5 Proposed Prognosis Model ...................................................................................................... 66
4.4 Results ............................................................................................................................................. 68
4.4.1 Correlation Analysis Between Pre-defined and Deep Radiomic Features .............................. 68
4.4.2 Prognosis Performance of the Proposed Prognosis Model ...................................................... 70
4.5 Discussion ....................................................................................................................................... 73
Chapter 5: Study 3 .................................................................................................................................... 75
5.1 Abstract ........................................................................................................................................... 76
5.2 Introduction ..................................................................................................................................... 76
5.3 Methods........................................................................................................................................... 79
5.3.1 Data .......................................................................................................................................... 79
5.3.2 Architecture of the proposed CNN-Survival ........................................................................... 79
5.3.3 Loss Function ........................................................................................................................... 80
5.3.4 Training process and Transfer Learning .................................................................................. 80
5.3.5 Traditional Radiomics analytic pipeline .................................................................................. 81
5.4 Results ............................................................................................................................................. 84
5.5 Discussion ....................................................................................................................................... 86
5.6 Conclusion ...................................................................................................................................... 87
Chapter 6: General Discussion.................................................................................................................. 88
6.1: Study 1 ........................................................................................................................................... 88
6.1.1 Discussion ................................................................................................................................ 88
6.1.2 Strength and limitations ........................................................................................................... 91
6.1.3 Implications.............................................................................................................................. 92
6.2: Study 2 ........................................................................................................................................... 93
6.2.1 Discussion ................................................................................................................................ 93
6.2.2 Strength and limitations ........................................................................................................... 96
6.2.3 Implications.............................................................................................................................. 97
6.3: Study 3 ........................................................................................................................................... 97
6.3.1 Discussion ................................................................................................................................ 97
6.3.2 Strength and limitations ........................................................................................................... 99
Chapter 7: Conclusions ........................................................................................................................... 100
Chapter 8: Future directions.................................................................................................................... 101
References ............................................................................................................................................... 102
Appendix ................................................................................................................................................. 131
List of Tables
Table 1.1: List of available biomarkers and their performances ................................................................4
Table 1.2: List of ECOG criteria ................................................................................................................9
Table 1.3: List of common features .........................................................................................................14
Table 1.4: List of recent radiomics studies and their performances in AUC ............................................22
Table 1.5: List of representative segmentation studies in the medical imaging field .............................37
Table 3.1: List of radiomic feature classes and filters .............................................................................51
Table 3.2: List of hazard ratios and p values ...........................................................................................56
Table 4.1: Number of features extracted from different filters ................................................................65
Table 4.2: Absolute Pearson correlation coefficient between features ....................................................68
Table 4.3: Summary table for models using four feature reduction methods ..........................................71
Table 5.1: Concordance index of proposed models .................................................................................85
Table A.1: List of significant PyRadiomics features for PDAC prognosis ...........................................131
List of Figures
Figure 1.1: Traditional Radiomics Pipeline ..............................................................................................12
Figure 1.2: Typical CNN architecture ......................................................................................................25
Figure 1.3: Graphical presentation of convolution operations .................................................................26
Figure 1.4: Graphical representation of zero-padding .............................................................................27
Figure 1.5: Graphical representation of max pooling ..............................................................................28
Figure 1.6: Graphical representation of Fully Connected Layers ............................................................29
Figure 1.7: Graphical representation of gradient descent algorithm ........................................................30
Figure 1.8: Graphical representation of identity path ..............................................................................31
Figure 1.9: Graphical representation of transfer learning in CNN ...........................................................33
Figure 3.1: Manual contour of CT scan from a representative patient in cohort 2 ...................................50
Figure 3.2: Workflow for transfer learning studies ..................................................................................52
Figure 3.3: ROC curves ............................................................................................................................54
Figure 3.4: Kaplan-Meier plots for OS in Cohort 2 ..................................................................................55
Figure 4.1: Pipelines for different feature fusion methods ......................................................................67
Figure 4.2: Correlation heatmap of three different feature extraction methods .......................................69
Figure 4.3: Histogram of Pearson correlation coefficients .......................................................................70
Figure 4.4: ROC curves of models using four feature reduction methods ..............................................72
Figure 5.1: The proposed CNN-Survival architecture .............................................................................82
Figure 5.2: Example of input CT images ................................................................................................83
Figure 5.3: Example of small ROI in Cohort 1 .........................................................................................83
Figure 5.4: Loss changes during pre-train ................................................................................................84
Figure 5.5: Survival probability example 1 .............................................................................................85
Figure 5.6: Survival probability example 2 ..............................................................................................86
List of Abbreviations
2D Two-dimensional
3D Three-dimensional
AI Artificial Intelligence
ANN Artificial Neural Network
ANOVA Analysis of Variance
AUC Area Under Curve
CAD Computer-aided diagnosis
CADe Computer-aided detection
CAM Class activation map
CBCT Cone beam computed tomography
CCC Concordance correlation coefficient
CI Confidence interval
CNN Convolutional Neural Network
CONV Convolution
CPH Cox Proportional Hazards Model
CT Computed tomography
DL Deep Learning
DNN Deep Neural Network
FBP Filtered back projection
FC Fully Connected
FCN Fully Convolutional Network
GAN Generative Adversarial Network
GLM Generalized linear model
GPU Graphical processing unit
HR Hazard Ratio
ICA Independent component analysis
ICC Intraclass correlation coefficient
ISBI International Symposium on Biomedical Imaging
LDA Linear Discriminant Analysis
LSTM Long Short-Term Memory
MR Magnetic resonance
NN Neural Network
NSCLC Non-small cell lung cancer
PCA Principal Component Analysis
PDAC Pancreatic Ductal Adenocarcinoma
PET Positron emission tomography
ReLU Rectified Linear Unit
RF Random Forest
RGB Red, green, and blue
RNN Recurrent Neural Network
ROC Receiver operating characteristic
RR Relative Risk
SGD Stochastic Gradient Descent
SMOTE Synthetic Minority Over-sampling Technique
SVM Support Vector Machine
Chapter 1: Literature Review
1.1 Pancreatic Ductal Adenocarcinoma
1.1.1 Introduction
Pancreatic Ductal Adenocarcinoma (PDAC) is a lethal cancer with poor prognosis and increasing incidence. It is estimated that more than 350,000 people worldwide are diagnosed with PDAC each year (McGuigan et al., 2018). PDAC has a low 5-year survival rate of approximately 7.1% (Stark et al., 2016), making it the fourth leading cause of cancer-related deaths (Ilic & Ilic, 2016). In addition, incidence rates vary significantly around the world. Wong et al. showed that developed countries have higher incidence rates than developing countries (Wong et al., 2017), and Europe and North America have the highest age-standardized incidence rates (Ilic & Ilic, 2016). Moreover, the incidence rate is increasing in the Western world: Saad et al. found that the age-adjusted incidence rate is increasing by 1.03% per year in the United States. It is estimated that by 2030, pancreatic cancer will become the second most common cause of cancer-related death in the United States (Siegel et al., 2009).
Significant improvements in cancer screening methods and treatment therapies have improved survival rates for most cancers (Adamska, Domenichini, & Falasca, 2017a; Urruticoechea et al., 2010). Unfortunately, the survival rate for PDAC patients has remained almost unchanged (Adamska et al., 2017a). In this study, we aimed to develop a CT image-based prognosis model for PDAC patients to help healthcare professionals make personalized and efficient treatment plans. To develop such a model, it is critical to review recognized PDAC risk factors, treatment options, and diagnostic and prognostic markers. These topics are discussed in the following sections of Chapter 1.
1.1.2 Risk factors
Researchers have identified several risk factors for PDAC, including sex, age, blood group, gut microbiota, diabetes, smoking, and family history (Arnold et al., 2009; Bosetti et al., 2012; Memba et al., 2017; Midha, Chawla, & Garg, 2016; Pernick et al., 2003; Rohrmann et al., 2009; Silverman et al., 2003; Wahi, Shah, Schrock, Rosemurgy, & Goldin, 2009; B. M. Wolpin et al., 2009; Brian M. Wolpin et al., 2010; WOOD et al., 2006). However, it must be noted that some of these risk factors were identified in small-sample case-control studies with inevitable selection bias (McGuigan et al., 2018). The following sections explore the risk factors identified in the academic literature.
Sex
Incidence rates vary between sexes. The worldwide age-standardized incidence rate has been reported as 5.5% for males and 4.0% for females (McGuigan et al., 2018). In developed countries, the difference is more pronounced. This disparity may be attributed to different levels of exposure to other risk factors, such as smoking and smokeless tobacco use. Notably, a systematic review of 15 PDAC studies concluded that reproductive factors were not associated with pancreatic cancer in women (Wahi et al., 2009).
Age
The incidence rate for pancreatic cancer is positively correlated with age (McGuigan et al., 2018): 90% of pancreatic cancer patients are over 55 years of age (Midha et al., 2016; WOOD et al., 2006). The incidence rate peaks at different ages in different countries. In the United States, the majority of newly diagnosed patients are in their seventh decade of life, while in India the disease typically peaks among patients in their sixth decade (McGuigan et al., 2018; Midha et al., 2016).
Blood group
In a meta-analysis, Wolpin et al. found that, compared to people with blood type O, individuals with other blood types have a higher risk of developing pancreatic adenocarcinoma: A (HR: 1.32, 95% CI: 1.02-1.72), B (HR: 1.72, 95% CI: 1.25-2.38), and AB (HR: 1.51, 95% CI: 1.02-2.23) (B. M. Wolpin et al., 2009). This finding was confirmed by a follow-up epidemiological study (Brian M. Wolpin et al., 2010). It has been hypothesized that differences in inflammatory state across ABO groups and alterations in glycosyltransferase specificity may explain these disparities (McGuigan et al., 2018; Brian M. Wolpin et al., 2010).
Gut microbiota
Memba et al. found that people with lower levels of Neisseria elongata and Streptococcus mitis, and higher levels of Porphyromonas gingivalis and Granulicatella adiacens, had a higher risk of developing pancreatic cancer (Memba et al., 2017). However, confounding variables cannot be ruled out, and further studies are required to validate these findings (McGuigan et al., 2018).
Family History
Among all pancreatic cancer patients, 5 to 10% have two or more first-degree relatives who were previously diagnosed with pancreatic cancer (Hruban, Canto, Goggins, Schulick, & Klein, 2010). Compared to an individual with no family history, a person with one first-degree relative with pancreatic cancer faces an 80% increase in the risk of developing pancreatic cancer (RR: 1.8, 95% CI: 1.48-2.12) (Permuth-Wey & Egan, 2009). An individual with three or more first-degree relatives previously diagnosed with PDAC has a thirty-two-fold higher risk of developing pancreatic cancer (Becker, Hernandez, Frucht, & Lucas, 2014).
Diabetes
Stevens et al. found that patients with type I diabetes are twice as likely to develop pancreatic cancer as patients without diabetes (RR: 2.00, 95% CI: 1.37-3.01) (Stevens, Roddam, & Beral, 2007). Similarly, for patients with type II diabetes, the odds ratio is 1.82 (95% CI: 1.66-1.89) (Huxley, Ansary-Moghaddam, Berrington de González, Barzi, & Woodward, 2005). Nevertheless, it must be noted that PDAC itself can cause diabetes; hence, it is important to control for confounding variables when investigating risk factors.
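The relative risks quoted in these sections can be reproduced from a standard 2×2 contingency table. The sketch below is illustrative only; the cohort counts are hypothetical, chosen so the point estimate matches an RR of 2.00, and are not data from any cited study:

```python
import math

def relative_risk(a, b, c, d):
    """Relative risk with a 95% CI for a 2x2 table:
    exposed group: a cases, b non-cases; unexposed group: c cases, d non-cases."""
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(RR) under the usual log-normal approximation
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Hypothetical cohort: 30 of 1000 diabetic patients vs 15 of 1000 controls develop PDAC
rr, lo, hi = relative_risk(30, 970, 15, 985)
print(f"RR = {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # RR = 2.00
```

A confidence interval whose lower bound exceeds 1.0 (as in the cited diabetes studies) indicates a statistically significant increase in risk at the 5% level.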
1.1.3 Diagnostic biomarkers
Pancreatic cancer patients are often diagnosed at late stages, when the tumor is no longer resectable, resulting in low survival rates. Consequently, early detection of pancreatic cancer is critical for effective treatment and management. Although several biomarkers have been identified, none is an ideal candidate due to a variety of limitations (Loosen, Neumann, Trautwein, Roderburg, & Luedde, 2017). Table 1.1 below lists these biomarkers and their performance, where sensitivity is defined as the probability of a positive test given that the patient has the disease, and specificity is the probability of a negative test in a healthy person. Because early diagnosis is essential for successful PDAC treatment and is associated with the prognosis of PDAC patients, a review of the literature in this field is warranted. Details of these biomarkers are discussed in the following paragraphs.
Table 1.1: List of available biomarkers and their performance for PDAC diagnosis
Biomarker Sensitivity (%) Specificity (%) Reference
CA19-9 81 81 (Y Zhang et al., 2015)
CA50 71.1 93.5 (Liao et al., 2007)
CA72-4 63.4 75.2 (WU et al., 2006)
CA125 66.8 83.3 (Jiang, Tao, & Zou, 2004)
CA242 67.8 83 (Y Zhang et al., 2015)
CEA 39.5 81.3 (Y Zhang et al., 2015)
MIC-1 79.0 86.0 (Y.-Z. Chen et al., 2014)
PAM4 76.0 85.0 (David V. Gold et al., 2013)
miR-21 90.0 66.7 (J.-Y. Yang et al., 2014)
miR-155 76.7 73.3 (J.-Y. Yang et al., 2014)
miR-143 and miR-30e 83.3 96.2 (J.-Y. Yang et al., 2014)
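The sensitivity and specificity figures in Table 1.1 follow directly from a test's confusion matrix. A minimal sketch, using hypothetical counts chosen only to reproduce the CA19-9 row:

```python
def sensitivity(tp, fn):
    """P(positive test | disease present) = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """P(negative test | disease absent) = TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical screen of 100 PDAC patients and 100 healthy controls
print(sensitivity(tp=81, fn=19))  # 0.81, as in the CA19-9 row
print(specificity(tn=81, fp=19))  # 0.81
```

Note that a screening biomarker needs both values to be high: high sensitivity limits missed cancers, while high specificity limits false alarms in the much larger healthy population.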
CA19-9 and other carbohydrate antigens
CA19-9 (carbohydrate antigen 19-9), also known as sialyl-Lewis A, is the only biomarker approved by the FDA for PDAC diagnosis (Goonetilleke & Siriwardena, 2007). However, CA19-9 has several limitations for PDAC diagnosis (Loosen et al., 2017). First, the serum level of CA19-9 not only suggests the presence of PDAC but may also indicate other medical conditions, including pancreatitis, obstructive jaundice, acute cholangitis, and liver cirrhosis (Ballehaninna & Chamberlain, 2012; Perkins, Slater, Sanders, & Prichard, 2003; Satake, Kanazawa, Kho, Chung, & Umeyama, 1985; Steinberg, 1990). Second, CA19-9 has a median sensitivity and specificity of 75% and 77%, respectively, for PDAC diagnosis, indicating that it does not qualify as an accurate screening biomarker (Y Zhang et al., 2015). Last, approximately 5%-10% of Caucasians have a Lewis-null blood type that does not produce CA19-9, further limiting the use of CA19-9 as a screening tool (Goonetilleke & Siriwardena, 2007; Von Rosen, Linder, Harmenberg, & Pegert, 1993).
Other carbohydrate antigens including CEA and CA50, CA195, CA72-4, and CA125 have also shown
diagnostic potential as screening biomarkers (Bünger, Laubert, Roblick, & Habermann, 2011; Y Zhang
et al., 2015). However, the performance of these biomarkers is also limited. As shown in Table 1.1, CEA
has only 39.5% sensitivity and 81.3% specificity. Further research is required to develop robust diagnostic
screening biomarkers using carbohydrate antigens.
Non-coding RNAs
miRNAs (microRNAs) are a class of non-coding RNAs involved in post-transcriptional regulation,
targeting and degrading other RNAs with specific sequences (Bartel, 2009; Lagos-Quintana, 2001). For
many types of cancer, miRNAs are routinely used as detection biomarkers
(Hong & Park, 2014; Rosenfeld et al., 2008). For PDAC diagnosis, multiple miRNAs showed potential,
including miR-21, miR-155, miR-196a, miR-216, miR-217, and miR-210 (Bloomston et al., 2007;
Caponi et al., 2013; Dillhoff, Liu, Frankel, Croce, & Bloomston, 2008; Schultz et al., 2012; Szafranska
et al., 2007).
These biomarkers were upregulated or downregulated in pancreatic tissue and juice (Hong & Park,
2014; Link, Becker, Goel, Wex, & Malfertheiner, 2012; Sadakari et al., 2010). However, acquisition of
these biomarkers is challenging since it may involve tissue biopsy, which is not appropriate for
screening tests. Recent research from Yang et al. targeted these miRNAs in fecal specimens and, for the
miR-143 and miR-30e combination, achieved 83.3% sensitivity and 96.2% specificity (Table 1.1). Hence,
these non-invasive biomarker acquisitions have significant potential in the PDAC screening process
(J.-Y. Yang et al., 2014).
Other non-coding RNAs, including lncRNAs (MALAT-1, Gas5, MEG3, and HSATII) and snRNAs, also
have significantly different expression patterns in PDAC patients when compared to healthy individuals
(Kishikawa, 2015; Kung, Colognori, & Lee, 2013). Further studies are needed to comprehensively
evaluate the performance of these non-coding RNA biomarkers.
MIC-1
MIC-1, also known as macrophage inhibitory cytokine 1, shows significant overexpression for multiple
types of cancer (Bootcov et al., 1997; Buckhaults et al., 2001). Koopmann et al. found that, for
pancreatic cancer diagnosis, with an area under the ROC curve (AUC) of 0.99, MIC-1 performs
significantly better than CA19-9 (Koopmann, 2006). Moreover, when differentiating pancreatic cancer
from chronic pancreatitis, MIC-1 performs as well as CA19-9 (p value = 0.63) (Koopmann, 2006).
As discussed above, the usage of CA19-9 is limited in cases of Lewis-null blood type. Comparatively,
MIC-1 expression is universal. It has been shown that, among individuals in Lewis-null group, MIC-1
has a sensitivity of 63% (X. Wang et al., 2014). However, the performance of MIC-1 varies across
studies (McGuigan et al., 2018). A meta-analysis from Chen et al. showed that serum MIC-1 level has
median sensitivity and specificity of 79% and 86% respectively, significantly lower than the results
from Koopmann et al. (Y.-Z. Chen et al., 2014).
PAM4
PAM4, a new monoclonal antibody (MAb) known as clivatuzumab, is reactive with Mucin 5AC which
is expressed in pancreatic cancer and precursor lesions (D. V. Gold, Karanjawala, Modrak, Goldenberg,
& Hruban, 2007; David V. Gold, Lew, Maliniak, Hernandez, & Cardillo, 1994; D. Liu, Chang, Gold, &
Goldenberg, 2015). Gold et al. found that, in pancreatic cancer diagnosis tasks, PAM4 reached 76%
sensitivity and 85% specificity, which are significantly higher than CA19-9 (p value = 0.026) (David V.
Gold et al., 2013). Furthermore, combining PAM4 and CA19-9 produces the final model with 84%
sensitivity and 82% specificity (David V. Gold et al., 2013). Further validation in a large cohort is
expected to evaluate the potential of PAM4 as a PDAC diagnosis marker in clinical conditions.
1.1.4 Treatment
It has been found that surgical resection is the only treatment that offers a potential cure for patients with
pancreatic cancer. Surgical options for pancreatic cancer include pancreaticoduodenectomy and distal or
total pancreatectomy (McGuigan et al., 2018). However, not every tumor is resectable. The decision is
mainly based on the relationship between pancreatic cancer and the surrounding vascular structures
(Lynch et al., 2009; McGuigan et al., 2018). Hence, less than 20% of patients are candidates for surgery
since PDAC often spreads before initial diagnosis (Foucher et al., 2018).
It has been shown that, for patients undergoing surgery, adding chemotherapy improves overall
survival (Foucher et al., 2018; McGuigan et al., 2018). A recent study found that patients who received
adjuvant chemotherapy in addition to surgery had significantly longer median survival than
patients who underwent surgery alone (André et al., 2015). Given that, identifying patients with aggressive
tumors and offering aggressive treatments is important. Hence, it would be beneficial to provide accurate
prognoses for resectable PDAC patients, and that is the goal of this study.
On the other hand, chemotherapy is the main option for patients with advanced and metastatic PDAC
(Foucher et al., 2018; McGuigan et al., 2018). It has been shown that chemotherapy can increase
survival and relieve cancer-related symptoms (Adamska, Domenichini, & Falasca, 2017b). Currently,
clinicians have several chemotherapy options for patients with pancreatic cancer, including
Gemcitabine/Abraxane and FOLFIRINOX. However, these therapies are not effective for all patients. Biomarkers
are needed to develop personalized treatment plans for PDAC patients. These personalized treatment
plans may improve patients' quality of life as well as lower expenses.
1.1.5 Biomarker for chemotherapy response
FOLFIRINOX biomarker
FOLFIRINOX is a common chemotherapy regimen for PDAC patients. It consists of leucovorin,
irinotecan, oxaliplatin, and 5-FU (Adamska et al., 2017b; Loosen et al., 2017). It has been shown to be
especially effective for patients with metastatic pancreatic cancer (Adamska et al., 2017b). However,
due to the systemic toxicity of this therapy, its usage is limited in elderly patients (Conroy et al., 2011;
"FOLFIRINOX versus Gemcitabine for Metastatic Pancreatic Cancer," 2011; Gourgou-Bourgade et al.,
2013). Consequently, biomarkers are needed so that clinicians can identify patients who will benefit
from FOLFIRINOX treatment.
In genomic studies, it has been found that patients with inactivation of BRCA1, BRCA2, and PALB2
have better responses to FOLFIRINOX regimen (Waddell et al., 2015). Moreover, high expression
levels of CES2 (carboxylesterase 2) in pancreatic cancer tissue are positively associated with survival
among patients receiving FOLFIRINOX (Capello et al., 2015). Follow-up studies are needed to further
validate these findings.
Gemcitabine/Abraxane markers
For pancreatic cancer patients, another chemotherapy option is Gemcitabine plus Abraxane, which has
been used since 1997 (Kamisawa, Wood, Itoi, & Takaori, 2016). Due to its hydrophilic nature,
gemcitabine diffuses into cells poorly. Thus, the activity levels of nucleoside transporters are important
predictors of patients' responses to gemcitabine. The relevant transporters are concentrative nucleoside
transporters (CNT) and equilibrative nucleoside transporters (ENT) (Farrell et al., 2009; Yamada et al.,
2016).
CNT transports gemcitabine across the cell membrane using the sodium gradient (Greenhalf et al., 2014).
It has been found that, among patients taking gemcitabine, individuals with high human CNT expression
have higher survival rates than patients with lower CNT expression levels (p value = 0.028)
(Marechal et al., 2009).
The equilibrative nucleoside transporter (ENT) is another potential predictive marker for gemcitabine
response. It has been found that cell lines with high ENT expression have higher sensitivity to
gemcitabine in several in vitro studies (Spratlin, 2004). Further, multiple large-scale clinical studies
confirmed that patients with ENT expression have significantly longer median survival (Farrell et
al., 2009; Greenhalf et al., 2014).
Another potential biomarker for gemcitabine is deoxycytidine kinase, which converts gemcitabine into
its active form (Loosen et al., 2017). In a small cohort study, high deoxycytidine kinase expression was
found to have a positive association with the duration of disease-free survival (Fujita et al., 2010;
Sebastiani, 2006). Undoubtedly, further studies are needed to validate and assess the performance of
these biomarkers.
1.1.6 Prognostic markers
Prognostic markers are specific patient characteristics that can be utilized to predict the course of a
disease (McGuigan et al., 2018). Robust and valid prognostic markers can help healthcare professionals
design optimal treatment plans in patients' best interests. Several prognostic markers have been found
in PDAC, and their details are discussed below.
ECOG performance status
ECOG (Eastern Cooperative Oncology Group) performance status is a well-established prognostic
marker for different types of cancers, including PDAC (Sørensen, Klee, Palshof, & Hansen, 1993). The
grading criteria are listed in Table 1.2 below. It has been found that patients with a high ECOG
performance status grade may not benefit from combined chemotherapies (Louvet et al., 2005;
Peixoto et al., 2015).
Table 1.2: List of ECOG criteria (Oken et al., 1982)
Grade ECOG Performance status
0 Fully active, able to carry on all pre-disease performance without restriction
1 Restricted in physically strenuous activity but ambulatory and able to carry out work of
a light or sedentary nature, e.g., light housework, office work
2 Ambulatory and capable of all self-care but unable to carry out any work activities; up
and about more than 50% of waking hours
3 Capable of only limited self-care; confined to bed or chair more than 50% of waking
hours
4 Completely disabled; cannot carry on any self-care; confined to bed or chair
5 Dead
SPARC
SPARC (secreted protein acidic and rich in cysteine), also known as osteonectin, is a calcium-binding
glycoprotein. SPARC is involved in several cellular processes, including cell differentiation and
proliferation (McGuigan et al., 2018). Studies showed that SPARC expression has a negative association
with survival (C.-S. Wang, Lin, Chen, Chan, & Hsueh, 2004; Watkins, Douglas-Jones, Bryce, E Mansel,
& Jiang, 2005; K. Yamashita, Upadhay, Mimori, Inoue, & Mori, 2003). Additionally, Infante et al.
demonstrated that the location of SPARC is a prognostic biomarker for PDAC (Infante et al., 2007).
Patients with SPARC-negative stroma have significantly longer median survival than patients with
SPARC-positive stroma (p value < 0.001) (Loosen et al., 2017).
CA19-9
As discussed above, CA19-9 is a potential diagnostic marker. Moreover, high CA19-9 level also has a
negative association with survival duration (Ballehaninna & Chamberlain, 2012). In a recent study
using a univariate Cox Proportional Hazards Model for overall survival, CA19-9 had a hazard ratio of
1.37 with a 95% confidence interval from 1.00 to 1.88 (G. Luo et al., 2017). However, as discussed
above, CA19-9 has several drawbacks which limit its applications. Moreover, the prognostic
performance of CA19-9 is far from ideal.
Quantitative Image biomarkers
As a non-invasive tool, CT is commonly used in PDAC diagnosis and management (Adamska et al.,
2017b). It is used to assess stage and resectability. CT is also utilized to assess response to systemic
therapies. Nevertheless, beyond the RECIST criteria, quantitative measurements are not routinely used.
Recently, it has been found that several quantitative imaging features are associated with PDAC
prognosis for resectable patients. Eilaghi et al. found that the quantitative imaging features
"Dissimilarity" and "Inverse difference normalized" are associated with patients' overall survival
(Eilaghi et al., 2017). A recent multi-cohort study from Khalvati et al. confirmed the potential of
quantitative imaging features in PDAC prognosis. It has been shown that "Original_glcm_SumEntropy"
and "squareroot_glcm_ClusterTendency" are associated with overall survival in resectable
PDAC patients (Khalvati, Zhang, Baig, et al., 2019). Quantitative imaging biomarkers have shown
substantial potential in PDAC prognosis. The analytic pipeline of quantitative imaging biomarkers will
be discussed in the following sections of this chapter.
1.2 Radiomics: analysis of quantitative imaging markers
1.2.1 Introduction
Modern medicine is moving towards personalized medicine, where diagnosis, treatment, and prognosis
of the disease are modified for each patient. In clinical practice, radiology plays a critical role in
providing valuable information for physicians to detect, differentiate and diagnose abnormal conditions
in patients (Yip & Aerts, 2016). Radiological images contain a vast amount of information on lesions,
including shape and texture. However, human interpretation of medical images alone is potentially
biased and often fails to discover the entirety of potentially informative data.
Radiomics is a new field of study, which aims to discover and translate this un-decoded information
from medical images (V. Kumar et al., 2013). Radiomics is defined as the extraction and analysis of a
large number of quantitative features from the medical images. These features can offer comprehensive
information on texture, intensity, heterogeneity, and morphology (van Griethuysen et al., 2017).
Studying these features, researchers have found that many features have significant associations with
clinical outcomes and gene-expression levels (Yiming Li, Qian, et al., 2018; Papp et al., 2018). These
features can be further used to develop diagnostic or prognostic models which may serve as tools for
personalized diagnosis and clinical decision support systems.
By capturing the entire tumor site, radiomics features have the distinct advantage of assessing tissue
heterogeneity (Gillies, Kinahan, & Hricak, 2015). Other clinical procedures such as biopsy only capture
a small fraction of tumors, having significant chances of missing the index tumor (Khalvati, Zhang,
Wong, & Haider, 2019). Hence, it is a challenging task to get a comprehensive mapping of the tumor
using traditional approaches, leading to misinterpretations and non-optimal clinical decisions.
Comparatively, with the ability of “reading” tumors through 3D or 2D images, radiomics could
potentially overcome this challenge (van Griethuysen et al., 2017). In the past decade, radiomics studies
have been conducted on multiple diseases, including different types of cancers. Through these studies,
radiomics has shown its potential in disease diagnosis, prognosis, and prediction of treatment responses
(Keek, Leijenaar, Jochems, & Woodruff, 2018). Details about traditional radiomics analytics pipeline
will be discussed in the following sections.
Part of this section is modified from:
Zhang, Y. et al., Radiomics-based Prognosis Analysis for Non-Small Cell Lung Cancer, Sci Rep, 2017.
1.2.2 Pipeline
The radiomics analytics pipeline consists of several stages (Khalvati, Zhang, Wong, et al., 2019). Figure 1.1
below shows a typical pipeline for a radiomics study. First, raw images are pre-processed and segmented
to annotate Regions Of Interest (ROIs) such as cancerous regions (tumors) (Yucheng Zhang,
Oikonomou, Wong, Haider, & Khalvati, 2017). This is usually done manually or through automatic
segmentation algorithms. Next, a large number of quantitative imaging features are extracted from these
ROIs (Yucheng Zhang et al., 2017). Last, endpoint data (i.e., clinical outcomes such as disease
recurrence) is entered into the database, providing information for feature selection and model building
process.
Figure 1.1: Traditional Radiomics Pipeline
As mentioned above, radiomics studies require raw medical images and patients' outcomes as input.
Images come from different modalities including computed tomography (CT), magnetic resonance
imaging (MRI), and positron emission tomography (PET). Raw images from these modalities are often
saved as DICOM (Digital Imaging and Communications in Medicine) files, which contain the images
and “header” information. Currently, most scientific programming languages can read DICOM images
using specific packages or modules, enabling the further steps in the radiomics analytics pipeline.
1.2.3 Segmentation
As the first step, segmentation provides the ROIs' boundaries, which typically delineate the lesions
presented in the medical images. In addition, it has been found that segmenting not only the lesions but
also the peripheral zones around them can boost performance (Hambarde et al., 2019). As discussed
above, segmentation of the lesions is usually performed manually by radiologists. This is not only time-
consuming but also introduces undesirable variations (Owens et al., 2018). It has been found that several
radiomics features are sensitive to variations in segmentation (Owens et al., 2018). A recent multi-reader
study confirmed that some radiomics features have low inter-reader reliability (Khalvati, Zhang, Baig, et
al., 2019). These findings show the critical need for developing reliable and automated segmentation
methods. Although radiologists' contours are still considered the gold standard, automated segmentation
methods have been developed rapidly in the past few years. Details of these segmentation methods are
discussed below (Litjens, Kooi, Bejnordi, Setio, et al., 2017; Oktay et al., n.d.; Razzak, Naz, & Zaib,
n.d.).
Traditional thresholding-based segmentation uses pre-defined thresholds (Abdullah, Hambali, & Jamil,
2012). Pixels which have higher or lower values than the threshold are selected and labelled. This
approach needs prior knowledge of the image information and modality (Litjens, Kooi, Bejnordi, Setio,
et al., 2017). With an accurate threshold, segmentation could achieve acceptable performance on
“simple” tasks including lung and bone segmentation (Owens et al., 2018).
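As a concrete illustration, the thresholding approach described above can be sketched in a few lines of NumPy. The toy image and threshold value here are illustrative assumptions, not values from the cited studies.

```python
import numpy as np

def threshold_segment(image, threshold):
    """Label every pixel whose intensity exceeds a pre-defined threshold."""
    return image > threshold  # boolean ROI mask

# Toy 2D "slice": a bright 4x4 lesion (intensity 200) on a dark background (50)
image = np.full((8, 8), 50)
image[2:6, 2:6] = 200

mask = threshold_segment(image, threshold=100)
print(mask.sum())  # 16 pixels labelled as ROI
```

In practice, the edge-based and region-based modifications described next are applied on top of such an initial mask to smooth boundaries and remove outlier pixels.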
Although this method is intuitive, performance is limited due to the complex nature of human tissues.
Modern threshold-based segmentation often contains edge-based, region-based, or hybrid modifications
after the initial segmentation (Leo, Lim, & Suneetha, 2009; Sharma et al., 2010). Edge-based
modification methods use non-maximal suppression and hysteresis thresholding to suppress pixels
which are potential outliers, smoothing the boundary and eliminating holes inside the ROI.
In region-based segmentation, a radiologist provides an initial seed point inside the lesion. Then, all the
pixels adjacent to the point that have similar intensities are selected and labelled (Junfeng &
Yunyang, 2012). Although it has been shown that the region-based approach has superior performance
in tumor segmentation, this approach needs additional manual inputs, and extensive pre-processing,
which limit its applications. Additionally, since these segmentation algorithms solely depend on pixel
intensities, artifacts and partial volume effects have significant impacts on their results. Consequently,
due to these limitations, researchers have started to investigate other segmentation methods including
deep learning based segmentation which will be discussed in the following sections (Kaur & Kaur,
2014).
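A minimal sketch of the region-growing idea, assuming a toy 2D image, 4-connected neighbours, and a simple intensity tolerance; clinical implementations add the pre-processing and artifact handling discussed above.

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tolerance):
    """Grow a region from a radiologist-provided seed point, adding 4-connected
    neighbours whose intensity is within `tolerance` of the seed intensity."""
    mask = np.zeros(image.shape, dtype=bool)
    seed_value = image[seed]
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                    and not mask[nr, nc]
                    and abs(int(image[nr, nc]) - int(seed_value)) <= tolerance):
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

# Toy image: a bright 3x3 lesion (intensity 200) embedded in background (50)
image = np.full((8, 8), 50)
image[2:5, 2:5] = 200
lesion_mask = region_grow(image, seed=(3, 3), tolerance=20)
print(lesion_mask.sum())  # 9: only the connected bright pixels are labelled
```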
1.2.4 Feature extraction
The second step of the radiomics pipeline is feature extraction. In general, features can be categorized
into two groups, "semantic" and "agnostic" (Gillies et al., 2015). Semantic features are commonly used
by radiologists, including size, shape, location, vascularity, and attachments. In contrast, agnostic
features are quantitative features describing the texture and intensity distributions (Khalvati, Zhang,
Wong, et al., 2019). Table 1.3 below summarizes common features in recent radiomics studies.
Table 1.3: List of common features (Lambin et al., 2017a)
Semantic: shape, location, vascularity
First order: mean, median, entropy
Second order: heterogeneity, Haralick textures
Higher order: fractal dimensions, wavelets, Laplacian
Researchers have worked for decades to expand the feature banks and extract more useful information
from medical images (Aerts et al., 2014). During the past decade, the size of a typical radiomics feature
bank has expanded from less than one hundred to more than few thousands (van Griethuysen et al.,
2017). A more comprehensive feature bank helps to identify more quantitative imaging markers for
diagnosis and prognosis (Aerts et al., 2014; V. Kumar et al., 2013; Parekh & Jacobs, 2016). At the same
time, a higher number of features increases the complexity of the feature map and induces the danger of
false positives or overfitting (Yucheng Zhang et al., 2017).
In addition, researchers have often developed in-house feature banks based on different programming
languages, including Python, MATLAB, or C++. Although the features' names are the same, it is
common that the formulas differ slightly, making studies irreproducible (Khalvati, Zhang, Baig,
et al., 2019). PyRadiomics, as an open source feature extraction tool, was developed to address these
challenges (van Griethuysen et al., 2017). It enables basic pre-processing and provides a comprehensive
feature bank for researchers in the radiomics field (Lambin et al., 2017; van Griethuysen et al., 2017).
Currently, the PyRadiomics library implements 120 features (van Griethuysen et al., 2017). These
features can be extracted from the original image, or from images derived through filters (e.g., high-pass
or low-pass filters). The PyRadiomics library has 19 first order features, including energy,
total energy, entropy, minimum, 10th percentile, 90th percentile, maximum, mean, median, interquartile
range, range, mean absolute deviation, robust mean absolute deviation, root mean squared, standard
deviation, skewness, kurtosis, variance, and uniformity (van Griethuysen et al., 2017). These features
describe the distribution of pixel intensities in the ROI. Among these first order features, entropy, which
is a measurement of randomness in image values, has been found to be significantly associated with
overall survival of cancer patients in multiple studies (Ganeshan, Abaleke, Young, Chatwin, & Miles,
2010; Ganeshan, Panayiotou, Burnand, Dizdarevic, & Miles, 2012; Y. Huang et al., 2016; Yucheng
Zhang et al., 2017).
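For illustration, first-order entropy can be computed directly from the ROI's intensity histogram. The bin count below is an arbitrary assumption, and PyRadiomics' implementation differs in its discretization details.

```python
import numpy as np

def first_order_entropy(roi_values, bins=16):
    """Shannon entropy of the ROI intensity histogram (a first-order feature)."""
    counts, _ = np.histogram(roi_values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                       # drop empty bins (0 * log 0 is taken as 0)
    return -np.sum(p * np.log2(p))

homogeneous_roi = np.zeros(100)                          # perfectly uniform tissue
heterogeneous_roi = np.random.default_rng(0).random(100)

print(first_order_entropy(homogeneous_roi))    # zero entropy: no randomness
print(first_order_entropy(heterogeneous_roi))  # higher value: more heterogeneity
```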
Additionally, PyRadiomics provides formulas for 75 texture features, including Sum Entropy and
Cluster Tendency, which have been shown to be significantly associated with PDAC prognosis
(Khalvati, Zhang, Baig, et al., 2019).
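As a sketch of how such texture features arise, the following computes a Gray-Level Co-occurrence Matrix (GLCM) for a single horizontal offset and its Sum Entropy on a toy, pre-discretized image. This is a deliberately minimal illustration; PyRadiomics aggregates multiple offsets and handles discretization and symmetry differently.

```python
import numpy as np

def glcm(image, levels):
    """GLCM for the horizontal neighbour offset (0, 1), as joint probabilities."""
    m = np.zeros((levels, levels))
    for i, j in zip(image[:, :-1].ravel(), image[:, 1:].ravel()):
        m[i, j] += 1
    return m / m.sum()

def sum_entropy(p):
    """Sum Entropy: entropy of p_{x+y}(k), the probability that i + j = k."""
    levels = p.shape[0]
    index_sums = np.add.outer(np.arange(levels), np.arange(levels))
    px_plus_y = np.array([p[index_sums == k].sum() for k in range(2 * levels - 1)])
    nz = px_plus_y[px_plus_y > 0]
    return -np.sum(nz * np.log2(nz))

# Toy 4x4 image already discretized to 4 gray levels
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [2, 2, 3, 3],
                  [2, 2, 3, 3]])
p = glcm(image, levels=4)
print(round(sum_entropy(p), 3))  # log2(6) ≈ 2.585: six equiprobable GLCM entries
```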
1.2.5 Feature analysis and model building
Using open source libraries such as PyRadiomics, researchers are able to extract thousands of features
from a given ROI (van Griethuysen et al., 2017). Following that, the third step of the radiomics analytic
pipeline is feature analysis and model building (Parmar, Grossmann, Bussink, Lambin, & Aerts, 2015;
Yucheng Zhang et al., 2017). Although a vast number of quantitative features can be extracted from
medical images, many of them are simply noise, or highly correlated with other features (Yip & Aerts,
2016). Hence, feature reduction is critical to select useful and unique features, minimizing the
computational cost while increasing the prediction accuracy (Yucheng Zhang et al., 2017).
In general, feature reduction procedures can be categorized as supervised or unsupervised methods
(Parmar, Grossmann, et al., 2015). In supervised feature selection, such as filtering feature selection,
features are selected based on their discriminative value of outcomes. Conventional supervised feature
selection methods include parametric or semi-parametric tests such as t-tests, U-tests, and the Cox
Proportional Hazards Model (Yucheng Zhang et al., 2017). For binary outcomes, researchers often
compare the distribution of features for positive and negative groups such as disease recurrence and non-
recurrence groups. If these two groups have a significant difference in terms of feature value, then the
feature will be considered useful. Based on different types of outcomes and assumptions (binary or
multinomial, normal distribution or non-normal distribution), ANOVA, t-tests, or Wilcoxon U tests are
applied accordingly (Yucheng Zhang et al., 2017). In early radiomics studies, many researchers failed to
check the assumptions of these tests (Coroller et al., 2015a). Furthermore, although these tests are
straightforward, the multiple testing problem is inevitable with fast-growing feature space (Yip & Aerts,
2016). Consequently, these limitations restrict the applications of supervised feature selection methods.
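To make filtering feature selection and the multiple testing problem concrete, the sketch below compares groups with a permutation test on the difference in means (avoiding the distributional assumptions of a t-test) and applies a Bonferroni-corrected threshold. The feature matrix and group labels are synthetic, illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(x, y, n_perm=2000):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(perm[:len(x)].mean() - perm[len(x):].mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy feature matrix: 40 patients x 3 features; only feature 0 actually differs
# between the recurrence (first 20 rows) and non-recurrence (last 20 rows) groups.
X = rng.normal(size=(40, 3))
X[:20, 0] += 2.0
recurrence, no_recurrence = X[:20], X[20:]

alpha = 0.05 / X.shape[1]   # Bonferroni correction for testing 3 features
selected = [f for f in range(X.shape[1])
            if permutation_p_value(recurrence[:, f], no_recurrence[:, f]) < alpha]
print(selected)  # feature 0 should survive the corrected threshold
```

With thousands of radiomics features instead of three, the corrected threshold becomes correspondingly stricter, which is exactly the limitation noted above.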
In contrast, unsupervised feature reduction is based on dimensionality reduction algorithms, maintaining
more information in the dataset. Among non-filtering feature selection methods, Principal Component
Analysis (PCA) is the most popular approach. It selects a small number of uncorrelated variables, called
“principal components”, which could explain most of the variation in the data (Abdi & Williams, 2010).
A similar approach is called Independent Component Analysis (ICA), which removes not only
correlations among the variables, but also higher-order dependencies. Other common unsupervised
feature selection methods are zero variance (ZV) and near zero variance (NZV). These two algorithms
remove features with zero or near zero variance (Kuhn, 2008). In radiomics studies, NZV and ZV are
particularly practical. When the ROI is very small (e.g., 4 pixels), the open source libraries might fail to
extract meaningful features, resulting in columns of zeros or missing values. In this condition, ZV and
NZV methods are extremely valuable since they can efficiently remove those features (Yucheng Zhang
et al., 2017).
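A minimal sketch of these unsupervised steps: removing (near-)zero-variance columns and then projecting the remaining features onto principal components via SVD. The synthetic feature matrix is illustrative.

```python
import numpy as np

def remove_zero_variance(X, tol=1e-12):
    """Drop feature columns with (near-)zero variance, e.g. columns of zeros
    produced when extraction fails on a tiny ROI."""
    keep = X.var(axis=0) > tol
    return X[:, keep], keep

def pca_scores(X, n_components):
    """Project the centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X[:, 2] = 0.0          # a failed-extraction column (zero variance)
X[:, 4] = X[:, 0]      # a perfectly correlated duplicate feature

X_kept, keep = remove_zero_variance(X)
print(keep)                          # column 2 is dropped
print(pca_scores(X_kept, 2).shape)   # (30, 2): uncorrelated component scores
```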
After selecting useful features, model building is the last step in the traditional radiomics analytics
pipeline. Radiomics-based prognosis models utilize the quantitative imaging features for predictions of
outcomes (e.g. Survival vs. Death) (Parmar, Grossmann, et al., 2015; Yucheng Zhang et al., 2017). In
the machine learning domain, classification is considered as a supervised learning task of inferring a
function from labelled training data (Yucheng Zhang et al., 2017). The classification algorithm analyzes
the training data and outcomes (labels), minimizing the loss function and building predictive models.
Common classification models in radiomics studies include Random Forest and generalized linear
model. The Random Forest model is generally developed by building hundreds of small decision trees
(Breiman, 2001; Hawkins et al., 2016). Each decision tree receives a subset of the full data. Under this
condition, although each tree has limited predictive power, the ensemble of trees gains the ability to
classify outcomes. The Random Forest model has several advantages. For most classification tasks,
Random Forest works well without tuning any parameters (Parmar, Grossmann, et al., 2015).
Additionally, due to the subsampling, Random Forest tends not to overfit. Finally, a Random Forest
model can handle not only linear features but also non-linear or categorical features, making it suitable
for radiomics studies.
However, since training a Random Forest is effectively a black-box process, logistic regression is often
preferred as an intuitive classification method (Fernández-Delgado, Cernadas, Barro, Amorim, &
Amorim Fernández-Delgado, 2014; H. Wang et al., 2010). Ordinary linear regression fails to model the
probabilities of binary outcomes, since probabilities are bounded between 0 and 1. As a type of
generalized linear model, logistic regression performs classification by applying a logit transformation
to the probability, extending its range (Sperandei, 2014). In general, logistic regression is easy to
understand and requires less data to achieve acceptable performance. Given that, researchers often
choose between Random Forest and a generalized linear model when building a radiomics-based
classification model.
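To make the logit idea concrete, the following is a minimal logistic regression trained by gradient descent on synthetic one-feature data; it is a sketch under these assumptions, not the implementation used in the cited studies.

```python
import numpy as np

def sigmoid(z):
    # inverse of the logit: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Minimal logistic regression trained by gradient descent on the log-loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

# Synthetic data: a single feature separates recurrence (1) from non-recurrence (0)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1))
y = (X[:, 0] > 0).astype(float)

w, b = fit_logistic(X, y)
probabilities = sigmoid(X @ w + b)    # always strictly within (0, 1)
accuracy = ((probabilities > 0.5) == y).mean()
print(round(accuracy, 2))
```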
It is worth noting that many clinical outcomes have unbalanced ratios (e.g., survival outcomes for cancers
with poor prognosis), which do not meet the assumption of balanced endpoints for most machine
learning algorithms. To tackle this problem, subsampling methods, including down-sampling, up-
sampling, and Synthetic Minority Over-sampling Technique (SMOTE), are applied (Blagus et al.,
2013). The down-sampling method removes "majority" cases during model training, while the
up-sampling method duplicates minority cases. These two methods are intuitive but either lose
information or create a "non-universal decision region" since the generated data points are duplicates. It
has been shown that in radiomics studies, these two methods are not beneficial for the prognosis models
(Yucheng Zhang et al., 2017).
On the other hand, as an enhanced sampling method, SMOTE creates “simulated samples” based on
Euclidean distance in the feature space (Blagus et al., 2013). As a result, the synthetic cases have attributes with
values similar to the existing cases and are not merely replications as provided by oversampling. Thus,
SMOTE can effectively increase the representation of the minority class while reflecting the structure of
the original samples. Zhang et al. showed that, in radiomics based prognosis model, adding SMOTE will
significantly improve the model’s performance (Yucheng Zhang et al., 2017).
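A minimal SMOTE sketch along the lines described above: each synthetic case is interpolated between a minority case and one of its k nearest neighbours under Euclidean distance. The parameters and data are illustrative assumptions.

```python
import numpy as np

def smote(minority, n_synthetic, k=3, rng=None):
    """Minimal SMOTE: each synthetic case lies on the line segment between a
    minority case and one of its k nearest (Euclidean) minority neighbours, so
    attribute values are similar to, but not duplicates of, existing cases."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        distances = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]   # skip the case itself
        j = rng.choice(neighbours)
        gap = rng.random()                            # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 4))   # 10 minority-class patients, 4 features
new_cases = smote(minority, n_synthetic=15, rng=rng)
print(new_cases.shape)  # (15, 4)
```

Because every synthetic case is a convex combination of two real minority cases, each attribute stays within the range observed in the minority class, reflecting the structure of the original samples.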
1.2.6 Current progress
In the following sections, recent radiomics studies for cancer diagnosis, prognosis, or treatment response
will be discussed.
Lung Cancer
A large number of representative radiomics studies are based on lung cancer. For lung cancer diagnosis,
Kumar et al. trained a radiomics feature based classification model using CT images and achieved
sensitivity and specificity of 79.6% and 76.1% respectively (D. Kumar et al., 2015). In another study,
researchers trained a radiomics feature based classification model using low-dose CT to predict
malignant nodules, achieving an accuracy of 80% (Hawkins et al., 2016; Y. Liu et al., 2017). The
features from CT images also showed significant association with TNM staging. In a study of 1019
patients, Aerts et al. found that, 238 features have associations with cancer staging (Aerts et al., 2014;
Parmar, Leijenaar, et al., 2015). A recent publication from Zhou et al. confirmed these findings (H. Zhou
et al., 2018). Also, radiomics studies for lung cancer are not limited to CT images but extend to PET/CT
as well. Wu et al. found that features from PET images are also associated with cancer
staging (Wu et al., 2016).
Head and neck
Similar to lung cancer, several radiomics studies were conducted for head-and-neck cancer. It has been
found that CT and MR based features have significant associations with staging in head and neck
cancer, primarily features from contrast-enhanced T1-weighted (T1w) MR and T2-weighted (T2w) MR
images (Ren et al., 2018; Z. Zhou et al., 2018). For Nasopharyngeal Cancer (NPC), it has been found
that radiomics features derived from MRI show prognostic values (B. Zhang et al., 2017a).
Additionally, a few features also showed significant associations with patients' responses to chemotherapy
and radiotherapy (Gabryś, Buettner, Sterzing, Hauswald, & Bangert, 2018). Further research showed
that, compared to traditional models using clinical factors, radiomics models have better prognostic
performance for patients with high-grade osteosarcoma. These findings further confirmed the potential
of radiomics in translational research and precision medicine (B. Zhang et al., 2017b).
Brain tumors
In brain tumors, several genetic markers have been shown to be associated with prognosis, including P53, ATRX, and
MGMT (Kickingereder et al., 2018; Yiming Li, Liu, et al., 2018; Yiming Li, Qian, et al., 2018; Xi et al.,
2018). A recent study found that, adding radiomics features to the genomics prognosis model further
improved its performance (Itakura et al., 2015). As a quantitative description of tumors’ phenotypes,
some radiomics features are also associated with these genomic markers (Itakura et al., 2015). Bai et al.
showed that, even without genomic information, radiomics features alone can provide accurate
staging in brain tumors (Bai et al., 2016). Recent studies confirmed these findings for glioma
prognosis using features from PET and MRI (Papp et al., 2018; Pérez-Beteta et al., 2018).
Colorectal cancer
Determining genetic mutation is an essential step in colorectal cancer management as stated in NCCN
(National Comprehensive Cancer Network) guideline (Benson et al., 2018). However, genetic testing
has an extra cost and introduces unfavourable waiting times for cancer patients. It has been shown that
radiomics features may address this problem. Using radiomics features from preoperative CT images, Yang
et al. built a classifier for these genetic mutations with an AUC of 0.87 (L. Yang et al., 2018). With larger
sample sizes and multi-cohort validation, radiomics-based models have the potential to replace
genetic testing in colorectal cancer, saving both time and money for patients.
Knowing patients’ responses to chemotherapy is also vital for healthcare professionals in designing
personalized treatment plans. It has been shown that 15%–27% of patients achieve a complete response
to chemotherapy or radiation therapy, avoiding surgery (Maas et al., 2010; Sanghera, Wong,
McConkey, Geh, & Hartley, 2008). However, assessing a patient’s response to colorectal cancer treatment is
challenging. Using radiomics features from T2w and diffusion-weighted (DWI) images, several
radiomics-based models achieved high AUCs ranging from 0.93 to 0.98 (Horvat et al., 2018;
Nie et al., 2016). These findings suggest the potential of using radiomics features to assess patients’
responses to therapies before surgery.
Besides assessing responses, radiomics models were also built to differentiate low-risk and high-risk
patients with colorectal cancer. A recent study found that a radiomics model was able to differentiate
high-risk from low-risk patients based on their preoperative CT with an AUC of 0.84 (Meng et al., 2018).
Pancreatic cancer
Several radiomics studies have been conducted in the pancreatic cancer domain. In a single-cohort study,
Eilaghi et al. found that features named “dissimilarity” and “inverse difference normalized” are
associated with overall survival in patients with resectable Pancreatic Ductal Adenocarcinoma (Eilaghi
et al., 2017). In other studies, radiomics features were found to be predictive of patients’ responses to
chemoradiation therapy (X. Chen et al., 2017; Cozzi et al., 2019). In a recent multi-cohort study,
Khalvati et al. found that features from the PyRadiomics feature bank can be fused into a signature
that is predictive of overall survival (Khalvati, Zhang, Baig, et al., 2019). Further validation of these
features and signatures is needed to assess their prognostic performance.
1.2.7 Limitations of traditional radiomics analytic pipeline
Although previous studies have found several radiomic features with significant associations with
clinical outcomes, including survival and recurrence, across different types of cancer, traditional radiomics
pipelines have several drawbacks, including multiple testing, sample size, performance, interpretability,
reproducibility, and reliability (Lambin et al., 2017; Yip & Aerts, 2016).
Multiple testing, also called the multiple comparison problem, is one of the common flaws of radiomics
studies. It occurs when researchers conduct a set of statistical inferences simultaneously, inducing
potential false-positive findings (Yucheng Zhang et al., 2017). Since feature banks are large, thousands
of features are extracted and tested. Although a higher number of features provides more information
about medical images, the number of tests also grows, making the multiple comparison problem
even worse. Setting α as 0.05, we expect to see five significant results from 100 tests using random data.
Hence, in the radiomics field, since the number of features is usually large (e.g., above 1,000), the impact
of the multiple testing problem is significant and unavoidable. It is even more problematic when we consider
the probability of encountering at least one false positive. Given α as the false-positive rate for a single test
and m as the number of tests, the family-wise error is calculated as follows:
Family-wise Type I Error Rate (FWER) = Prob(at least one false positive) = 1 − (1 − α)^m
With 100 tests, the chance of encountering at least one false positive is 0.9941. In most radiomics
studies, the number of tests is much higher than 100; hence, without multiple testing control, there is a
high chance of false-positive findings. The Bonferroni correction, a standard multiple testing control
method, was designed to control this FWER (“Etymologia: Bonferroni correction.,” 2015). Given the
probability formula,
FWER = 1 − (1 − α)^m

we can derive that:

α′ = 1 − (1 − FWER)^(1/m)
Under this correction, we reject the null hypothesis when the p-value is below α′. Given m = 100 and
FWER at 0.05, this exact form (strictly, the Šidák correction; Bonferroni’s simpler approximation is
α′ = α/m) gives a new α′ of 0.000513. In a typical radiomics study, where more than one thousand
features are tested simultaneously, the critical value must be very small to keep the family-wise error
rate (FWER) at the 0.05 level.
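The numbers above can be reproduced in a few lines of Python:

```python
# Sketch of the formulas above with alpha = 0.05 and m = 100 tests.
alpha, m = 0.05, 100

# Probability of at least one false positive across m independent tests
fwer = 1 - (1 - alpha) ** m              # ~0.9941

# Exact per-test threshold keeping the FWER at 0.05 (the Sidak form)
alpha_exact = 1 - (1 - alpha) ** (1 / m)  # ~0.000513

# Bonferroni's simpler approximation of the same threshold
alpha_bonferroni = alpha / m              # 0.0005
```

For m in the thousands, both thresholds shrink towards zero, which is exactly why so few features survive family-wise control in large feature banks.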
However, Bonferroni correction assumes that each test is independent, which is not necessarily true in
radiomics studies since many features share similar formulas. In some cases, a feature can be a linear
combination of other features. Under this condition, the Bonferroni correction may be too conservative,
leading to more false negatives (Type II errors). Thus, in recent studies, an increasing number of
researchers have used the FDR (False Discovery Rate) control defined by Benjamini and Hochberg (S.-Y. Chen,
Feng, & Yi, 2017; Horvat et al., 2018). The false discovery rate is defined below:
FDR = False Positives / (False Positives + True Positives)
FDR control offers an approach that may increase testing power while bounding the error rate
(S.-Y. Chen et al., 2017). In practice, the threshold of the Benjamini–Hochberg FDR control is calculated as:

T_BH = max{ P_(i) : P_(i) ≤ (i/m)·α, 1 ≤ i ≤ m }

where P_(1) ≤ P_(2) ≤ … ≤ P_(m) are the p-values sorted in ascending order.
Compared to Bonferroni, FDR control offers more power, which is favoured by researchers in the
radiomics field. Several studies have been published using the FDR control (Coroller et al., 2015;
Khalvati, Zhang, Baig, et al., 2019). A systematic review paper also suggested that, in future radiomics
studies, FDR control should play an important role (Parekh & Jacobs, 2016; Yip & Aerts, 2016).
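The step-up procedure behind this threshold can be sketched in a few lines of Python; the p-values below are invented purely for illustration:

```python
# Sketch of the Benjamini-Hochberg step-up procedure described above.
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices of the hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ascending p-values
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:   # the largest such rank sets T_BH
            k_max = rank
    return sorted(order[:k_max])         # reject everything up to that rank

benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9])
# [0, 1]: only the two smallest p-values survive at q = 0.05
```

Note that a plain Bonferroni threshold of 0.05/8 ≈ 0.0063 would reject only the single smallest p-value here, which illustrates the extra power of FDR control.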
Furthermore, with a large number of features, radiomics studies generally have small sample sizes,
leading to the “large P, small N” problem, where the number of features is much larger than the
sample size (Yucheng Zhang et al., 2017). Most radiomics studies include fewer than 500 patients; a
recent study was even published using images from only eight patients (Nguyen et al., 2018).
A limited sample size reduces the statistical power of tests and further hinders the performance of any
radiomics-based model. A list of recent studies is presented in Table 1.4 along with their sample sizes
and performance.
Table 1.4: List of recent radiomics studies and their performance in AUC

Domain              Sample size   Performance                          Reference
Pancreatic (PET)    139           Overall survival, AUC: 0.66          (Cui et al., 2016)
Breast (MR)         89            Cancer recurrence, AUC: 0.88         (H. Li et al., 2016)
Lung (CT)           282           Overall survival, AUC: 0.72          (Y. Huang et al., 2016)
Lung (CT)           113           Distant metastasis, AUC: 0.67        (Huynh et al., 2016)
Lung (CT)           182           Distant metastasis, AUC: 0.61        (Coroller et al., 2015)
Lung (CT)           196           Cancer screening, AUC: 0.83          (Y. Huang et al., 2016)
Lung (CT)           422           Overall survival, AUC: 0.65          (Aerts et al., 2014)
Colorectal (PET)    326           Overall survival, AUC: 0.74          (Y.-Q. Huang et al., 2016)
Oesophageal (CT)    106           Treatment response, AUC: 0.75        (Cunliffe et al., 2015)
Oesophageal (PET)   217           Overall survival, AUC: 0.77          (van Rossum et al., 2016)
It is clear that most studies have sample sizes under 500 and AUCs below 0.8. In terms of performance,
most studies are still far from the clinical standard. Multi-center collaboration may address these problems,
and several recent studies were conducted in this manner (Aerts et al., 2014; Khalvati, Zhang, Baig, et
al., 2019). However, ethics approvals and the protection of patients’ privacy make multi-center studies
difficult to conduct.
Interpretability is another issue with current radiomics studies. Though hundreds of significant features
have been found, researchers and clinicians still have a limited understanding of the biological nature of
these features (Gillies et al., 2015). Compared to semantic features, radiomics features generally lack
visual descriptions (Morin, 2018). This makes healthcare professionals more reluctant to integrate
radiomics into clinical practice (Morin, 2018). Without a doubt, future radiomics research should
address this limitation.
Last but not least, reproducibility and reliability are further limitations of radiomics studies (Traverso,
Wee, Dekker, & Gillies, 2018). As discussed in the pipeline, radiomics studies involve image
acquisition, segmentation, feature extraction, and feature analysis. This complex process adds a
significant amount of variation (Khalvati, Zhang, Baig, et al., 2019; B. Zhao et al., 2016). Different
centers have different CT or MRI scanners, which may have different signal-to-noise profiles.
Additionally, manual segmentation depends heavily on the experience of radiologists. Furthermore,
different feature banks or programming languages may also affect feature extraction (van
Griethuysen et al., 2017). Finally, feature preprocessing before the analysis and the parameters used in
the classification model affect the model’s performance as well. In the end, these variations lead to
non-reproducible studies (Lambin et al., 2012). Fortunately, researchers have recognized the issue and
started working on the IBSI (Image Biomarker Standardization Initiative) and the Radiomics Quality
Score (Sanduleanu et al., 2018; Zwanenburg, Leger, Vallières, & Löck, 2016). These efforts should
improve the quality and reproducibility of radiomics studies.
1.3 Deep learning in medical imaging
1.3.1 Neural Network and CNN
As discussed above, radiomics has been developed over decades, and the performance of radiomics
models is approaching a plateau (Lao et al., 2017). As deep learning has gained public attention, deep
learning techniques are playing an increasingly important role in medical imaging studies (Litjens, Kooi,
Bejnordi, Setio, et al., 2017; Thomaz, Carneiro, & Patrocinio, 2017; van Griethuysen et al., 2017; R.
Yamashita et al., 2018). As a deep learning architecture specialized for image-related tasks, the
Convolutional Neural Network (CNN) has become the preferred method in medical imaging studies
(R. Yamashita et al., 2018).
The development of CNNs started in 1962, when Hubel and Wiesel found that some neurons in the visual
cortex of the brain respond only to edges of certain orientations (Hubel & Wiesel, 1968). Inspired by
this, Fukushima proposed a self-organizing neural network model for pattern recognition in 1980 (Fukushima,
1980). Later, using backpropagation, Yann LeCun developed LeNet, which is considered the
predecessor of modern CNN models (LeCun et al., 1990). However, the performance of early
CNNs was limited: although CNNs proved effective in handwritten digit recognition,
traditional feature-based machine learning models performed better on general tasks.
In 2012, AlexNet from Hinton’s lab reversed this trend. By introducing the new activation function ReLU
and dropout, AlexNet was deeper (having more layers) than previous CNN models. In
ImageNet-2012, AlexNet achieved a top-5 error rate of 18.9%, significantly lower than that of
previous models (Krizhevsky et al., 2012). The success of AlexNet changed scientists’ minds and
sparked a “deep learning revolution”. To implement CNN architectures effectively in medical imaging
studies, it is important to understand the components of Convolutional Neural Networks. As such, a
detailed discussion is provided in the following sections.
A typical CNN consists of multiple layers, including convolutional layers, pooling layers, and fully
connected layers. As input images pass through these layers, they are converted into feature maps,
enabling the CNN to make classifications (B, 2013; Krizhevsky et al., 2012). A simplified example of
the CNN architecture is shown in Figure 1.2 below.
Figure 1.2: Typical CNN architecture
Convolution layer
The convolution layer is the foundation of the CNN architecture (Krizhevsky et al., 2012). Convolution
is a linear operation in which a small array of weights, called a kernel, is applied across the input image.
Since digital images are stored as arrays of numbers, the convolution operation generates a feature map,
as shown in Figure 1.3 (R. Yamashita et al., 2018).
Figure 1.3: Graphical presentation of convolution operations
A. Convolution operations for a 5×5 input tensor, step 1
B. Convolution operations for a 5×5 input tensor, step 2
C. Convolution operations for a 5×5 input tensor, step 9
The results of convolution operations are influenced by several parameters, including the weights in the
kernel, the stride, the size of the kernel, and padding (R. Yamashita et al., 2018). First, changing the weights in
a kernel changes the final feature map. During training, the weights in the kernels are tuned so
that the generated features provide useful information. Second, since the kernel moves across the
input image in a step-by-step manner, the distance of each step, defined as the stride, is
critical (Dů et al., n.d.). A larger stride induces faster down-sampling of the input image. Third, the
size of the kernel is also an important hyperparameter. It ranges from the most common 3×3 to 5×5 or
even 7×7. A smaller kernel generally generates local features and reduces image dimensions slowly,
allowing the network to be deeper, which usually offers better performance (H. Liu, Li, Lv,
& Huang, 2017). In contrast, larger kernels have a larger receptive field and reduce image dimensions
quickly. Taking the example from Figure 1.3, to reduce a 5×5 image tensor to 1×1, one can choose
between two layers of 3×3 kernels or one 5×5 kernel. The former approach is more popular since it has
fewer weights (2×3×3 compared to 5×5) and offers an extra layer, which may provide better performance
(Litjens, Kooi, Bejnordi, Setio, et al., 2017).
Last but not least, padding is another critical factor. Padding was developed to control the dimension
reduction of input images (Krizhevsky et al., 2012; R. Yamashita et al., 2018). Zero-padding is the
most common type: it adds columns and rows of zeros on each side of the input image, as
shown in Figure 1.4 below (R. Yamashita et al., 2018). After padding, convolution operations can be
performed without reducing image dimensions, so that the model can afford deeper layers.
Figure 1.4: Graphical representation of zero-padding
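The convolution, stride, and zero-padding behaviour described above can be sketched in a few lines of NumPy; the input values and kernel here are arbitrary, and only the output shapes matter:

```python
import numpy as np

# Sketch of the convolution operation with stride and zero-padding.
def conv2d(image, kernel, stride=1, pad=0):
    if pad:
        image = np.pad(image, pad)                 # rows/columns of zeros
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)     # element-wise product, summed
    return out

x = np.arange(25, dtype=float).reshape(5, 5)       # a 5x5 input tensor
k = np.ones((3, 3))                                # a 3x3 kernel

conv2d(x, k).shape            # (3, 3): the output shrinks without padding
conv2d(x, k, pad=1).shape     # (5, 5): zero-padding preserves the dimensions
conv2d(x, k, stride=2).shape  # (2, 2): a larger stride down-samples faster
```

In a real CNN, the kernel weights are not fixed ones as here; they are the parameters tuned during training.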
Activation layer
Feature maps generated by convolution layers typically pass through subsequent activation layers (R.
Yamashita et al., 2018). The most common activation function is the ReLU (rectified linear unit), which
gives output based on the following formula:
f(x) = max(0, x)
The rationale behind ReLU is that it provides a non-linear transformation (Krizhevsky et al., 2012; R.
Yamashita et al., 2018). Without the non-linear activations, the deep learning network is essentially a
linear model, which is not suitable for modeling real-world non-linear relationships. Other common
non-linear activation functions include the sigmoid (logistic) activation function, the hyperbolic tangent
activation function, and derivatives of the original ReLU. As discussed above, the logit transformation
extends the range; thus, a “reversed logit transformation”, also called the sigmoid function, restricts the
range to (0, 1) based on the following formula:
f(x) = 1 / (1 + exp(−x))
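Both activations can be written directly from their formulas; a minimal sketch:

```python
import math

# Sketch of the two activation functions above, applied to single values.
def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

relu(-2.0)    # 0.0 -- negative inputs are clipped to zero
relu(3.5)     # 3.5 -- positive inputs pass through unchanged
sigmoid(0.0)  # 0.5 -- the output is restricted to the range (0, 1)
```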
Pooling layer
Pooling layers are commonly applied in CNNs to reduce the number of parameters that need to be
trained. Max-pooling, one of the most common pooling operations, extracts patches from the input
feature maps and returns the maximum value in each patch (R. Yamashita et al., 2018). The most
frequently used patch size is 2×2, which down-samples each spatial dimension by a factor of 2,
significantly reducing the number of trainable parameters.
Figure 1.5: Graphical representation of max pooling
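The max-pooling operation described above can be sketched as follows; the feature-map values are arbitrary:

```python
import numpy as np

# Sketch of 2x2 max pooling: each non-overlapping 2x2 patch of the feature
# map is reduced to its maximum value, halving each spatial dimension.
def max_pool(fmap, size=2):
    out_h, out_w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

f = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 1.],
              [0., 2., 9., 7.],
              [1., 1., 5., 8.]])
max_pool(f)   # [[6., 2.], [2., 9.]] -- a 4x4 map becomes 2x2
```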
Fully Connected layer
Figure 1.6: Graphical representation of Fully Connected Layers
Through a series of convolution, activation, and pooling layers, input images are transformed into
2D or 3D feature maps. These feature maps are flattened, as shown in Figure 1.6, and used as the input to
fully connected (FC) layers (Krizhevsky et al., 2012; LeCun, Bengio, & Hinton, 2015). These layers are
called fully connected because each neuron has full connections to all activations in the previous layer.
Each neuron in an FC layer processes the input vector x and returns its output using the formula
below:
output = g(Wx + b)
In the formula, x is the input vector, b is the bias vector, and W is a weight matrix. Finally, g is
the activation function. Through a series of such calculations, the final FC layers can generate
probabilities or classifications of the target outcomes.
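As a minimal numeric sketch of output = g(Wx + b), with ReLU as g; the sizes (3 inputs, 2 neurons) and all values are arbitrary:

```python
import numpy as np

# One fully connected layer: output = g(Wx + b).
W = np.array([[0.5, -1.0, 2.0],    # one row of weights per neuron
              [1.0,  0.0, -0.5]])
b = np.array([0.1, -0.2])          # one bias per neuron
x = np.array([1.0, 2.0, 3.0])      # flattened feature vector

output = np.maximum(0.0, W @ x + b)   # ~[4.6, 0.0]: the second neuron is inactive
```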
Training a CNN
As discussed above, the weights in the kernels and FC layers have a significant impact on the final
outputs. Hence, mathematically, training a CNN means finding optimal weights such that the difference
between the model’s output and the ground truth is minimized. Loss functions are mathematical formulas
that measure this difference (LeCun et al., 2015). The most common loss function for classification is the
cross-entropy loss, presented below:
Cross-Entropy Loss = −( y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) )

where y_i is the true outcome labeled as 0 or 1, and ŷ_i stands for the predicted probability.
Cross-entropy loss penalizes confident but wrong predictions. Hence, minimizing the cross-entropy
loss shapes the model’s output to be similar to the ground truth (R. Yamashita et al., 2018). In
practice, model performance is measured by the loss function in a forward pass over the training data,
while the backpropagation and gradient descent algorithms allow the model to update the weights in its
kernels and FC layers (Lecun, Bottou, Bengio, & Haffner, 1998). The gradient descent algorithm
can be described by the following equation.
Repeat until convergence: w ← w − α · ∂L/∂w
Figure 1.7: Graphical representation of the gradient descent algorithm
In this equation, L is the loss function to be minimized, w is the weight vector, and α is the learning
rate. In every iteration, the weights are updated by subtracting the gradient of the loss function with
respect to the weights, scaled by the learning rate (R. Yamashita et al., 2018). For a larger α, the
weights are updated in larger steps towards the minimum. However, a large learning rate can
overshoot the minimum, so training fails to converge or even diverges. It is worth noting that, if
∂L/∂w is small, learning will also be slow. In other words, if the gradients “vanish”, the model cannot
be trained successfully (He, Zhang, Ren, & Sun, 2015). In a deep model, since the gradients of early
layers are obtained by multiplying the gradients of later layers, the gradient vanishes quickly. Hence, a
cap exists on the depth of traditional CNNs.
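The cross-entropy loss and the gradient descent update above can be tied together in a small sketch: a one-parameter logistic model trained on four invented (x, label) pairs. A CNN applies exactly the same rule to every weight via backpropagation.

```python
import math

# Cross-entropy loss of a one-parameter logistic model, minimized with the
# update w <- w - alpha * dL/dw. The data points are invented for illustration.
def cross_entropy(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, alpha = 0.0, 0.5

for _ in range(200):  # "repeat until convergence"
    # dL/dw of the cross-entropy for a logistic model is (y_hat - y) * x
    grad = sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)
    w -= alpha * grad                    # the gradient descent step

loss = sum(cross_entropy(y, sigmoid(w * x)) for x, y in data) / len(data)
# w has moved to a positive value and the average loss is now small
```

Rerunning the loop with a much larger alpha illustrates the overshooting behaviour described above.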
1.3.2 ResNet
It was hypothesized that a deeper CNN would be able to extract and utilize more complex features from
images. However, it was found that a 20-layer CNN has a lower error rate than a 56-layer CNN
(He et al., 2015), a degradation related to the vanishing gradient problem discussed above. To address
this, He et al. developed a new architecture called the residual block, shown below (He et al., 2015).
Figure 1.8: Graphical representation of identity path (He et al., 2015)
Compared to traditional CNNs, residual blocks have an additional connection, the “identity shortcut
connection”, which skips layers and transmits information directly. By doing so, the vanishing gradient
is controlled. In image recognition tasks, a 34-layer ResNet outperforms traditional CNNs by a
significant margin (He et al., 2015). In ImageNet classification, ResNet achieved a top-5 error rate of
3.57%, far better than that of its predecessor, AlexNet (Krizhevsky et al., 2012).
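The identity shortcut can be sketched in a few lines; F(x) below is a stand-in of two small dense layers with random placeholder weights, whereas real residual blocks use convolution layers:

```python
import numpy as np

# Minimal sketch of a residual block: output = g(F(x) + x).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))     # placeholder weights, not trained values
W2 = rng.normal(size=(4, 4))

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x):
    fx = W2 @ relu(W1 @ x)   # the residual transformation F(x)
    return relu(fx + x)      # identity shortcut: add the input back unchanged

x = np.ones(4)
y = residual_block(x)        # same shape as the input, so blocks can be stacked
```

Because the shortcut passes x through untouched, gradients can flow through the addition even when F's layers would otherwise attenuate them, which is how the residual connection mitigates the vanishing gradient problem.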
1.3.3 Transfer learning
As discussed above, training a CNN means tuning the weights in its kernels and FC layers. However, since
there is a large number of learnable parameters, a large sample size is required to train a
CNN successfully. It has been shown that the scale of a deep learning model and the size of the required data
have a linear relationship (Tan et al., 2018). Solving complex tasks in the medical imaging domain
therefore requires a large amount of data. However, collecting clinical data is time-consuming and
expensive (Tan et al., 2018). As discussed above, the small-data problem has become a critical obstacle
in most medical imaging studies, especially for rare diseases (Yip & Aerts, 2016). Even for common
diseases, ethics approvals and expert annotations are required before any experimental study, which is
without a doubt a time-consuming process. Transfer learning, a recently developed method, may offer
an alternative solution to the sample size limitation.
Transfer learning is defined as improving the learning in the target task by leveraging knowledge from
the source domain (Torrey & Shavlik, n.d.). It relieves the need for a large sample size, enabling
researchers to train a successful model using limited data (Tan et al., 2018). Tan et al. provided a
mathematical definition of transfer learning as shown below:
Definition of transfer learning: Given a learning task T_t based on D_t, where we can
get help from D_s for the learning task T_s, transfer learning aims to improve the
performance of the predictive function f_t(·) for learning task T_t by discovering and
transferring latent knowledge from D_s and T_s, where D_s ≠ D_t and/or T_s ≠ T_t. In
addition, in most cases, the size of D_s is much larger than the size of D_t
(N_s ≫ N_t) (Tan et al., 2018).
According to Tan et al., deep transfer learning methods can be divided into four categories, namely,
instance-based deep transfer learning, mapping-based transfer learning, network-based transfer learning,
and adversarial based transfer learning (Tan et al., 2018). In medical imaging-related tasks, network-
based deep transfer learning is the most relevant. Details of this transfer learning method will be
discussed below.
Network-based transfer learning is defined as reusing part of, or the full, network pre-trained in the
source domain, including its structure and weights (Tan et al., 2018). It applies to most CNN-based
deep learning models, since convolution layers can be considered feature extractors (Krizhevsky et al.,
2012; Lao et al., 2017; LeCun et al., 2015). Hence, network-based transfer learning is practical for
image-related tasks. A CNN can be pre-trained on a large dataset such as ImageNet, which contains 14
million images. By doing so, the CNN learns to extract useful information about shapes, textures, and
other features from images using optimized kernels. This ability can be transferred to a new model with a
small target domain by adopting the convolution layers. Depending on the sample size and the similarity
between the target and source domains, network-based transfer learning can be performed in two ways:
the fine-tuning method and the fixed feature extraction method, as shown in Figure 1.9 (R. Yamashita et
al., 2018).
Figure 1.9: Graphical representation of transfer learning in CNN
The fixed feature extraction method is straightforward: the convolutional base of the model is frozen and
these convolution layers are used as a feature extractor (D. George, Shen, & Huerta, 2017). As discussed
above, optimized convolution layers can extract shape and texture information. It has been shown that
the top layers of a CNN extract general features, while deeper layers capture details related to the
outcome labels (Zeiler & Fergus). Hence, when performing transfer learning, researchers need to
determine the depth of the feature extractor. When the target and source domains are similar (e.g.,
lung CT versus pancreas CT), features can be extracted from deeper layers (LeCun et al., 2015).
However, when a disparity exists between the target and source data (e.g., pancreas CT versus natural
images), the top layers should be used to generate general features (D. George et al., 2017).
Compared to the fixed feature extraction method, the fine-tuning method is more sophisticated, since it
not only adopts the convolution layers but also fine-tunes some of the deeper layers. Consequently, the
generated features are optimized for the target domain (R. Yamashita et al., 2018). However, this method
requires a larger dataset for fine-tuning, which limits its applications.
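As a conceptual sketch of the fixed feature extraction method: the random "pretrained" filters below stand in for convolution layers optimized on a large source domain (e.g., ImageNet), they are frozen, and only the small classifier on top is trained. The images and labels are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)
pretrained_filters = rng.normal(size=(8, 3, 3))   # frozen: never updated below

def extract_features(image):
    # Global max response of each frozen filter: a crude stand-in for the
    # convolution + pooling stages of a pretrained CNN.
    feats = []
    for filt in pretrained_filters:
        responses = [
            float(np.sum(image[i:i + 3, j:j + 3] * filt))
            for i in range(image.shape[0] - 2)
            for j in range(image.shape[1] - 2)
        ]
        feats.append(max(responses))
    return np.array(feats)

images = [rng.normal(size=(8, 8)) for _ in range(20)]    # small target dataset
labels = rng.integers(0, 2, size=20).astype(float)
X = np.stack([extract_features(img) for img in images])  # (20, 8) deep features

w = np.zeros(8)                      # only these weights are trained
for _ in range(100):                 # logistic regression on the deep features
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.01 * X.T @ (p - labels) / len(labels)
```

Fine-tuning would additionally allow updates to some of the deeper filters themselves, which requires more target data.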
Transfer learning enables “deep feature extraction” from images of the target domain (Afshar,
Mohammadi, Plataniotis, Oikonomou, & Benali, n.d.). In medical imaging, this new method is often
called “deep learning-based radiomics” or “deep radiomics”. It is hypothesized that deep radiomics
will outperform traditional radiomics, since CNNs are able to extract outcome-related features. Studies
in deep radiomics have started only recently, and more comprehensive investigations are required on
this topic.
1.3.4 Deep learning in medical imaging research
As deep learning develops at a fast pace, a large number of deep learning studies have been published
in the context of medical imaging. Since 2012, more than five hundred papers have been published,
focusing on segmentation, object detection, and exam result classification (Litjens, Kooi, Bejnordi,
Arindra, et al., 2017). The following section provides a brief overview of these studies.
Detecting abnormalities in medical images is routine work for clinicians. However, it is one of the
most labor-intensive tasks (Litjens, Kooi, Bejnordi, Arindra, et al., 2017). To help clinicians work more
efficiently, studies in this area started years ago, when Lo et al. trained a 4-layer CNN for nodule
detection in x-ray images (Litjens, Kooi, Bejnordi, Arindra, et al., 2017; Lo et al., 1995). As an image-
based network, the CNN is currently one of the most popular methods in abnormality detection. In a
recent study, using 224,316 chest radiographs from 65,240 patients, researchers from Stanford University
trained a 121-layer CNN, achieving an AUC of 0.94 for detecting pleural effusion and an AUC of 0.86 for
atelectasis
detection (Irvin et al., n.d.). This CNN-based network was found to have performance similar to that of
human experts (Irvin et al., n.d.). The data for this study have been published, and an increasing number
of research groups are working on this challenge, aiming to improve the performance (Irvin et al., n.d.).
In another recent study, Lakhani et al. trained AlexNet and GoogleNet for pulmonary tuberculosis
detection using 1007 chest radiographs (Lakhani & Sundaram, 2017; R. Yamashita et al., 2018). The
final network achieved AUC of 0.99 for differentiating tuberculosis from healthy cases (Lakhani &
Sundaram, 2017; R. Yamashita et al., 2018). A large-scale study in the Netherlands also confirmed the
potential of CNN-based detection systems (Kooi et al., 2017). Kooi et al. designed a CNN-based
computer-aided diagnosis (CAD) tool using 45,000 mammography images. In abnormality detection,
this model outperformed a traditional feature-based model by a large margin (Kooi et al., 2017).
In lesion detection tasks, CNNs have been applied not only to X-ray, CT, and MR images, but also to
color retina images. On the EyePACS-1 and Messidor-2 datasets, CNNs reached 97.5% sensitivity and
93.4% specificity in diabetic retinopathy detection (Gulshan et al., 2016; Pratt, Coenen, Broadbent,
Harding, & Zheng, 2016). Another large-scale study using 80,000 retina images achieved 75% accuracy
in detecting exudates, hemorrhages, and microaneurysms (Chandrakumar & Kathirvel, n.d.).
In addition to detection, CNNs can also be trained to differentiate or classify abnormalities into different
categories. Image classification is also one of the first areas in which deep learning made a major
contribution to medical image analysis (Litjens, Kooi, Bejnordi, Arindra, et al., 2017). A study
conducted in Japan confirmed the potential of CNNs in subgroup classification. Yasaka et al.
trained a CNN using 55,536 CT images and achieved an AUC of 0.92 for differentiating liver masses
(Yasaka, Akai, Abe, & Kiryu, 2018). In another dataset of 2,000 images, a CNN achieved 90.1% accuracy
in nodule classification, significantly higher than the traditional radiomics approach, which had
an accuracy of 61% (Lai & Deng, 2018).
The studies discussed above clearly benefited from large sample sizes. However, in small-sample
settings, CNNs can still achieve acceptable performance with transfer learning (He, Girshick, &
Dollár, 2018; Pan & Yang, 2010; Yosinski, Clune, Bengio, & Lipson, 2014). In tuberculosis
classification tasks, fine-tuning convolution layers elevated the accuracy rates from 53.4% to 57.6%
(Antony, McGuinness, Connor, & Moran, 2016). Using a similar approach, by fine-tuning an ImageNet
pre-trained model, researchers achieved near-expert performance in skin cancer classification (Esteva
et al., 2017).
In another study, using a pre-trained CNN as a feature extractor, the model achieved 70.5% accuracy in
cytopathology image classification (Kim, Corte-Real, & Baloch, 2016). Radiomics features have also been
added to transfer learning studies: Lao et al. fused transfer learning features with traditional radiomics
features for glioblastoma prognosis and showed improved prognostic performance (Lao et al., 2017). It
has been shown that, compared to training a model from scratch, transfer learning models have superior
performances in terms of accuracy and computation time when the sample size is below 1000 (He et al.,
2018; Menegola, Fornaciali, Pires, Avila, & Valle, 2016). Hence, transfer learning methods will play an
increasingly critical role in future medical imaging research.
For some lesion classification tasks, both local information on lesion appearance and global contextual
information on lesion location are needed (Litjens, Kooi, Bejnordi, Arindra, et al., 2017). To address this
issue, researchers started to develop multi-stream architectures in which several models are built
simultaneously (Yuexiang Li, Shen, Li, & Shen, 2018). Combinations of pre-trained and
trained-from-scratch CNNs can also work together for better performance (Gao, Lin, & Wong, 2015).
Moreover, deep learning models were applied to segmentation and denoising process (Christ et al., n.d.;
Cires¸ancires¸an et al., n.d.; Litjens, Kooi, Bejnordi, Setio, et al., 2017; Oktay et al., n.d.; Razzak et al.,
n.d.; Tajbakhsh et al., 2017). In 2012, Ciresan et al. developed a deep neural network algorithm for
neuron segmentation (Cires¸ancires¸an et al., n.d.). The network was used as a pixel classifier. It took a
square image (patch) as an input and gave a probability of being the neuron membrane for the central
pixel. At the ISBI 2012 conference, this network won the segmentation challenge (Ronneberger,
Fischer, & Brox, 2015). However, there are two major limitations of this network. The first limitation is
computation time. Since the model only provides the probability of a limited number of pixels in a patch
(a square image), segmentation for a large image needs a large number of patches, resulting in a
significant demand for computation power. Secondly, this network has a trade-off between prediction
accuracy and patch size. Small patches have higher accuracies; however, the network can only see a
little context in small patch settings (Ronneberger et al., 2015).
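The computational burden of patch-wise pixel classification can be made concrete with a quick count: one forward pass is needed per labelled pixel, so the number of passes grows with the image area. A back-of-the-envelope sketch, using an illustrative 512×512 slice size rather than a dimension from the cited study:

```python
# Back-of-the-envelope cost of patch-wise pixel classification: one forward
# pass per labelled pixel. The 512x512 slice size is illustrative, not a
# dimension taken from the cited study.

def patches_needed(height, width, stride=1):
    """Number of patches (forward passes) to label every stride-th pixel."""
    rows = (height + stride - 1) // stride
    cols = (width + stride - 1) // stride
    return rows * cols

print(patches_needed(512, 512))  # 262144 passes for one slice
# A fully convolutional network, by contrast, labels the slice in one pass.
```

Even labelling only every second pixel in each direction still requires tens of thousands of passes, which is why fully convolutional architectures such as U-Net replaced this scheme.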
Ronneberger et al. proposed another deep learning architecture called U-Net, which is built upon a fully
convolutional neural network with two paths (Ronneberger et al., 2015). The first path consists of a
traditional convolutional neural network to capture the context in images (Krizhevsky et al., 2012).
Additionally, the second path is used to enable precise localization using transposed convolution. U-Net
outperforms previous deep learning-based segmentation approaches in terms of speed and the
adaptability to small sample sizes (Ronneberger et al., 2015). Because of these characteristics, U-Net has
recently become the most popular segmentation method in medical imaging studies. Table 1.5 below
presents a list of recent segmentation studies using U-Net.
Table 1.5: List of representative segmentation studies in the medical imaging field

Domain | Sample size | Findings | Reference
Pancreas segmentation (CT) | 150 abdominal CT scans | Dice score: 0.840±0.087; inference time: 0.179 s | (Oktay et al., n.d.)
Liver and tumor segmentation (CT) | 100 abdominal CT scans | Dice score: 0.943 | (Christ et al., n.d.)
Rectal cancer (CT) | 278 patients | Dice score: 0.934 (CTV), 0.921 (bladder) | (Men, Dai, & Li, 2017)
Multiorgan segmentation (CT) | 331 contrast-enhanced abdominal CT images | Dice scores: artery 0.79, vein 0.73, liver 0.93, spleen 0.91, stomach 0.84, pancreas 0.63 | (Roth et al., 2017)
Retina blood vessel segmentation | 40 colour retinal images | Dice score: 0.8142 | (Alom, Hasan, Yakopcic, Taha, & Asari, 2018)
Retina blood vessel segmentation | 20 colour retinal images | Dice score: 0.8373 | (Alom et al., 2018)
Retina blood vessel segmentation | 28 colour retinal images | Dice score: 0.7783 | (Alom et al., 2018)
Liver segmentation (CT) | 20 venous phase enhanced CT | Dice score: 0.94 | (Christ et al., 2016)
Pancreas segmentation (CT) | 147 contrast-enhanced abdominal CT scans | Dice score: 0.897±0.038 | (Oda, Shimizu, Oda, et al., 2018)
Pancreas segmentation (CT) | 281 clinical CT | Dice score: 0.739±0.152 | (Oda, Shimizu, Roth, et al., 2018)
Prostate segmentation (DWI) | 104 patients | Dice score: 0.93 | (Clark et al., 2017)
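The "precise localization" path of U-Net upsamples feature maps with transposed convolution. The operation itself is compact enough to sketch in numpy; the kernel here is a placeholder of ones rather than a learned weight:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Transposed convolution: each input pixel scatters a weighted copy of
    the kernel into a larger output map, upsampling the feature map."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # a 2x2 feature map
k = np.ones((2, 2))          # placeholder 2x2 kernel (learned in a real U-Net)
y = transposed_conv2d(x, k)  # -> 4x4 map: each pixel expands to a 2x2 block
```

With stride 2 and a 2×2 kernel the scattered blocks tile exactly, so each input value simply becomes a 2×2 block in the output; with learned kernels and overlapping strides, the decoder recovers spatial detail lost in the pooling path.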
An open data challenge called MICCAI BraTS (Multimodal Brain Tumor Segmentation challenge) has
become a benchmark for deep learning-based segmentation. From 2013 to 2018, model performance
improved rapidly: Dice scores rose from 0.74 to 0.91, demonstrating substantial progress in
automated segmentation for the medical imaging field (Anwar et al., 2018;
Isensee, Kickingereder, Wick, Bendszus, & Maier-Hein, n.d.; X. Zhao et al., 2018).
1.3.5 Future direction
As discussed above, deep learning methods have achieved success in specific medical imaging tasks. However,
deep learning studies are limited by sample size and interpretability. Although it has been shown that a
state-of-the-art deep learning model can be trained using 1,000 samples even without transfer learning,
most available medical image datasets have much smaller sample sizes (Dů et al., n.d.; He et al., 2018;
B. Liu, Wei, Zhang, Yang, & Kong, 2017). Cho et al. from Massachusetts General Hospital conducted a
study on the impact of sample size by training CNNs using six different sample sizes ranging from 5 to
200 (Cho, Lee, Shin, Choy, & Do, 2016). Their results confirmed that, in medical imaging, classification
tasks using CT images generally require a large sample size (n > 200) (Cho et al., 2016). However,
most popular open-source annotated CT datasets have sample sizes smaller than this number (Aerts et
al., 2014; Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al., 2019; Yucheng Zhang et al., 2017). Sample
size problems limit the application of deep learning models in medical imaging research, especially in
studies of diseases with low incidence rates, including PDAC (Ilic & Ilic, 2016; J. Luo, Xiao, Wu,
Zheng, & Zhao, 2013; Siegel, Miller, & Jemal, 2015).
Additionally, interpretation of CNNs is even more challenging compared to radiomics studies where
features are derived from manually defined formulas. Recent studies have attempted to address this
issue. Zeiler et al. found that top layers of CNNs can extract local patterns, while deeper layers combine
them into more meaningful structures (Zeiler & Fergus). To better visualize how CNNs make decisions,
activation maps were developed by Zhou et al. and Selvaraju et al. (Selvaraju et al., 2016; B. Zhou,
Khosla, Lapedriza, Oliva, & Torralba, n.d.). These studies highlighted that activation maps can help
researchers to establish trust for deep learning models and discern a stronger model from a weaker
network (Selvaraju et al., 2016, 2017). Without a doubt, further research in visual explanations would
facilitate the application of deep learning models in medical imaging research.
Compared to deep learning methods, traditional radiomics has been studied for a much longer period of
time, providing a large number of significant features. These findings should not be neglected.
Researchers hypothesize that combining those radiomics features with deep radiomics features will
contribute to a stronger model (Afshar et al., n.d.). In the following studies, we aimed to compare the
effectiveness of radiomics and deep radiomics (transfer learning) models in a small resectable PDAC
sample and find the optimal way of fusing these two information sources for a better prognosis. In the
third study, we modified the loss function in a CNN model, allowing it to provide an accurate prognosis
of PDAC patients at any given timepoint. Focusing on resectable PDAC patients, these studies will be
beneficial for designing personalized treatment plans for them.
Chapter 2: Aim and hypothesis
2.1 Study 1: Prognostic Value of Transfer Learning Based Features in Resectable
Pancreatic Ductal Adenocarcinoma
2.1.1 Aims
The main aim of this study is to validate and compare the prognosis performance of transfer learning
feature (deep radiomics) extractors and traditional radiomics feature bank in two independent resectable
PDAC cohorts. For both cohorts, CT images, annotations, and clinical outcomes were available. We
built three prognosis models for overall survival using an engineered (pre-defined) radiomics feature
bank, PyRadiomics (van Griethuysen et al., 2017), and two transfer learning feature extractors
pre-trained on ImageNet and lung CT images.
The performances of these three models were measured and compared using the area under the receiver
operating characteristic curve (AUC). Lastly, risk scores generated by these models were tested in Cox
Proportional Hazards models, assessing not only binary classification performance but also the ability to
provide an accurate prognosis (Khalvati, Zhang, Baig, et al., 2019).
Building a high-performance prognosis model using CT images will be beneficial for resectable PDAC
patients. An accurate prognosis model can provide valuable survival information for clinicians, assisting
them in designing an aggressive treatment plan for an aggressive tumor, improving the survival rates for
resectable PDAC patients. Furthermore, as an increasing number of studies shift to deep radiomics,
this pioneering study will provide valuable information for choosing appropriate feature banks in other
small sample size studies.
2.1.2 Hypothesis
We hypothesized that the transfer learning model trained on lung CT images will outperform both the
traditional radiomics-based prognosis model and the transfer learning model pre-trained on natural
images. Furthermore, we hypothesized that deep radiomics features from the transfer learning model
can accurately classify patients into low or high-risk groups, helping clinicians to make effective
treatment decisions.
2.1.3 Rationale for hypothesis
It has been shown that, for glioblastoma prognosis, transfer learning models outperformed traditional
radiomics models (Lao et al., 2017). Compared to traditional radiomics studies where features are
pre-defined, the formulas of deep radiomics features can be optimized for specific tasks, leading to
better performance (D. George et al., 2017; Lao et al., 2017; Tan et al., 2018). However, in the medical
imaging domain, most transfer learning studies use the ImageNet pre-trained model to extract features
imaging domain, most transfer learning studies use the ImageNet pre-trained model to extract features
(Shie, Chuang, Chou, & Wu, n.d.; Ravishankar et al.;
Yosinski et al., 2014). ImageNet contains 14 million images, which are colour-scaled and have different
signal-to-noise profiles compared to CT images. Therefore, we hypothesized that a model pre-trained on
medical images, namely Lung CT images, may improve the prognosis performance for resectable PDAC
patients.
2.2 Study 2: Improving Prognostic Performance through Radiomics and Deep
Learning Features Fusion in Resectable Pancreatic Ductal Adenocarcinoma
2.2.1 Aims
The first aim of this study is to identify the relationship between radiomics and transfer learning
features. We were interested in whether any associations exist between radiomics features and deep
radiomics features. Since deep radiomics studies have been criticized for a lack of interpretability,
testing the association between deep features and manually defined features can provide another
perspective. We wanted to test whether a transfer learning feature extractor can capture information
similar to that identified by pre-defined radiomics features.
Secondly, we aimed to find an optimal method to fuse radiomics features and deep radiomics features.
Although transfer learning methods can achieve high performance given limited data, radiomics studies
have been developed for decades. As a result, a large number of radiomics features have been found to
be associated with clinical outcomes. Thus, these radiomics features are still valuable in this transition
period and should not be discarded.
On the other hand, as transfer learning-based feature extractors provide an increasing number of deep
radiomics features, the dimensions of the feature map are expanding at an unprecedented speed. Hence,
finding an optimal method for feature fusion will benefit future studies in this field. To address that, we
have built four fusion models for PDAC prognosis and tested their performance in an independent
validation cohort. If feature fusion improves the prognosis performance, the model will be able to
provide more accurate prognostic information for healthcare professionals, and resectable PDAC patients
will further benefit from this high-performance prognosis model.
2.2.2 Hypothesis
We hypothesized that significant correlations exist between deep radiomics and engineered radiomics
features. Additionally, combining engineered radiomic features with transfer learning-based deep
radiomic features will improve the prognosis performance. Finally, the ensemble-based fusion method will
outperform feature-based fusion in terms of prognosis performance.
2.2.3 Rationale for hypothesis
A recent study confirmed that deep radiomics feature extractors can extract shape and texture
information (Zeiler & Fergus). In addition, several features in the PyRadiomics feature bank were also
designed to extract this information (van Griethuysen et al., 2017). Thus, we hypothesized that there
exist significant associations between radiomics and deep radiomics features. Identifying this association
profile would provide better interpretations for deep radiomics features and facilitate feature fusion as
the next step in building the prognosis model (Gillies et al., 2015; Razzak et al., n.d.).
Additionally, we proposed four feature fusion methods and hypothesized that the model-based feature
fusion method would provide the best overall performance. It has been shown that ensemble methods
are advantageous in alleviating the small sample size problem by incorporating multiple classification
models to reduce the potential of overfitting (P. Yang, Yang, Zhou, & Zomaya, n.d.). In a typical small
sample size setting (n=98), we therefore hypothesized that ensemble-based feature fusion would
outperform other fusion methods.
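The distinction between feature-level and ensemble-level fusion can be contrasted in a short scikit-learn sketch. The matrices below are synthetic stand-ins for the two feature banks, and the classifier choices are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 98                                  # cohort size noted in the text
X_rad = rng.normal(size=(n, 20))        # stand-in engineered radiomics bank
X_deep = rng.normal(size=(n, 20))       # stand-in deep radiomics bank
y = rng.integers(0, 2, size=n)          # stand-in binary survival label

# Feature-level fusion: concatenate the banks and train a single model.
fused = np.concatenate([X_rad, X_deep], axis=1)
feature_fusion = RandomForestClassifier(n_estimators=100, random_state=0)
feature_fusion.fit(fused, y)

# Ensemble-level fusion: one model per bank, then average their risk outputs.
m_rad = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_rad, y)
m_deep = LogisticRegression(max_iter=1000).fit(X_deep, y)
p_ensemble = (m_rad.predict_proba(X_rad)[:, 1]
              + m_deep.predict_proba(X_deep)[:, 1]) / 2.0
```

In the ensemble variant, each base model sees a lower-dimensional input, which is one reason ensembles are hypothesized to be less prone to overfitting in small samples.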
2.3 Study 3: CNN-based Survival Model for Pancreatic Ductal Adenocarcinoma in
Medical Imaging
2.3.1 Aims
In this study, we aimed to extend the application of CNNs in medical imaging from a binary prediction
of survival to a precise prognosis at any given time point. For cancers with poor prognoses including
PDAC, five-year survival rates are low. Hence, a binary prediction of survival provides limited
additional information for clinicians. In contrast, offering a personalized survival probability curve with
respect to time would be more informative. However, traditional probability mapping methods (e.g., the
Cox Proportional Hazards Model) often rely on linearity assumptions, which limit their applications. In
this study, we utilized a modified loss function, built a CNN-based transfer learning survival model
(CNN-Survival), and compared the performance of this model to a traditional radiomics model using the
concordance index.
2.3.2 Hypothesis
We hypothesized that the CNN-Survival model will outperform the traditional radiomics based Cox
Proportional Hazards Model and provide a better mapping for patients’ survival patterns.
2.3.3 Rationale for hypothesis
As discussed above, due to the presence of non-linear activation functions in CNNs (e.g. ReLU),
the output of the CNNs will have a non-linear relationship with the input. Compared to the traditional
Cox Proportional Hazards Model, which relies on linear relationships, the proposed CNN-Survival may
be better suited to complex survival patterns. In addition, although the sample size is limited (n=98), the
kernels in CNN-Survival can be optimized using another source domain through transfer learning. Thus,
we hypothesized that, CNN-Survival will achieve an acceptable performance using a small resectable
PDAC sample.
Chapter 3: Study 1
Title: Prognostic Value of Transfer Learning Based Features in Resectable Pancreatic Ductal
Adenocarcinoma
Authors:
# Name Affiliations
1 Yucheng Zhang 1,2
2 Edrise M. Lobo-Mueller 3
3 Paul Karanicolas 4
4 Steven Gallinger 2
5 Masoom A. Haider 1,2
6 Farzad Khalvati 1,2
Affiliations
1: Department of Medical Imaging, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
2: Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
3: Sunnybrook Research Institute, Toronto, ON, Canada
4: Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON,
Canada.
3.1 Abstract
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive cancers with an extremely
poor prognosis. Radiomics has shown prognostic ability in multiple types of cancer including PDAC.
However, the prognostic value of traditional radiomics pipelines, which are based on hand-crafted
radiomic features alone, is limited due to multicollinearity of features, the multiple testing problem, and
the limited performance of conventional machine learning classifiers. Deep learning architectures, such as
convolutional neural networks (CNNs), have been shown to outperform traditional techniques in
computer vision tasks, such as object detection. However, they require large sample sizes for training,
which limits their development. As an alternative solution, CNN-based transfer learning has shown the
potential for achieving a reasonable performance using datasets with small sample sizes. In this work,
we developed a CNN-based transfer learning approach for prognostication of overall survival in PDAC
patients. The results showed that the transfer learning approach outperformed the traditional radiomics
model on PDAC
cancer prognosis and improve performance beyond what CNNs can achieve using small datasets.
3.2 Introduction
Pancreatic Ductal Adenocarcinoma (PDAC) is one of the most aggressive malignancies with poor
prognosis (Adamska et al., 2017b; Eibl, 2015). In resectable patients, clinicopathologic factors, such as
tumor size, margin status at surgery, and histological tumor grade have been studied as biomarkers for
prognosis (Ahmad et al., 2001; Ferrone et al., 2012). However, many of these biomarkers can only be
assessed after the surgery, and the opportunity for patient-tailored neoadjuvant therapy is lost. Recently,
quantitative medical imaging biomarkers have shown promising results in prognostication of the overall
survival rate for PDAC patients (Eilaghi et al., 2017).
As a rapidly developing field in medical imaging, radiomics is defined as the extraction and analysis of a
large number of quantitative imaging features from medical images including CT or MRI (Aerts et al.,
2014; Khalvati, Zhang, Wong, et al., 2019). Some radiomic features have been shown to be significantly
associated with clinical outcomes including overall survival (OS) or recurrences in different cancer sites,
such as lung, renal cell carcinoma, and PDAC (Haider et al., 2017; Y. Huang et al., 2016; Klawikowski,
Christian, Schott, Zhang, & Li, 2016; V. Kumar et al., 2013; Parmar, Leijenaar, et al., 2015; Yucheng
Zhang et al., 2017). Patients can be further dichotomized using those radiomic features into low-risk and
high-risk groups, guiding clinicians to design personalized treatment plans (Aerts et al., 2014). Although
limited work has been done on radiomics in the context of PDAC, recent studies have confirmed its
potential for discovering new quantitative image biomarkers (Eilaghi et al., 2017).
Despite the recent progress, radiomics analytics solutions have limitations. The first limitation is the
multicollinearity among features. Engineered radiomic features are handcrafted and hence, the
driving equations for many of these features are similar, making them highly correlated. As a result,
if one radiomic feature is found to be predictive (or prognostic) for an outcome (i.e., significant), the
similar features will most likely be predictive as well. Consequently, although a large number of
significant features can be found, they are all highly correlated and fail to explain much of the variation
in the outcomes, leading to poor performances.
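The redundancy argument can be illustrated numerically: two features computed from near-identical formulas behave as noisy copies of the same underlying quantity, so their correlation is close to 1 and the second feature adds little independent information (synthetic values, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=500)                  # stand-in for tumour intensity
feat_a = base + 0.05 * rng.normal(size=500)  # e.g. one texture feature...
feat_b = base + 0.05 * rng.normal(size=500)  # ...and a near-duplicate variant

r = np.corrcoef(feat_a, feat_b)[0, 1]
print(round(r, 3))  # close to 1: the two "features" are nearly redundant
```

If feat_a is found significant for an outcome, feat_b will almost certainly be significant as well, which is exactly the pattern described in the paragraph above.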
The second limitation of radiomics is the multiple testing problem. Since thousands of features are tested
at the same time, the chance of encountering false positives increases substantially. Given a p value
threshold of 0.05, testing 100 sets of random numbers against the survival outcome, one would expect to
see five significant features (Type I error). However, many radiomics studies in the literature did not
perform multiple testing control. Therefore, these studies are considered exploratory, and some of the
identified features may be false positives (V. Kumar et al., 2013). These limitations eventually harm the
performance of radiomics based models. Comparatively, deep learning architectures have been shown to
achieve a promising performance for both diagnosis and prognosis.
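The Type I error arithmetic above can be reproduced by simulation: correlating pure-noise "features" with a random outcome at a 0.05 threshold yields roughly 5% false positives. Pearson correlation is used here as a simple stand-in for the univariate Cox test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_patients, n_features = 100, 1000
outcome = rng.normal(size=n_patients)          # random "survival" values

false_positives = 0
for _ in range(n_features):
    feature = rng.normal(size=n_patients)      # pure noise feature
    _, p = stats.pearsonr(feature, outcome)
    if p < 0.05:
        false_positives += 1

print(false_positives)  # on the order of 0.05 * 1000 = 50 chance hits
```

Since none of these features carry real signal, every "significant" hit is a false positive, which is why multiple-testing control (e.g., false discovery rate correction) matters in radiomics studies.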
One of the most well-known architectures for deep learning (neural network) is the convolutional neural
network (CNN) (Schmidhuber, 2014). A CNN performs a series of convolution and pooling operations
to get comprehensive quantitative information from input images. Compared to hand-crafted radiomic
features that are predesigned and fixed, the coefficients of CNN are modified in the training process.
Hence, the final features generated from a CNN are associated with the target outcomes. It has been
shown that deep learning architectures are effective in different medical imaging-related tasks, such as
segmentation for head and neck anatomy and diagnosis for the retinal disease (De Fauw et al., 2018;
Litjens, Kooi, Bejnordi, Setio, et al., 2017; Nikolov et al., 2018).
However, to train a CNN from scratch, millions of parameters (coefficients) need to be tuned. This
requires a large sample size which is not feasible in most medical imaging studies. As an alternative
deep learning solution, transfer learning may be more suitable for medical imaging-related tasks since it
can achieve a comparable performance using limited amounts of data (Shie, Chuang, Chou, & Wu, n.d.).
Network-based transfer learning is defined as taking images from another domain, such as natural
images (ImageNet), to build a pre-trained model and then applying the pre-trained model to the target
images (e.g., CT images of lung cancer) (Ravishankar et al.). The idea of transfer learning is based on
the assumption that the structure of a CNN is similar to the human visual cortex, as both are composed
of layers of neurons (Pan & Yang, 2009). Top layers of CNNs can extract general features from images,
while deeper layers are able to extract information that is more specific to the outcomes. Moreover,
although typical CNN models contain millions of parameters, most of the coefficients belong to the top
layers. In other words, training the top layers requires a larger dataset, while the deeper layers require
less data. Transfer learning utilizes this property, training top layers using large pre-training datasets
while fine-tuning deeper layers using data from the target domain. For example, the ImageNet dataset
contains more than 14 million images. Hence, pre-training a model using this dataset would help the
model learn how to extract general features using initial layers. Given that many image recognition tasks are
similar, top (shallower) layers of the pre-trained network can be transferred to another CNN model. In
the last step, deeper layers of the model will be trained using the target domain images (Torrey &
Shavlik, n.d.). Since the final (deeper) layers are more target specific, fine-tuning them using the target
domain images may help the model to quickly adapt to the target domain, and hence, improve the
performance.
In the medical imaging field, target data is often small, making it impractical to properly fine-tune the
deeper layers. Consequently, in practice, the top (shallower) layers of a pre-trained CNN can be used as
a feature extractor (D. George et al., 2017; Hertel, Barth, Käster, & Martinetz, 2017; Thomaz et al.,
2017). Given that top layers can capture high-level and informative details from images, passing the
target domain images through these layers allows extractions of features. These features can be further
used to train a classifier for the target domain. This unique process enables building a classifier using a
small target domain.
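The feature-extractor usage described here can be sketched with Keras. The cut layer ("conv3_block4_out"), the 140×140 input size, and weights=None are illustrative assumptions to keep the sketch self-contained and offline; in practice weights="imagenet" would load the pre-trained coefficients, and the exact cut layer may differ from the one used in this study:

```python
# Sketch of using a pre-trained CNN's shallower layers as a fixed feature
# extractor in Keras. Layer name, input size, and weights=None are
# illustrative assumptions, not the study's exact configuration.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.ResNet50(
    weights=None, include_top=False, input_shape=(140, 140, 3))
base.trainable = False  # freeze the pre-trained layers: no fine-tuning

# Cut the network at an intermediate ("shallower") block and expose it.
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer("conv3_block4_out").output)

batch = np.zeros((2, 140, 140, 3), dtype="float32")  # two dummy "patients"
features = extractor(batch).numpy()
pooled = features.mean(axis=(1, 2))  # global average pooling -> one vector each
```

The pooled vectors can then train a conventional classifier (e.g., a Random Forest) on the small target cohort, as done later in this chapter.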
As discussed above, single institution PDAC datasets are often small (e.g., <100 cases) and hence, are
not suitable for training CNNs from scratch or finetuning deep layers. In this study, we evaluated the
prognosis performance of two different transfer learning approaches applied to pre-operative CT scans
for resectable PDAC cases and compared their performance to that of the traditional (engineered)
radiomics feature bank.
3.3 Methods
3.3.1 Dataset
Two cohorts from two different hospitals consisting of 68 (Cohort 1) and 30 (Cohort 2) patients were
enrolled in this retrospective study. All patients underwent curative-intent surgical resection for PDAC
from 2007 – 2012 and 2008 – 2013, respectively, and did not receive neo-adjuvant treatment. Pre-operative
portal venous phase contrast-enhanced CT images were used. Overall Survival was collected as the
primary outcome. To exclude the effect of postoperative complications on the prognosis, patients who
died within 90 days after the surgery were excluded. Institutional review board approval was obtained
for this study from both institutions and the need for written informed patient consent was waived.
An in-house developed Region of Interest (ROI) contouring tool (ProCanVAS (Junjie Zhang, Baig,
Wong, Haider, & Khalvati, 2016)) was used by a radiologist with 18 years of experience who completed
the contours blind to the outcome (overall survival). Following the protocol, the slice with the largest
visible cross-section of the tumor was contoured on the portal venous phase. When the boundary of the
tumor was not clear, it was defined by the presence of pancreatic or common bile duct cut-off and
review of pancreatic phase images (Eilaghi et al., 2017). An example of the contour is shown in Figure
3.1 below.
Figure 3.1. A manual contour of CT scan from a representative patient in cohort 2.
PyRadiomics features were extracted using the ROI defined by the radiologist's contour. For transfer
learning feature extraction, we used the same ROI with zero-padding (140×140 pixels in grey scale).
3.3.2 Radiomics feature extraction
Radiomics features were extracted using the PyRadiomics library (van Griethuysen et al., 2017) (version
2.0.0) in Python. To ensure features were extracted from tumor regions exclusively, voxels with
Hounsfield units below -10 or above 500 were excluded so that the presence of fat and stents would not
affect the feature values. The bin width (number of gray levels per bin) was set to 25. In total, 1428
radiomic features were extracted for both cohorts (Cohort 1 and 2). Table 3.1 lists different classes of
features used in this study.
Table 3.1: List of radiomic feature classes and filters

First-order features | Histogram-based features
Second-order texture features | Features extracted from the Gray-Level Co-Occurrence Matrix (GLCM)
Morphology features | Features based on the shape of the region of interest
Filters | No filter, exponential, gradient, logarithm, square, square-root, local binary pattern
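A minimal numpy sketch of the intensity preprocessing described above. The HU window and bin width come from the text; the discretization formula mirrors PyRadiomics' fixed-bin-width scheme, and the ROI values are toy numbers:

```python
import numpy as np

def preprocess_roi(hu_values, low=-10, high=500, bin_width=25):
    """Drop voxels outside the HU window (fat, stents), then discretize the
    remaining intensities with a fixed bin width (binWidth=25 in this study)."""
    voxels = hu_values[(hu_values >= low) & (hu_values <= high)]
    # Fixed-bin-width discretization: bin index relative to the ROI minimum.
    bins = np.floor((voxels - voxels.min()) / bin_width).astype(int) + 1
    return voxels, bins

roi = np.array([-200.0, -5.0, 40.0, 90.0, 450.0, 900.0])  # toy HU values
voxels, bins = preprocess_roi(roi)
print(voxels)  # fat (-200 HU) and stent (900 HU) voxels removed
```

Fixing the bin width (rather than the number of bins) keeps the gray-level spacing comparable across patients, which matters for texture features such as those from the GLCM.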
3.3.3 Transfer learning
We used two pre-trained transfer learning models: an ImageNet pre-trained ResNet (ImgRes) and a
Lung CT pre-trained ResNet (LungRes) (He et al., 2015). Residual Neural Network (ResNet) is a
state-of-the-art deep learning architecture that achieves high classification performance using 34 layers.
The ResNet model avoids the vanishing gradient problem by adding a direct path between layers,
skipping one or more layers in between. This allows a deeper model with better performance.
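The direct path can be written in one line: the block's output is ReLU(F(x) + x), so even if the learned branch F contributes nothing, the input still passes through the shortcut (a toy fully connected version, not the convolutional original):

```python
import numpy as np

def residual_block(x, weights):
    """y = ReLU(F(x) + x): the identity shortcut adds the input back in,
    so the block only has to learn the residual F(x)."""
    fx = weights @ x                 # stand-in for the conv/batch-norm branch
    return np.maximum(0.0, fx + x)   # skip connection, then ReLU

x = np.array([1.0, -2.0, 3.0])
w = np.zeros((3, 3))                 # a branch that has learned nothing: F(x) = 0
print(residual_block(x, w))          # the input still flows through the shortcut
```

Because gradients also flow through the shortcut unchanged, very deep stacks of such blocks remain trainable, which is what allows ResNet's depth.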
Two datasets were used to pre-train the ResNet model. The first one is ImageNet, an image database
containing 14,197,122 images from 21,841 different categories (Deng et al., 2009). The other dataset is
the Lung Cancer dataset, published on Kaggle with CT images from 888 patients (Armato et al., 2011).
The ImageNet pre-trained ResNet is directly available in Keras 2.0, a Python-based deep learning
library. We trained LungRes from scratch using lung CT images.
Transfer learning can be done in multiple ways depending on the sample size and the relationship
between the pre-trained and target domains (Shie, Chuang, Chou, & Wu, n.d.; Gu Kim, Choi, &
Man Ro, n.d.). As shown in Figure 3.2, when the pre-trained and target domains are similar, the
features are usually extracted from the deeper layers.
Comparatively, when the two domains are different (natural images vs. cancer images), the features are
usually extracted from the shallower layers of the pre-trained network.
Figure 3.2. Workflow for transfer learning studies.
A. When the pre-trained and target domains are different.
B. When the pre-trained and target domains are similar.
As previously discussed, depending on the similarities between the pre-trained domain and target
domains, transfer learning can be performed in different ways. Given that our target domain data (PDAC
CT images) is small and different from ImageNet, for the transfer learning architecture using ImgRes,
features were extracted from a shallower layer (i.e., the 12th layer). For LungRes, since the domains are
similar (CT images from NSCLC and PDAC patients), all the ResNet layers were frozen, and features
were extracted from the final layer (i.e., 34th layer) (Breiman, 2001). In total, 2048 ImgRes and 64
LungRes features were generated.
3.3.4 Feature analysis
To study the feature-wise prognostic value of the different feature banks, a univariate Cox Proportional
Hazards Model was used to test the association between clinical outcomes and individual features.
Features with a Wald test p value smaller than 0.05 were considered significant.
In Cohort 1, three prognostic models were built, one per feature bank, using Random Forest classifiers,
which have a built-in feature reduction algorithm for selecting the best prognostic features
by tuning the number of trees and features at each node. The prognostic values of the three models were
evaluated in Cohort 2 using the area under the receiver operating characteristic (ROC) curve (AUC).
Sensitivity tests were applied to assess the differences between the three ROC curves.
Using these features, these three prognostic models can produce survival probabilities for new patients.
These probabilities can be treated as risk scores and tested for their prognostic power using univariate
Cox Proportional Hazards Model in Cohort 2 (test set). Training and validation datasets were collected
from two different institutions, making the validation process robust and minimizing the potential
overfitting. These analyses were done in R (version 3.5.1) using the "caret," "pROC," and "survival"
packages (Matthias Gamer & Matthias Gamer, 2015; Terry & Therneau, 2018).
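The train-on-Cohort-1, validate-on-Cohort-2 workflow can be sketched with scikit-learn in Python (the study's analysis was done in R with "caret" and "pROC"; the matrices below are synthetic stand-ins for the feature banks, with the cohort sizes from this chapter):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
# Synthetic stand-ins: 68 training / 30 test patients, 64 features each.
X_train, X_test = rng.normal(size=(68, 64)), rng.normal(size=(30, 64))
y_train = (X_train[:, 0] + rng.normal(size=68) > 0).astype(int)  # toy label
y_test = (X_test[:, 0] + rng.normal(size=30) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                       # trained on "Cohort 1"

risk_scores = model.predict_proba(X_test)[:, 1]   # per-patient risk score
auc = roc_auc_score(y_test, risk_scores)          # evaluated on "Cohort 2"
print(round(auc, 2))
# The risk_scores would then enter a univariate Cox model as a covariate.
```

Keeping the test cohort from a different institution, as done here, guards against the overfitting that plagues small-sample radiomics studies.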
3.4 Results
3.4.1 Feature-wise prognostic values
To determine the prognosis value of features from different feature extraction methods, the associations
between individual features and the overall survival were tested using the Wald test in univariate Cox
Proportional Hazards Model in Cohort 1. Among the 1,428 PyRadiomics features, 283 had significant
p values (p < 0.05). Details of these 283 features are listed in Table A-2 in the Appendix. Among the
2,048 ImgRes features, 49 had a p value smaller than 0.05. Lastly, of the 64 LungRes features, only 2
were significant.
It is interesting to observe that with respect to feature-wise performance, the PyRadiomics library has a
higher ratio of significant features than those of ImgRes and LungRes feature banks (0.20 vs. 0.024 and
0.031, respectively). However, a high number of significant features does not necessarily lead to a high-
performance prognostic model since many of these features may be correlated. Thus, testing the
performance of the feature banks on a different dataset (i.e., test) is necessary.
3.4.2 Prognostic model performance
To compare the prognostic performance of each of the feature extraction methods for overall survival
for PDAC patients, the prognostic models were trained using all features extracted from Cohort 1 and
tested in Cohort 2 using a Random Forest classifier. When using the PyRadiomics feature bank, the
Random Forest model yielded an area under the receiver operating characteristic (ROC) curve (AUC) of
0.57. Using ImgRes feature bank, the model achieved an AUC of 0.71. Finally, using LungRes feature
bank, the AUC reached 0.74.
The AUCs of both transfer learning methods are higher compared to that of PyRadiomics. Comparing
the ROC curves using the sensitivity test (DeLong, DeLong, & Clarke-Pearson, 1988), there was no
significant difference between ROCs of PyRadiomics vs. ImgRes and ImgRes vs. LungRes.
Nevertheless, LungRes feature bank had significantly higher performance than that of PyRadiomics
feature bank with a p value of 0.03. This result indicates that the transfer learning model based on lung
CT images (LungRes) significantly improves the prognostic performance of the model compared to
traditional radiomics methods (e.g., PyRadiomics). Figure 3.3 shows the ROC curves for the three models.
Figure 3.3. A: ROC curve using PyRadiomics feature bank only (AUC = 0.57), B: ROC curve with
ImgRes feature bank (AUC = 0.71), C: ROC curve for LungRes feature bank (AUC = 0.74).
3.4.3 Risk score
Risk scores were generated by the three prognostic models for patients in Cohort 2. In a univariate Cox
Proportional Hazards model, the risk scores of the PyRadiomics and ImgRes prognostic models had p
values of 0.23 and 0.253, respectively. The LungRes prognostic model was the best, yielding a p value of
0.0395 for its risk score, indicating that a transfer learning architecture pre-trained on lung cancer images
can produce a prognostic risk score for PDAC patients. The hazard ratios (HR) and confidence intervals
(CI) for the risk scores generated by the PyRadiomics, ImgRes, and LungRes prognostic models were HR =
1.41 (CI: 0.80 – 2.55), HR = 1.31 (CI: 0.81 – 2.12), and HR = 1.78 (CI: 1.34 – 2.35), respectively (Table
3.2). When patients in Cohort 2 are dichotomized into high-risk and low-risk groups using these risk scores,
the LungRes transfer learning prognostic model yields the best separation of survival patterns.
Figure 3.4 shows the Kaplan–Meier plots for the risk scores of the PyRadiomics, ImgRes, and LungRes
prognostic models.
Figure 3.4. Kaplan-Meier plots for OS in Cohort 2.
A. PyRadiomics based risk score (P=0.23)
B. ImgRes based risk score (P=0.253)
C. LungRes based risk score (P=0.0395)
Table 3.2: List of hazard ratios and p values for risk scores for prognostication of overall survival in the
validation cohort
Prognostic Model                p value       Hazard Ratio (HR) and Confidence Interval (CI)
Engineered Radiomic Features    P = 0.23      HR = 1.41 (CI: 0.80 – 2.55)
ImgRes                          P = 0.253     HR = 1.31 (CI: 0.81 – 2.12)
LungRes                         P = 0.0395    HR = 1.78 (CI: 1.34 – 2.35)
Abbreviations: CI: confidence interval; ImgRes: deep transfer learning model pre-trained on ImageNet
(natural images); LungRes: deep transfer learning model pre-trained on lung CT images.
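The risk-group analysis above (dichotomizing Cohort 2 at a risk-score threshold and comparing survival patterns) can be sketched as follows. This sketch uses synthetic survival data and a log-rank test implemented directly in NumPy as a simple stand-in for the group comparison; the thesis itself reports univariate Cox p values and hazard ratios, which this sketch does not reproduce:

```python
import numpy as np
from scipy.stats import chi2

def logrank_p(time, event, group):
    """Two-group log-rank test; returns the chi-square p value."""
    time, event, group = map(np.asarray, (time, event, group))
    O1 = E1 = V = 0.0
    for t in np.unique(time[event == 1]):        # each distinct event time
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()   # deaths at t, both groups
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O1 += d1                                  # observed deaths, group 1
        E1 += d * n1 / n                          # expected deaths, group 1
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = (O1 - E1) ** 2 / V
    return chi2.sf(stat, df=1)

rng = np.random.default_rng(1)
risk = rng.normal(size=30)                          # model risk scores (synthetic)
time = rng.exponential(scale=20, size=30) * np.exp(-0.5 * risk)  # months; higher risk, shorter survival
event = (rng.random(30) < 0.7).astype(int)          # 1 = death observed, 0 = censored
high = (risk > np.median(risk)).astype(int)         # median split into risk groups
p = logrank_p(time, event, high)
print(f"log-rank p = {p:.3f}")
```

In practice the split threshold and the choice of test (log-rank vs. Cox) affect the reported p value, so the synthetic numbers here are not comparable to the study's results.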
3.5 Discussion
In this study, we developed and compared three prognostic models for overall survival in resectable
PDAC patients using the PyRadiomics feature bank and deep radiomics feature banks pre-trained on natural
images and lung CT images. The lung CT pre-trained transfer learning model achieved significantly better
prognostic performance than the traditional radiomics approach. The PyRadiomics feature bank had
a higher proportion of significant features than the two transfer learning feature extractors
(20% vs. 2.4% and 3.1%). However, these features are correlated, and the higher number of significant
features is largely due to multicollinearity among the engineered features. Hence, the majority of
these hand-crafted features carry redundant predictive information (Toloşi & Lengauer, 2011). In
addition, due to the multiple testing problem, some significant features may be false positives and thus
fail to provide prognostic information to the model. These two shortcomings of engineered
radiomic features (multicollinearity and the multiple testing problem) become more acute when a prognostic
model is built using all features. As a result, the final risk score produced by the model is not prognostic
of the outcome (e.g., P = 0.23). The risk score generated by the transfer learning model pre-trained on
natural images is not significant either (P = 0.253). This was expected due to the substantial difference
between natural images and PDAC CT images. The best prognostic performance was achieved by the
transfer learning model pre-trained on lung CT images, with a p value of 0.0395. This indicates that a
pre-trained CNN, acting as a feature extractor, can generate informative features and provide
prognostic information. It is worth noting that the HR for the LungRes risk score is higher than that of
CA19-9 in PDAC prognosis.
This study showed the potential of transfer learning in a typical small-sample setting. If Cohort 1 (PDAC
cases alone) were used to train a CNN from scratch with no pre-training, and the model were then tested
on Cohort 2, the final output would not provide any prognostic value (AUC of ~0.50). Transfer learning,
unlike conventional deep learning methods which need large datasets, can achieve acceptable performance
using a limited number of samples, making it suitable for most medical imaging studies. As the power of
quantitative medical imaging via deep learning is recognized in the research community, imaging
data are rapidly growing. Nevertheless, the amount of data required to train a CNN from scratch and
achieve meaningful results is far beyond the capacities of most existing databases. Thus, transfer
learning can play a key role in applying deep learning to medical imaging studies.
As a powerful prognostic model, deep transfer learning is not limited to predicting binary survival;
it can also be used to predict patients' outcomes for given time intervals (e.g., 5 years). Although we
used the Cox Proportional Hazards Model on the risk score and reported hazard ratios, this was done
as a separate, post hoc step; the final prognostic model itself only provides binary prognostications. In
follow-up studies, we aim to integrate the Cox Proportional Hazards Model into the deep transfer learning
approach to enable simultaneous training of both the Cox Proportional Hazards Model and the transfer
learning model based on binary outcome and survival time data. Such a prognostic model should have
improved performance compared to the existing model, since the features it generates would be
associated not only with the binary outcome but also with the survival duration. Recent work on such
models using conventional CNNs (e.g., DeepSurv (Katzman et al., 2016)) confirms the
potential of the proposed model.
Although deep transfer learning outperforms the engineered radiomics model, one must not assume that
radiomic features should be discarded altogether. In fact, these hand-crafted features have been shown to
be prognostic of survival in different cancer sites (Aerts et al., 2014; Gillies et al., 2015; Parekh &
Jacobs, 2016). Thus, in future studies, using feature fusion techniques that combine engineered radiomic
features with deep transfer learning features has merit. Feature fusion is a technique to fuse two sets of
features while retaining their information (Mangai, Samanta, Das, & Chowdhury, 2010). It has been
shown that feature fusion can further improve the prediction accuracy in image classification tasks (Sun,
Zeng, Liu, Heng, & Xia, 2005). An optimal feature fusion method which combines engineered radiomic
features with deep transfer learning features may further improve the overall performance of the
prognostic model.
One limitation of the present study is the small dataset of the target domain (PDAC). A larger dataset
would allow us to further investigate the effectiveness of transfer learning and whether there exists a
threshold for data size to improve performance for the transfer. In future work, using a larger dataset, we
will address this research question, which will deepen our understanding of deep learning and its
applicability to medical imaging for prognostication of cancer.
3.6 Conclusion
Deep transfer learning has the potential to improve the performance of prognostication for cancers with
limited sample sizes such as PDAC. In our resectable PDAC cohorts, deep transfer learning models
outperformed conventional, engineered radiomic models.
Chapter 4: Study 2
Title: Improving Prognostic Performance through Radiomics and Deep Learning Features Fusion
in Resectable Pancreatic Ductal Adenocarcinoma
Authors:
# Name Affiliations
1 Yucheng Zhang 1,2
2 Edrise M. Lobo-Mueller 3
3 Paul Karanicolas 4
4 Steven Gallinger 2
5 Masoom A. Haider 1,2
6 Farzad Khalvati 1,2
Affiliations
1: Department of Medical Imaging, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
2: Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
3: Sunnybrook Research Institute, Toronto, ON, Canada
4: Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON,
Canada.
4.1: Abstract
Radiomics, as an analytic pipeline for quantitative imaging feature extraction and analysis, has grown
rapidly in the past few years. Recent radiomics studies aimed to investigate the relationship between
tumor imaging features and clinical outcomes. Open-source radiomics feature banks enable the
extraction and analysis of thousands of pre-defined features. On the other hand, deep learning
approaches have also shown their potential in the quantitative medical imaging field, providing even
more imaging features. However, the high dimensionality of features in medical imaging studies has become
an obstacle due to multicollinearity and multiple testing problems. In this study, CT images from
resectable Pancreatic Ductal Adenocarcinoma (PDAC) patients were used to compare the prognostic
performance of common feature reduction and fusion methods. We show that the risk score-based
feature fusion and reduction method significantly improves the prognostic performance for overall
survival in resectable PDAC cohorts, elevating the area under the ROC curve (AUC) from 0.74 to 0.83.
4.2: Introduction
Radiomics features are designed to decode the predictive information in medical images for cancer
patients. As a quantitative approach, radiomics involves the extraction and analysis of quantitative
medical imaging features and establishing correlations between these features and clinical outcomes
such as patient survival (Aerts et al., 2014; Khalvati, Zhang, Wong, et al., 2019; V. Kumar et al., 2013).
Several radiomic features have been found to be significantly associated with various clinical outcomes
in multiple cancer sites such as lung, pancreas, and kidney (Aerts et al., 2014; Eilaghi et al., 2017;
Gillies et al., 2015; Haider et al., 2017; Oikonomou, Khalvati, & et al., 2018; Parmar, Leijenaar, et al.,
2015; Yip & Aerts, 2016).
In the past few years, the pipeline for traditional radiomics analysis has been established (Parekh &
Jacobs, 2016). As discussed in Chapter 1, the traditional pipeline consists of four steps: image
acquisition, segmentation, feature extraction, and model building. The core of traditional radiomics
studies relies on the extraction of a set of engineered and hand-crafted features based on pre-defined
mathematical formulas. These engineered features, which are extracted from regions of interest
annotated by clinicians, have been designed to capture different characteristics of images. For example,
the first order features measure the distribution of pixel intensities while second-order features based on
grey-level co-occurrence matrix (GLCM) extract texture information. Efforts have been made to
standardize the feature banks by implementing open source libraries such as PyRadiomics (van
Griethuysen et al., 2017). In this feature bank, thousands of engineered features from different classes of
features can be extracted from 2D or 3D medical images. These features can be tested for associations
with clinical outcomes such as overall survival, recurrence, or genetic mutations (Mazurowski, 2015).
Several cross-cohort and multi-centre studies have shown that several PyRadiomics features are robust to
different scanners and clinician annotations (Aerts et al., 2014; Khalvati, Zhang, Baig, et al., 2019; B.
Zhao et al., 2016).
Despite the recent progress, the traditional radiomics analytics pipeline has a few drawbacks. First, the
formulas of the features are pre-defined and can be very similar, which leads to high correlations among
different features. As a result, if a feature is found to be significantly associated with a certain clinical
outcome, highly correlated features are more likely to be significant as well. Consequently, while the
high dimensionality of the features increases the complexity and computational power requirements, there is no
corresponding increase in prognostic performance. Second, testing radiomic features one by one
increases the chance of producing false positives. Several radiomics studies lack multiple testing control,
and hence some discovered significant features may be the result of type I errors (Yip & Aerts, 2016).
Third, many hand-crafted features were not specifically designed for medical images and related
tasks. For different medical imaging modalities and tumor phenotypes, hand-crafted features lack the
flexibility to adapt to various images and clinical outcomes. These shortcomings in the traditional
radiomic analytics pipeline have inspired new research which takes advantage of the recent impressive
progress in deep learning and convolutional neural networks to improve the performance of the
predictive models.
Convolutional neural networks (CNNs) are one of the most frequently used deep learning architectures
in computer vision tasks (Krizhevsky et al., 2012). CNNs apply a series of convolution operations on
input images preserving the spatial relationship between pixels and mapping these relationships on to
outputs. During the training phase, parameters of the convolution operations are tuned. Consequently,
the kernels will be updated, so that they can capture information specifically related to the classification
task (e.g., outcome prediction) at hand. In medical imaging, this allows researchers to generate
customized feature maps for specific modality or diseases, and further improves performance (R.
Yamashita et al., 2018). However, training these parameters requires a large sample size, which is
usually not available in a typical medical imaging research setting. To overcome this limitation, transfer
learning-based feature extraction has been proposed (Pan & Yang, 2010).
Transfer learning was developed based on the assumption that the structure of CNNs is similar to the
mechanism of the human visual cortex (Ravishankar et al.). The top layers of a CNN extract general
features from images, while the deeper layers are more specific to the target. Pre-training CNNs on
large image datasets such as ImageNet helps the model learn how to extract general features. Since
many image recognition tasks are similar, the top layers of the network can be transferred to another
target domain (Tan et al., 2018). On the other hand, deeper layers of CNNs extract "higher-order"
information which is associated with the target outcome. Thus, if the target domain is similar to the
pre-trained domain, deeper layers can be transferred to extract features.
In practice, depending on the level of similarity between the target domain and the source domain,
transfer learning features can be extracted from different layers. Training classification models using
those transfer learning features generally requires smaller sample sizes. As discussed above, training a
CNN from scratch requires a large sample size and long computation time. Comparatively, transfer
learning offers a solution and enables the application of CNNs in the medical imaging domain.
Deep learning and transfer learning-based feature extraction have shown promising results in cancer
assessment (Lao et al., 2017). Several radiomic features are widely recognized for their effectiveness in
cancer prognosis as well (Aerts et al., 2014; Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al., 2019;
Oikonomou et al., 2018). Furthermore, it has also been shown that, combining pre-defined features with
deep learning-based features further improved the performance (Lao et al., 2017). Hence, it is crucial to
develop a feature reduction method which can fuse the predictive power of deep radiomics with pre-
defined radiomic features to achieve optimal performance.
Traditional feature reduction methods can be classified into two groups: supervised and unsupervised
feature reduction. The main difference between the two is that, unsupervised methods reduce features
based on the characteristics of features regardless of the outcome. Comparatively, supervised methods
rely on the association between features and the outcome.
Principal Component Analysis (PCA) is a common unsupervised feature reduction method, which uses
an orthogonal transformation to convert a set of observations of possibly correlated variables into a set
of linearly uncorrelated variables, known as principal components (Abdi & Williams, 2010). These
components explain most of the variation in the original features, retaining that information while
reducing the number of features.
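The PCA reduction described above can be sketched with scikit-learn. This is a minimal sketch on synthetic data; the 95% explained-variance threshold is an illustrative assumption, not a value taken from the study:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(68, 300))                   # 68 patients x 300 features
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=68)   # inject a redundant feature

# Standardize, then keep the components explaining 95% of total variance
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
components = pca.fit_transform(Z)
print(components.shape)   # far fewer columns than the original 300
```

Note that PCA never looks at the outcome, which is exactly why it is classed as unsupervised here; the components with the largest variance are not necessarily the most prognostic ones.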
For binary outcomes, supervised feature selection methods usually compare the distributions of features
for the positive and negative groups. If the two groups differ significantly in their values, the feature is
considered predictive. As a supervised method, the Boruta algorithm, a wrapper built around the
Random Forest classification algorithm (Kursa & Rudnicki, 2010), tries to capture all the important
features with respect to an outcome. First, it duplicates the dataset and shuffles the values in each
column, generating random "shadow" features with a distribution similar to that of the original features.
Then, it tests the performance of these random features. The best-performing random feature is set as a
benchmark, and all the real features performing worse than this benchmark are eliminated
(Kursa & Rudnicki, 2010). After several iterations, a set of significant features is generated. Although
the multiple testing issue is inevitable in supervised feature reduction algorithms (Yip & Aerts, 2016),
models based on the Boruta feature selection method are less prone to this problem since it has built-in
multiple testing corrections.
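The shadow-feature idea behind Boruta can be sketched as follows. This is a deliberately simplified, single-pass version written with plain scikit-learn (the study used the actual Boruta implementation, which repeats this comparison over many iterations with statistical testing); all data are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p = 68, 20
X = rng.normal(size=(n, p))
# Only columns 0 and 1 carry signal in this toy outcome
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

# "Shadow" features: each column shuffled independently, destroying any
# association with y while preserving each feature's distribution
shadows = rng.permuted(X, axis=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

real_imp = rf.feature_importances_[:p]
best_shadow = rf.feature_importances_[p:].max()   # benchmark importance
selected = np.flatnonzero(real_imp > best_shadow)  # keep features beating it
print("selected feature indices:", selected)
```

Real Boruta repeats this shuffle-fit-compare loop and applies a binomial test with multiple-testing correction before confirming or rejecting each feature, which is what makes it more robust than this one-shot sketch.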
In this paper, first, we compare the performance of three feature reduction methods: PCA, Boruta, and
Cox Proportional Hazards Model (CPH). These are applied to the combined feature set of pre-defined
and deep radiomic features. We then propose a feature reduction and fusion method, which combines
the predictive power of pre-defined and deep radiomic features and produces a single risk score. Our
results illustrate that the proposed feature fusion and reduction method significantly improves the
performance of the model for the prognostication of overall survival of PDAC patients when compared
to traditional feature reduction models (PCA, Boruta, and CPH).
4.3 Methods
4.3.1 Dataset
Two cohorts from two different hospitals consisting of 30 and 68 patients were enrolled in this
retrospective study. All patients underwent curative-intent surgical resection for PDAC (from 2007 to
2012 and 2008 to 2013, respectively) and did not receive neo-adjuvant treatment. Contrast-enhanced CT images
were obtained pre-operatively. Overall survival data were collected as the primary outcome. To exclude
the effect of post-operative complications on the prognosis, the patients who died within 90 days after
surgery were excluded. Institutional review board approval was obtained for this study from both
institutions. An in-house developed region of interest (ROI) contouring tool (ProCanVAS) was used by
an experienced radiologist (Junjie Zhang et al., 2016). The reader contoured the ROIs blind to the
outcome. A cohort with 68 patients from one institution was used as the training cohort while another
cohort with 30 patients from a different institution was used as the test cohort.
4.3.2 Radiomics Feature Extraction
Pre-defined radiomic features were extracted using the PyRadiomics library (version 2.0.0) in
Python (van Griethuysen et al., 2017). To ensure that features were extracted exclusively from tumor
regions, voxels with Hounsfield unit (HU) values below -10 or above 500 were excluded, eliminating fat
and stents from the feature values. In total, 277 radiomic features were extracted for both cohorts. Details
of these features are listed in Table 4.1 below.
Table 4.1: Number of features extracted from different filters
Image filter    First order    GLCM
Original        18             23
Logarithm       18             23
Square root     18             23
Square          18             23
LBP-2D          56             0
Gradient        18             23
Exponential     16             0
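The HU-based exclusion described in Section 4.3.2 can be sketched in NumPy on a toy volume. In the actual pipeline this thresholding was applied as part of PyRadiomics extraction; the standalone version below only illustrates the masking logic, and the volume, ROI placement, and HU distribution are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
ct = rng.normal(60, 200, size=(5, 64, 64))   # toy CT volume in HU (slices, rows, cols)
roi = np.zeros_like(ct, dtype=bool)
roi[2, 20:40, 20:40] = True                  # clinician-drawn tumor ROI on one slice

# Keep only the soft-tissue HU range inside the ROI, dropping fat (very low
# HU) and stents (very high HU) from the voxels used for feature extraction
valid = (ct >= -10) & (ct <= 500)
tumor_mask = roi & valid
tumor_voxels = ct[tumor_mask]                # values fed to feature extraction
print(tumor_voxels.size, "voxels kept out of", roi.sum())
```

PyRadiomics exposes the same behaviour declaratively via its `resegmentRange` setting, which restricts the mask to an intensity range before features are computed.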
4.3.3 Transfer Learning Feature Extraction
We used two transfer learning models: the ImageNet pre-trained ResNet (He, Zhang, Ren, &
Sun, 2016) (ImgRes) and the lung CT pre-trained ResNet (LungRes) (He et al., 2015). ResNet (He et
al., 2016) (Keras-inception-resnet-v2) was chosen since it is a state-of-the-art deep learning architecture
with high classification performance. By adding shortcut connections, the ResNet model avoids the
vanishing gradient problem and achieves better performance.
Two datasets were used to pre-train the ResNet model. The first one is ImageNet (Deng et al., 2009),
which contains 14,197,122 natural images from 21,841 different categories. The second dataset is Non-
Small Cell Lung Cancer (NSCLC) dataset, which was published on Kaggle with CT images from 888
patients (Aerts et al., 2014). ImageNet pre-trained ResNet was directly available in Keras 2.0 which is a
python- based deep learning library (Chollet & Others, 2015). The LungRes CNN was trained from
scratch using the lung CT images.
The process of transfer learning varies depending on the similarity of the pre-trained domain and target
domain. Since our target domain (pancreatic CT) is small and different from the pre-trained domain
(ImageNet, natural images), during the transfer learning process using ImgRes, features were extracted
from a shallower layer (the 12th layer). For LungRes, since the pre-trained and target domains are rather
similar (lung and pancreatic CT), features were extracted from the final layer before the classifier. In
total, 2,048 ImgRes features and 64 LungRes features were extracted.
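Extracting features from a chosen layer of a pre-trained network can be sketched in Keras. This sketch uses ResNet50 rather than the inception-resnet-v2 model of the study, and `weights=None` (random weights) so it runs without downloading a checkpoint; for actual transfer learning one would pass `weights="imagenet"` or load a lung-CT-trained checkpoint. The layer index 12 mirrors the "12th layer" mentioned above but is illustrative for this architecture:

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

# Backbone without the classification head; global average pooling yields a
# fixed 2048-dimensional vector per image, as in the LungRes setup above
base = ResNet50(weights=None, include_top=False,
                input_shape=(224, 224, 3), pooling="avg")

# Deep features: final layer before any classifier (similar source/target)
deep_extractor = Model(inputs=base.input, outputs=base.output)

# Shallow features: tap an early layer when source/target domains differ
shallow_extractor = Model(inputs=base.input, outputs=base.layers[12].output)

batch = np.random.rand(2, 224, 224, 3).astype("float32")  # stand-in CT slices
deep_feats = deep_extractor.predict(batch, verbose=0)
shallow_feats = shallow_extractor.predict(batch, verbose=0)
print(deep_feats.shape)   # (2, 2048)
```

The choice of tap point is the practical knob described in the text: shallower layers give generic edge/texture responses, deeper layers give representations shaped by the pre-training task.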
4.3.4 Correlation
To investigate the correlation between the features extracted using traditional radiomics pipeline
(PyRadiomics) and transfer learning approaches (ImgRes, and LungRes), Pearson correlation
coefficients were calculated for each pair of feature sets in the training cohort (n=68) (Sedgwick, 2012).
Mean correlation coefficient was calculated for each combination of the three different feature
extraction methods (PyRadiomics, ImgRes, and LungRes). The distributions of the correlation
coefficients were also calculated.
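The cross-bank correlation computation above can be sketched as follows: z-score each bank over patients, form the full cross-correlation matrix by a single matrix product, and average the absolute coefficients. The bank dimensions follow the study (277 PyRadiomics, 2,048 ImgRes features over n = 68 patients), but the values are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
pyrad = rng.normal(size=(68, 277))    # PyRadiomics bank (stand-in values)
imgres = rng.normal(size=(68, 2048))  # ImgRes bank (stand-in values)

def mean_abs_cross_corr(a, b):
    """Mean |Pearson r| over all feature pairs drawn from the two banks."""
    az = (a - a.mean(axis=0)) / a.std(axis=0)   # z-score each feature
    bz = (b - b.mean(axis=0)) / b.std(axis=0)
    r = az.T @ bz / a.shape[0]                  # (277, 2048) correlation matrix
    return np.abs(r).mean()

m = mean_abs_cross_corr(pyrad, imgres)
print(f"mean |r| = {m:.3f}")
```

Within-bank means (the diagonal blocks of Table 4.2) follow from calling the same function with the bank paired with itself, after masking out each feature's trivial self-correlation of 1.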
4.3.5 Proposed Prognosis Model
To investigate the optimal feature reduction and fusion method, we trained four prognosis models using
CT images from Cohort 1 (n=68) and validated them in Cohort 2 (n=30). Figures 4.1-A, 4.1-B, and 4.1-
C show the prognosis models using three traditional feature reduction algorithms: PCA, CPH, and
Boruta. In each model, the three feature banks (PyRadiomics, ImgRes, and LungRes) were concatenated.
Then, the feature reduction algorithm was applied to these features. The remaining features
were used to train a Random Forest classifier on the training cohort, and the derived model was
validated on the test cohort, which was collected at an independent hospital site. For the CPH method, a p
value < 0.05 was used as the feature selector.
Our proposed risk score-based method is illustrated in Figure 4.1-D. First, using the training cohort,
three different Random Forest models were trained separately using each of the three feature banks
(PyRadiomics, ImgRes, and LungRes). Each of these models was then used to produce the probability
for every patient in the training cohort through 10-fold cross-validation. We treated these probabilities as
new features, based on which, the final prognosis model was built through another Random Forest
classifier. In testing, for each patient, three probabilities were generated using the three models. Next,
these probabilities were fed into the final prognosis model, which provided the final risk score.
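The risk score-based fusion described above can be sketched with scikit-learn: one Random Forest per feature bank, out-of-fold probabilities from 10-fold cross-validation as the new "risk-score" features, and a final Random Forest stacked on top. Feature-bank dimensions follow the study, but all values, labels, and hyperparameters are synthetic or illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
y_train = np.tile([0, 1], 34)                       # 68 training outcomes
banks_train = {"PyRadiomics": rng.normal(size=(68, 277)),
               "ImgRes": rng.normal(size=(68, 2048)),
               "LungRes": rng.normal(size=(68, 64))}
banks_test = {k: rng.normal(size=(30, v.shape[1]))  # 30 test patients
              for k, v in banks_train.items()}

base = {k: RandomForestClassifier(n_estimators=100, random_state=0)
        for k in banks_train}

# Out-of-fold probabilities on the training cohort become the new features,
# avoiding the leakage of fitting and predicting on the same patients
train_scores = np.column_stack([
    cross_val_predict(base[k], banks_train[k], y_train,
                      cv=10, method="predict_proba")[:, 1]
    for k in banks_train])

final = RandomForestClassifier(n_estimators=100, random_state=0)
final.fit(train_scores, y_train)

# At test time, each base model (refit on the full training set) contributes
# one probability; the stacked model maps the three scores to a final risk
for k in banks_train:
    base[k].fit(banks_train[k], y_train)
test_scores = np.column_stack([base[k].predict_proba(banks_test[k])[:, 1]
                               for k in banks_train])
risk = final.predict_proba(test_scores)[:, 1]       # final risk per patient
print(risk.shape)
```

Using out-of-fold rather than in-sample probabilities for the stacking step is the standard precaution; otherwise the final model would be trained on overly optimistic base-model outputs.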
Figure 4.1. Pipelines for different feature fusion methods.
A. Unsupervised feature fusion using PCA. Features from the three feature banks are fused using PCA,
generating a few components. These components are then used to build a model in the training cohort,
and the performance of the model is evaluated in the validation cohort.
B. Supervised feature reduction using Boruta. Boruta identifies prognostic features, which are used to
build a prognosis model in the training dataset; its performance is validated in the testing cohort.
C. Supervised feature reduction using Cox regression. Each feature is tested using univariate Cox
regression. Significant features are used to build a prognosis model, which is validated in the
validation cohort.
D. Risk score-based feature fusion. Three prognosis models are built using features from the three
feature banks. The prediction outputs of these models are treated as risk scores, so every
patient has three risk scores. Another model is then trained on these risk scores in the
training set and validated in the testing cohort.
The area under the ROC curve (AUC) (Fawcett, 2005) was used to measure the performance of these four
approaches. The DeLong test was applied to test the difference between the AUCs of
the different models (DeLong et al., 1988). These analyses were performed with the "pROC" package in R
(Version 3.5.1).
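The study ran the AUC comparison with the DeLong test via R's pROC. A simple alternative way to gauge whether two models' AUCs differ on the same test patients is a paired bootstrap, sketched below in Python; note this is a stand-in illustration, not the DeLong test itself, and the scores are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff_p(y, p1, p2, n_boot=2000, seed=0):
    """Two-sided paired-bootstrap p value for AUC(p1) == AUC(p2)."""
    rng = np.random.default_rng(seed)
    y, p1, p2 = map(np.asarray, (y, p1, p2))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample patients with replacement
        if len(np.unique(y[idx])) < 2:          # AUC needs both outcomes present
            continue
        diffs.append(roc_auc_score(y[idx], p1[idx])
                     - roc_auc_score(y[idx], p2[idx]))
    diffs = np.asarray(diffs)
    # How far 0 sits in the bootstrap distribution of the AUC difference
    return min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))

rng = np.random.default_rng(1)
y = np.tile([0, 1], 15)                    # 30 test outcomes
strong = y + rng.normal(0, 0.6, 30)        # better-separating model score
weak = y + rng.normal(0, 2.0, 30)          # noisier model score
p = bootstrap_auc_diff_p(y, strong, weak)
print(f"p = {p:.3f}")
```

The bootstrap is paired because the same resampled patients are used for both score vectors, mirroring the correlated-ROC setting that the DeLong test handles analytically.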
4.4 Results
4.4.1 Correlation Analysis Between Pre-defined and Deep Radiomic Features
Within each feature bank, the average absolute Pearson correlation coefficient of the 277
PyRadiomics features was 0.32, while ImgRes (2,048 features) and LungRes (64 features) had mean
correlations of 0.24 and 0.27, respectively. This shows that PyRadiomics features are more highly
correlated with each other than the deep radiomic features. The cross-correlation of PyRadiomics
and ImgRes features yielded a mean absolute coefficient of 0.18, which was the same for PyRadiomics
vs. LungRes features. The two deep transfer learning-based feature banks (ImgRes and LungRes) had a
slightly higher mean correlation coefficient of 0.22. Table 4.2 below summarizes the correlation results.
Table 4.2: Absolute Pearson correlation coefficient between features from each feature extraction
method
PyRadiomics (277) ImgRes (2048) LungRes (64)
PyRadiomics (277) 0.32 0.18 0.18
ImgRes (2048) 0.18 0.24 0.22
LungRes (64) 0.18 0.22 0.27
Figure 4.2. Correlation heatmap of three different feature extraction methods.
The heatmap in Figure 4.2 shows the correlation details. Each dot represents a correlation
coefficient: white means the coefficient is 0, while red and green dots represent positive and
negative correlations, respectively. There are colour blocks in the PyRadiomics vs. PyRadiomics region,
indicating high correlation among the PyRadiomics features. The colour is lighter in the ImgRes vs.
PyRadiomics and LungRes vs. PyRadiomics regions, showing that the correlation coefficients are lower
across these feature banks.
The distributions of the correlation coefficients (in absolute value) are also displayed as histograms
in Figure 4.3 for PyRadiomics vs. ImgRes and PyRadiomics vs. LungRes. As illustrated by the skewed
distributions, most pre-defined and deep radiomic features have no or weak correlation with each
other. However, a few features are highly correlated, with coefficients above 0.7
(Mukaka, 2012). This result indicates that some deep transfer learning features (deep radiomic
features) can resemble properties of certain pre-defined radiomic features. As an example, the ImgRes
feature “v620” had a correlation coefficient of 0.86 with PyRadiomics feature
“gradient_firstorder_RootMeanSquared”, and 0.83 with “gradient_firstorder_TotalEnergy”.
Figure 4.3. Histogram of Pearson correlation coefficients.
A. Correlation coefficients from PyRadiomics and ImgRes.
B. Correlation coefficients from PyRadiomics and LungRes
4.4.2 Prognosis Performance of the Proposed Prognosis Model
The performances of the three feature reduction methods (PCA, Boruta, and CPH) were compared
to that of the proposed risk score-based prognosis model (see the pipelines in Figure 4.1).
The PCA method generated 41 components to represent the variance in the original 2,389 features of the
combined PyRadiomics, ImgRes, and LungRes feature banks. The Boruta method selected 2 features in
1,000 iterations, with a p value cut-off of 0.1. The CPH method identified 115 features associated with
overall survival in the training cohort: 55 of them belong to the PyRadiomics feature bank, 58 were
extracted using ImgRes, and LungRes contributed another two. The proposed risk score-based model
generated a single risk score using the probabilities of the three Random Forest classifiers trained
individually on the PyRadiomics, ImgRes, and LungRes feature sets. The AUC for each method was
calculated on the test cohort.
The AUCs for the PCA, Boruta, and CPH methods were 0.72, 0.56, and 0.66, respectively. The proposed risk
score-based method produced the highest AUC of 0.83. Comparing the methods using the DeLong
test, the performance of the proposed risk score-based method was significantly higher than that of the
PCA (p value = 0.049), Boruta (p value = 0.0015), and Cox regression methods (p value = 0.015). The
results suggest that a stacking model, which is based on probabilities calculated by multiple individual
small models, gives the best performance. The ROC curves for the three
traditional feature reduction methods (PCA, Boruta, and CPH) and the proposed risk score-based model
are shown in Figure 4.4, and the results are summarized in Table 4.3.
Table 4.3: Summary table for models using four feature reduction methods.
                                               PCA      Boruta     Cox-Regression    Risk-score
AUC                                            0.72     0.56       0.66              0.83
p value (vs. the ROC of the Risk-score method) 0.049    0.0015     0.015             -
Figure 4.4. ROC curves of models using four feature reduction methods.
A. ROC curve for PCA based fusion method, AUC = 0.72.
B. ROC curve for Boruta based feature reduction method, AUC = 0.56.
C. ROC curve for CPH based feature reduction method, AUC = 0.66.
D. ROC curve for risk-score based feature fusion method, AUC = 0.83.
4.5 Discussion
In this study, we proposed a novel risk score-based feature reduction and fusion method for a prognosis
model and compared it to three different feature reduction methods in a PDAC CT setting, using pre-defined
radiomics and deep transfer learning feature banks. We found that the proposed risk score-based
method (a stacked model) had better prognostic performance than the traditional supervised and
unsupervised methods. This result is consistent with previous studies showing that ensemble methods can
outperform traditional machine learning models (Breiman & Leo, 1996; Dietterich, 2000; Rokach,
2005). Although each individual model based on PyRadiomics, LungRes, and ImgRes is not strong, the
final model achieved better performance.
As transfer learning increasingly plays a vital role in medical image analysis, the curse of dimensionality
is becoming more acute in radiomics-based prognosis models (Lao et al., 2017). Supervised feature
reduction methods such as univariate CPH and Boruta have difficulty balancing the false positive rate
and statistical power. When testing 277 features using univariate CPH, the probability of obtaining at
least one false positive is higher than 99%. Hence, supervised feature reduction methods lose their
effectiveness as feature banks continue to grow. In addition, unsupervised methods, including PCA and
Independent Component Analysis (ICA), are not able to boost the prognosis performance due to the
inherent noise in image features. On the other hand, ensemble methods, which use multiple models to
generate risk scores, may overcome these limitations of traditional feature reduction methods.
Additionally, since the risk scores were generated using the non-linear classifier Random Forest, they
are non-linear mappings of the original feature space, which provide a better fit for patients' survival
patterns. In our study, using PDAC CT images, the proposed stacked method achieved significantly
higher AUC compared to other feature fusion and reduction methods, including PCA (p value = 0.049),
Boruta (p value = 0.0015), and Cox regression (p value = 0.015).
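The stacking idea can be sketched as follows: each feature bank gets its own Random Forest, whose out-of-fold predicted probabilities serve as risk scores, and a simple meta-model fuses them. This is a minimal illustration on synthetic data; the cohort size, bank dimensions, and the logistic-regression stacker are assumptions for the sketch, not the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n = 98                                   # illustrative cohort size
y = rng.integers(0, 2, size=n)           # synthetic binary outcome
# Three synthetic feature banks standing in for PyRadiomics, LungRes, ImgRes.
banks = [rng.normal(size=(n, d)) + y[:, None] * 0.5 for d in (40, 64, 64)]

# Level 0: one Random Forest per bank; out-of-fold probabilities act as
# non-linear risk scores and avoid leaking labels into the stacker.
risk_scores = np.column_stack([
    cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                      cv=5, method="predict_proba")[:, 1]
    for X in banks
])

# Level 1: a simple meta-model fuses the three risk scores.
stacker = LogisticRegression().fit(risk_scores, y)
fused_risk = stacker.predict_proba(risk_scores)[:, 1]
```

The out-of-fold step is the key design choice: if the level-0 models scored the same patients they were trained on, the stacker would learn from overfitted probabilities.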
It is worth noting that although most deep radiomics features are independent of engineered
PyRadiomics features, significant Pearson correlation coefficients exist between certain deep
radiomics and PyRadiomics features. This result suggests that the relationship between deep radiomics
and PyRadiomics is complementary. Since most deep radiomics features do not have a linear
relationship with engineered radiomics features, fusing these two feature banks would provide more
information to the prognosis model. On the other hand, the existing correlations between first-order
radiomics and deep radiomics features suggest that, through backpropagation, pre-trained CNNs were
also able to capture associations between first-order features and patients' outcomes.
Although the proposed ensemble method outperforms traditional approaches, it has limitations.
Compared to supervised methods, where specific biomarkers can be identified during the process,
ensemble methods are hard to interpret since the stacked model is based on the outputs (probabilities)
of other models. Although the final prognosis probability (risk score) could, in principle, be expressed
in terms of the original features by using a more interpretable algorithm such as logistic regression
instead of Random Forests, deriving such a formulation would be a complicated task. In addition, the
current models provide binary outcomes as final outputs, ignoring the time to the event. Including the
time duration would further improve the prognosis model.
4.6 Conclusion
We compared the proposed risk score-based prognosis model to three traditional feature reduction
methods and found that the proposed ensemble method has the best performance in prognostication
tasks for resectable PDAC patients, elevating the AUC from 0.74 to 0.83. The proposed model exploits
state-of-the-art deep transfer learning methods and combines them with pre-defined radiomic
features to significantly improve prognostic performance.
Chapter 5: Study 3
Title: CNN-based Survival Model for Pancreatic Ductal Adenocarcinoma in Medical Imaging
Authors:
# Name Affiliations
1 Yucheng Zhang 1,2
2 Edrise M. Lobo-Mueller 3
3 Paul Karanicolas 4
4 Steven Gallinger 2
5 Masoom A. Haider 1,2
6 Farzad Khalvati 1,2
Affiliations
1: Department of Medical Imaging, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
2: Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
3: Sunnybrook Research Institute, Toronto, ON, Canada
4: Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON,
Canada.
5.1 Abstract
Cox proportional hazard model (CPH) is commonly used in clinical research for patient survival
analysis. However, the underlying linear assumption of CPH model limits its performance. In
medical imaging, the radiomics pipeline, which is based on imaging feature extraction and
analysis, is used in combination with the CPH model for survival analysis. Nevertheless, the
multicollinearity of radiomic features and the multiple testing problem further impede the
performance of such models. In this work, a convolutional neural network (CNN)-based survival
model was built and tested in a typical small-dataset setting in resectable PDAC cohorts (n=98).
The CNN-based survival model outperforms the traditional CPH-based radiomics approach in
terms of concordance index by 42%, providing a better fit for patients' survival patterns.
5.2 Introduction
As a statistical method, survival analysis is used in clinical research to identify potential risk factors
or biomarkers for a variety of clinical outcomes including patient survival for different diseases
such as cancer. Cox proportional hazard model (CPH) is one of the most commonly used survival
analysis tools (Fox & Weisberg, 2011; B. George, Seals, & Aban, 2014). CPH is a semiparametric
model that calculates the effects of features (independent variables) on the risk of a certain event
(e.g., death) (Cox, 1972). For example, CPH measures the effect of tumor size on the risk of death.
The CPH-based survival models can help clinicians make more customized (personalized)
treatment decisions for individual patients. However, CPH models assume that the independent
variables (features or biomarkers) make a linear contribution to the model with respect to time. In
many conditions, this assumption oversimplifies the relationship between biomarkers and
outcomes, especially in cancers with poor prognosis. With a limited sample size, violation of the
linear assumption is not obvious and may be overlooked. However, as data sizes increase, violation
of the linear assumption in CPH models becomes increasingly obvious and problematic,
diminishing the reliability of such models (Kattan, Hess, & Beck, 1998).
In most cases, non-linear risk models can provide a better fit for survival function. There are mainly
three types of non-linear survival models: (i) classification methods, (ii) time-encoded methods,
and (iii) risk-prediction methods (Gensheimer & Narasimhan, n.d.; Katzman et al., 2016).
Classification methods address the nonlinearity by using a classifier such as Random Forest
or Support Vector Machine (SVM). Although these classifiers perform well in nonlinear scenarios,
they discard the duration information in modelling, which may lead to unreliable models.
For diseases with poor prognosis, classification methods are also prone to biased predictions due to
imbalanced outcomes (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Time-encoded methods
separate a long time interval into multiple fragments and make predictions for each segment.
However, the performance of time-encoded models is usually not comparable to traditional CPH
models because they are based on multinomial classification and take the duration into account only
partially. Risk-prediction models, which are based on artificial neural networks (ANNs), learn
complex and nonlinear relationships between prognostic features and an individual's risk for a
given outcome. Therefore, ANN-based models can provide improved personalized
recommendations based on the computed risk.
Nevertheless, previous studies have demonstrated mixed results on ANN performance in survival
analysis, showing that in many cases ANNs have not outperformed standard methods for survival
analysis (Mariani et al., 1997; Sargent, 2001; Xiang, Lapuerta, Ryutov, Buckley, & Azen, 2000).
This may be due to small sample sizes and limited feature spaces leading to underfitted ANN
models. To exploit the ANN architecture and successfully apply it to complex cases, larger datasets
are required. Recent work has shown that, given a sufficient sample size, ANNs can, in fact,
outperform traditional CPH survival models (Ching, Zhu, & Garmire, 2018; Gensheimer &
Narasimhan, n.d.; Katzman et al., 2016).
In medical imaging, researchers have been working to extract diagnostic or prognostic features from
medical images in different modalities (V. Kumar et al., 2013; van Griethuysen et al., 2017; Yip
& Aerts, 2016). Efforts have been made to standardize these quantitative imaging (radiomics)
features by implementing open source libraries such as PyRadiomics (van Griethuysen et al.,
2017). These feature banks contain thousands of hand-crafted formulas designed to extract
distribution or texture information. Subsequently, these features are often tested with CPH models
to select significant features and build the final survival model (Y. Huang et al., 2016; Lao et
al., 2017). However, the high-dimensional nature of radiomic features introduces serious issues
in feature reduction and prognosis performance.
Through a standard radiomics feature bank, more than 1,000 features can be extracted from an ROI.
Given the high dimensionality of the features, multiple testing in CPH models becomes a challenge
(Yip & Aerts, 2016). In addition, the proposed feature sets are often highly correlated due to the
similarity of their formulas. Beyond the linear assumption in CPH modelling, this multicollinearity in
the feature space further impedes performance. The limitations of handcrafted radiomic features,
the fact that ANNs can outperform traditional CPH models, and recent advances in deep
learning together motivate designing a novel approach for survival modelling that combines CPH with
state-of-the-art deep learning algorithms for improved performance.
Previous work on deep learning-based survival analysis, including DeepSurv and NNET-survival,
consists of ANN-based survival models with modified loss functions to capture more accurate survival
patterns (Gensheimer & Narasimhan, n.d.; Katzman et al., 2016). These models take features (e.g.,
radiomic features) as input and return risks for patients at a given timepoint. However, as discussed
above, feeding radiomic features into these ANNs as input is not the optimal solution due to the
multicollinearity issue. In this research, we use medical images as input, replacing conventional
feature extractors with a convolutional neural network (CNN) architecture to extract disease-
specific image features associated with survival patterns. We hypothesized that CNNs
will extract more meaningful features and that, combined with a nonlinear loss function, the proposed
approach will provide a better fit for survival patterns.
As the most well-known architecture in deep learning, CNNs recognize imaging features by
applying multiple layers of convolution operations to the images (B, 2013; Litjens, Kooi, Bejnordi,
Setio, et al., 2017), where the weights of the convolution filters are finetuned during training via
the backpropagation process (Horn, Auret, McCoy, Aldrich, & Herbst, 2017). Thus, given sufficient
data, CNNs can be used to extract disease-specific imaging features for
diagnosis or prognosis purposes (Yosinski et al., 2014). Although traditional medical imaging-
based CNNs use a binary or multinomial classification loss function, the loss function can, as in
ANN-based survival models, be modified to capture survival patterns by accounting for
survival duration. By doing so, the CNN can be tuned to extract features that are associated with the
risk of the outcome within a certain duration. We hypothesized that the proposed CNN-based Survival
(CNN-Survival) model with an ANN survival loss function will outperform conventional radiomics
and CPH-based prognosis models.
5.3 Methods
5.3.1 Data
In order to gather sufficient data to train the proposed CNN-Survival model, CT scans along with
patient outcome (survival and time to death) from three cohorts were extracted. Cohort 1 consists
of 422 Non-small cell lung cancer (NSCLC) patients (Ganeshan et al., 2012). We used this data to
pre-train the CNNs since it has the largest sample size. Cohort 2 consists of 68 pancreatic ductal
adenocarcinoma (PDAC) patients and was used to finetune the final layers of the proposed CNN-Survival
architecture. Cohort 3, which is the test data, consists of 30 PDAC patients enrolled at another
independent hospital site (Eilaghi et al., 2017). For all patients in these three cohorts, CT scans,
tumor annotations (contours) performed by radiologists, and survival data were available.
institutions’ Research Ethics Boards approved these retrospective studies and waived the
requirement for informed consent. All methods were carried out in accordance with relevant
guidelines and regulations.
5.3.2 Architecture of the proposed CNN-Survival
A CNN architecture with six convolutional layers (CNN-Survival) was trained using images from
Cohort 1 as shown in Figure 5.1. Input images have dimensions of 140×140×1 (grey scale) and
contain the CT image within the manual contour of the tumor (example shown in Figure 5.2).
The first two convolutional layers have kernel size 3×3 with 32 filters. After a max pooling layer
of size 2×2, features pass through another two convolutional layers with the same kernel size but 64
filters. After another max pooling layer, features go through the final two
convolutional layers, which have 128 filters. To avoid overfitting with this small sample size, dropout
layers were added after every two convolutional layers. Finally, after the flatten layer, each
image was converted into 25,088 features, from which survival probabilities for a given time t were
calculated.
After training with Cohort 1, all the Conv-2D layers of the pre-trained model were frozen as feature
extraction layers. During the transfer learning process, the last dense layer was finetuned with
PDAC images from Cohort 2.
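As a sanity check on the stated flattened feature count, the tensor shapes can be traced through the network. The sketch below assumes unpadded (valid) 3×3 convolutions and a 2×2 max pool after each pair of convolutional layers (three pools in total; this pooling layout is an assumption based on Figure 5.1, since the text names only two pools). Under these assumptions the flattened size works out to the 25,088 features quoted above.

```python
def conv3x3_valid(size: int) -> int:
    # A 3x3 convolution without padding shrinks each spatial dim by 2.
    return size - 2

def maxpool2x2(size: int) -> int:
    # A 2x2 max pool halves each spatial dimension.
    return size // 2

size = 140  # input is 140x140x1 (grey scale)
for n_filters in (32, 64, 128):
    size = conv3x3_valid(size)   # first conv of the pair
    size = conv3x3_valid(size)   # second conv of the pair
    size = maxpool2x2(size)      # assumed pool after every conv pair

# 140 -> 68 -> 32 -> 14 spatially; the last block has 128 filters.
flattened = size * size * 128
# flattened == 25088, matching the feature count stated in the text
```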
5.3.3 Loss Function
To better fit the distribution of survival data, a modified loss function, proposed in (Gensheimer
& Narasimhan, n.d.), was applied to the CNNs architecture (Equation 1).
\[
\text{loss} \;=\; -\sum_{i=1}^{d_j} \ln\!\left(h_j^{\,i}\right) \;-\; \sum_{i=d_j+1}^{r_j} \ln\!\left(1 - h_j^{\,i}\right) \tag{1}
\]
In the formula above, h_j^i is the hazard probability for individual i during time interval j, r_j is
the number of individuals "in view" during interval j (i.e., those who survived to the start of this
period), and d_j is the number of patients who suffered a failure (e.g., death) during this interval.
The overall loss function is the sum of the losses for all time intervals (Gensheimer & Narasimhan, n.d.).
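Equation 1 can be implemented directly for a single time interval; the sketch below is a minimal numpy version (the hazard values are illustrative, not from the study).

```python
import numpy as np

def interval_loss(hazards_died, hazards_survived):
    """Negative log-likelihood for one time interval j (Equation 1).

    hazards_died: predicted hazard h_j^i for the d_j patients who
        failed during interval j.
    hazards_survived: predicted hazard for the remaining patients
        "in view" during the interval (indices d_j+1 .. r_j).
    """
    hazards_died = np.asarray(hazards_died, dtype=float)
    hazards_survived = np.asarray(hazards_survived, dtype=float)
    return (-np.sum(np.log(hazards_died))
            - np.sum(np.log(1.0 - hazards_survived)))

# Illustrative interval: one death predicted with hazard 0.8,
# one survivor predicted with hazard 0.1.
loss_j = interval_loss([0.8], [0.1])   # -ln(0.8) - ln(0.9)
```

The total training loss is simply the sum of `interval_loss` over all time intervals; confident predictions in the right direction (high hazard for deaths, low hazard for survivors) drive the loss toward zero.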
5.3.4 Training process and Transfer Learning
Training a CNNs-based survival model needs to finetune a large number of features. Given this
simple CNNs architecture, there were 1,091,699 trainable parameters. As such, the larger dataset,
cohort 1, was used to pre-train the network. In the cohort, 422 patients had 5,479 slices containing
manually contoured tumor regions. However, the region of interest (ROI) on some of the slices
were so small as shown in Figure 5.3. To solve this, we rank those slices using their ROI size and
picked the top 3,000 slices.
These 3,000 slices were fed into the CNN model. After training the initial model, all the weights
in the pre-trained model were frozen except for the dense layers. Then, 68 patients from Cohort 2
were used to finetune these two dense layers, which contain 627 parameters. Although the PDAC and
NSCLC images are both from CT, the two diseases have different survival patterns; hence, we
hypothesized that transfer learning with finetuning of the final layers was necessary to optimize the
performance of the network for modelling PDAC survival.
5.3.5 Traditional Radiomics analytic pipeline
In Cohort 2 and Cohort 3, 2D radiomic features were extracted from the manually contoured
regions using the PyRadiomics library (version 2.0), generating 1,676 features in total (van
Griethuysen et al., 2017). These features were selected using lasso-CPH (Tibshirani, 1997) in
Cohort 2. The significant features were then tested in Cohort 3. Performance was measured by the
concordance index, which was compared with that of CNN-Survival.
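Lasso-CPH couples an L1 penalty with the Cox partial likelihood and is typically fit with a dedicated survival library. As a dependency-light illustration of the selection step alone, the sketch below applies scikit-learn's plain `Lasso` to synthetic data (all sizes, features, and the outcome are invented) and keeps the features with non-zero coefficients; it is a stand-in for the survival version, not the study's actual fit.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_patients, n_features = 68, 50          # invented sizes for illustration
X = rng.normal(size=(n_patients, n_features))
# Synthetic outcome driven by only two of the fifty features.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=n_patients)

# The L1 penalty drives most coefficients exactly to zero,
# performing feature selection as a side effect of fitting.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)   # indices of surviving features
```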
Figure 5.1 The proposed CNN-Survival architecture: a six-layer CNN with batch normalization (BN)
and max pooling layers. Three dropout layers control potential overfitting.
Figure 5.2 Example of the input CT images
Left: NSCLC tumor from Cohort 1. Right: PDAC tumor from Cohort 2.
Figure 5.3 Example of the small ROI in Cohort 1
5.4 Results
When pre-training the proposed CNN-Survival with Cohort 1 at a learning rate of 0.0001, the loss
decreased significantly within the first ten epochs, where the losses of the training and validation
sets converged as shown in Figure 5.4. In the transfer learning process, the training and testing
losses also converged very quickly, reaching the same level as those of the pre-trained model.
Figure 5.4 Loss changes during pre-training
The concordance index (CI) was used to measure the fit of the survival function. In Cohort 1, CNN-
Survival achieved a CI of 0.684. In testing (Cohort 3), the proposed model achieved a CI of 0.628,
whereas the traditional radiomics approach yielded a CI of 0.442 using lasso-Cox regression. This
result indicates that the CNN-Survival model provided a better fit for survival patterns compared to
the conventional radiomics analytic pipeline.
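The concordance index counts, over all usable patient pairs, how often the predicted risks are ordered consistently with the observed survival times. A minimal implementation for right-censored data (the example values are synthetic) might look like:

```python
import itertools

def concordance_index(times, events, risks):
    """Concordance index for right-censored data.

    times: observed time per patient; events: 1 = death observed,
    0 = censored; risks: predicted risk (higher = worse prognosis).
    A pair is usable when the patient with the earlier time has an
    observed event; it is concordant when that patient also has the
    higher predicted risk. Tied risks count as 0.5.
    """
    concordant, usable = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        # Order the pair so patient a has the earlier time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or events[a] == 0:
            continue  # tied times or earlier patient censored: skip
        usable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return concordant / usable

# Risks perfectly anti-ordered with survival (shorter time, higher risk).
ci = concordance_index([2, 5, 9], [1, 1, 1], [0.9, 0.5, 0.1])
# ci == 1.0; a random risk score would score about 0.5
```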
Table 5.1: Concordance index results of the two approaches

                          Cohort 1    Cohort 3
CPH                       -           0.442
Proposed CNN-Survival     0.684       0.628
As discussed above, CNN-Survival can depict the survival probability of a patient at a given time.
We plotted the survival probability curves of two patients (one who survived versus one who died)
in the testing cohort in Figure 5.5 and Figure 5.6.
Figure 5.5: Survival probability curve generated by the CNN-Survival for a patient who died 511
days after CT using the ROI from CT scans
Figure 5.6: Survival probability curve generated by the CNN-Survival for a patient who survived
2415 days from CT using ROI from CT scans
5.5 Discussion
Using the proposed CNN-Survival model, prognosis performance is further improved. Deep
learning networks provide flexibility in modifying the dimension of the feature space and the loss
function, enabling us to extract disease-specific features and build more precise models. Using a
CNN-based survival model, we showed that, with the help of transfer learning, deep learning
architectures can outperform the traditional pipeline in a typical small-sample-size setting when
modelling survival for PDAC patients. The proposed transfer learning-based CNN-Survival model
has great potential. For example, researchers could pre-train a model using images from common
cancers with larger datasets and transfer this model to target rare cancers. A transfer learning-based
CNN-Survival model mitigates the need for a large sample size, allowing the model to be applied
to a wide range of cancer sites.
The proposed CNN-Survival model provides better performance than the traditional
radiomics analytic pipeline. With the modified loss function, CNN-Survival does not rely on the
linear assumption, making it suitable for more real-world scenarios. In the testing cohort, the
proposed CNN-Survival achieved a concordance index of 0.628. Although there was no prior work
in the PDAC field, the concordance index of our proposed CNN-Survival is comparable to
typical CIs for biomedical applications (Schmid, Wright, & Ziegler, 2016).
In this research, due to the small sample size of the PDAC cohorts, the proposed CNN was not optimal.
We used CT images from 68 patients to finetune the pre-trained CNN-Survival and tested it on another
30 patients. Although, through transfer learning, most of the parameters were trained using the pre-
training cohort, a large number of parameters still needed to be adjusted through finetuning.
Consequently, the small sample size may hamper this process; if a larger dataset were available,
performance might be further improved. Additionally, the pre-training domain is CT images from
NSCLC patients. Although it is the largest open source dataset we could find, non-small cell lung
cancer has a different biological background and survival patterns compared to PDAC. In future
research, using a more similar pre-training domain and a larger finetuning cohort, further improvement
can be expected.
5.6 Conclusion
The proposed CNN-based survival model outperforms the traditional radiomics pipeline in PDAC
prognosis. This approach offers a better fit for survival patterns based on CT images and overcomes
the limitations of conventional survival models.
Chapter 6: General Discussion
6.1: Study 1
6.1.1 Discussion
In the past few years, a large number of prognostic radiomics markers have been identified for
different types of cancers (Cozzi et al., 2019; Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al.,
2019; D. Kumar et al., 2015; V. Kumar et al., 2013; van Griethuysen et al., 2017; Yucheng
Zhang et al., 2017). However, radiomics-based prognosis models have often displayed limited
performance (Aerts et al., 2014; Yucheng Zhang et al., 2017). Although a large number of
features have been identified, most of these features are highly correlated. In Pancreatic Ductal
Adenocarcinoma (PDAC) prognostication, it has been found that “dissimilarity” and “inverse
difference normalized” are significantly associated with clinical outcomes (Eilaghi et al., 2017).
However, these two features are reciprocal to each other. Under this condition, a model built
using "dissimilarity" alone would have similar performance to a model with both features, since
the additional feature fails to add any further information. In practice, the multicollinearity of
radiomic features harms the performance of prognosis models.
Transfer learning methods have shown potential in image recognition tasks, especially in
studies with small sample sizes (Chuen-Kai Shie, Chung-Hisang Chuang, Chun-Nan Chou,
Meng-Hsi Wu, n.d.; Ravishankar et al.). In the medical imaging domain, it has been shown that
prognosis models using transfer learning methods achieve better performance (Lao et al., 2017).
However, most transfer learning models used in medical imaging are ImageNet pre-trained.
As a natural image database, ImageNet contains color images with R, G, B channels. In order
to apply ImageNet pre-trained models, researchers often copy grey-scale medical images into
the R, G, B channels. This is not optimal, since the color information in these channels is an
important feature in ImageNet pre-trained CNNs. Compared to natural images, medical images
from CT or MR have different signal-to-noise profiles. A kernel that can extract texture
information from natural images may lose its effectiveness on medical images. Thus, directly
adopting ImageNet pre-trained models may not be the optimal method.
In this study, we trained three prognosis models for PDAC, using the PyRadiomics based model,
the ImageNet pre-trained ResNet, and the Lung CT image pre-trained ResNet models. The first
two feature banks are commonly used in previous medical imaging studies. We have shown that,
for PDAC prognosis tasks, Lung CT image pre-trained ResNet provides the most informative
features and significantly higher prognosis performance compared to the other two feature banks.
This result suggests that medical imaging-based pre-trained CNNs may serve as high-
performance feature banks for future studies in cancer prognosis. Additionally, building an open-
source medical imaging pre-trained CNN would potentially benefit further medical imaging
studies for prognostication of cancer.
6.1.1.1 Feature analysis
It has been shown that radiomics features have significant associations with the overall survival
in resectable PDAC patients (Eilaghi et al., 2017; Khalvati, Zhang, Baig, et al., 2019). In this
study, we found that, among 1,428 radiomics features, 283 features had significant associations
with OS. However, due to the large number of tests, the multiple comparison problem becomes
unavoidable: after FDR or Bonferroni control, none of those features remained significant. To
reduce the number of comparisons, Khalvati et al. implemented an intraclass correlation (ICC)
filter before the feature analysis (Khalvati, Zhang, Baig, et al., 2019). However, the ICC filter
method requires at least two readers, which is not feasible for many studies. To address this issue,
multi-center studies are needed to eliminate unstable features and fundamentally reduce the
number of tests.
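The scale of the problem is easy to quantify: with m independent tests at α = 0.05, the probability of at least one false positive is 1 − 0.95^m. The sketch below computes this family-wise error rate for the 1,428 features tested here and applies Bonferroni and Benjamini-Hochberg (FDR) corrections to a handful of illustrative p-values (pure numpy; the p-values are invented for the example).

```python
import numpy as np

m_tests = 1428                        # number of radiomic features tested
fwer = 1 - 0.95 ** m_tests            # P(at least one false positive)
# fwer is essentially 1 for m this large

def bonferroni(pvals, alpha=0.05):
    # Reject only p-values below alpha divided by the number of tests.
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of discoveries under the BH (FDR) procedure."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    n = len(pvals)
    # Largest k with p_(k) <= (k/n) * alpha; reject hypotheses 1..k.
    below = ranked <= (np.arange(1, n + 1) / n) * alpha
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.015, 0.04, 0.2, 0.9]   # illustrative p-values
bonf = bonferroni(pvals)                  # only 0.001 survives alpha/m
bh = benjamini_hochberg(pvals)            # BH, less conservative, also keeps 0.015
```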
For transfer learning features, the ImageNet pre-trained feature extractor generated 49
significant features, while the Lung CT based feature extractor produced two significant features.
Although the number of significant features was below that of PyRadiomics, as discussed above,
comparing the number or ratio of significant features is not appropriate. It has been shown that a
high number of features does not necessarily lead to a high-performance prognosis model
(Parmar, Grossmann, et al., 2015; Yucheng Zhang et al., 2017). Hence, comparing prognosis
performance on the same, independent validation cohort should be the gold standard for
comparing the performance of different feature banks.
6.1.1.2 Prognostic model performance
This study is one of the first to test the performance of prognosis models in
resectable PDAC cohorts (Khalvati, Zhang, Baig, et al., 2019). In other types of cancers,
radiomics-based prognosis models achieve AUC ranging from 0.55 to 0.9 (Hawkins et al., 2016;
Huynh et al., 2016; Parmar, Grossmann, et al., 2015; van Griethuysen et al., 2017; Yucheng
Zhang et al., 2017). Using the traditional radiomics analytic pipeline proposed by Zhang et al.,
the PyRadiomics prognosis model achieved an AUC of 0.57 in the validation cohort (Yucheng Zhang et
al., 2017). This AUC is lower than that of other radiomics studies, and the sample size limitation may
contribute to the lower performance. Given that our training set had only 68 patients, significantly
fewer than most radiomics studies, the prognosis model may be undertrained (V.
Kumar et al., 2013; van Griethuysen et al., 2017).
For the transfer learning models, ImgRes achieved an AUC of 0.71 and LungRes an AUC of 0.74.
Although the transfer learning feature extraction produced a smaller number of significant
features, their prognosis performance was significantly higher than that of the PyRadiomics
model (AUC = 0.57). Hence, future research on radiomics should not only report the
significance of image features but also the amount of variation they explain. We have
shown that the LungRes-based prognosis model performed significantly better than the
PyRadiomics model. However, due to sample size limitations, we did not have enough
statistical power to test whether a significant difference exists between the ImgRes and LungRes
models. Further research should concentrate on this issue.
6.1.1.3 Risk score
Many radiomics studies proposed image feature based risk scores for different types of cancers
(Cozzi et al., 2019; Khalvati, Zhang, Baig, et al., 2019; Lao et al., 2017). Risk scores can be
derived from logistic regression or other parametric or semi-parametric methods. Patients can
then be divided into low-risk and high-risk groups using the median of the scores. Finally, the Cox
Proportional Hazard model is often used to test whether a significant difference in survival
patterns exists between the two groups. Compared to a binary prognosis model, risk score
analysis takes the duration into account and presents it in Kaplan-Meier curves, which are more
informative and interpretable (Cozzi et al., 2019; Lao et al., 2017).
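The Kaplan-Meier curves behind this analysis come from the product-limit estimate: at each event time t_i, the running survival is multiplied by (1 − d_i/n_i), where d_i is the number of deaths at t_i and n_i the number still at risk. A minimal sketch for one risk group (the times and censoring flags are synthetic):

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times: observed time per patient; events: 1 = death, 0 = censored.
    Returns (event_times, survival_probabilities).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, probs = 1.0, []
    for t in event_times:
        at_risk = np.sum(times >= t)                    # n_i: still observed
        deaths = np.sum((times == t) & (events == 1))   # d_i: deaths at t
        surv *= 1.0 - deaths / at_risk
        probs.append(surv)
    return event_times, np.array(probs)

# Synthetic group: deaths at t = 1, 2, 4; one patient censored at t = 3.
t, s = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
# s == [0.75, 0.5, 0.0]: the censored patient leaves the risk set
# after t = 3 without registering a death
```

Computing this curve separately for the low-risk and high-risk groups, and comparing them with a log-rank test, reproduces the standard risk-score analysis described above.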
In this study, we produced three sets of risk scores using models trained from three different
feature banks. We discovered that the PyRadiomics- and ImgRes-based risk scores are not
significantly associated with overall survival in the independent validation cohort. In
contrast, the LungRes-based risk score had a significant p value. This suggests that medical
image pre-trained CNNs are not only able to provide binary survival predictions but are also
capable of offering precise prognosis with respect to time.
6.1.2 Strengths and limitations
6.1.2.1 Strengths
Although previous studies have identified a few radiomics features for PDAC prognosis, the
performance of those features had not been validated (Eilaghi et al., 2017; Khalvati, Zhang, Baig,
et al., 2019). In this study, we tested the performance of the PyRadiomics feature bank using
two independent cohorts and demonstrated the limitations of the current radiomics pipeline.
Lack of validation is a common problem in radiomics studies. We are one of the first groups to
provide cross-center validation for radiomics features in the context of resectable PDAC. Cross-
center validation produces more reliable results and should become a standard protocol in future
radiomics studies.
In addition to using independent validation cohorts, in this study we highlighted that transfer learning
methods provide prognostic features for OS in PDAC patients. Additionally, we confirmed that
a transfer learning model can achieve comparable prognosis performance with a small sample
size (n < 100). Transfer learning methods have enormous potential for rare diseases or studies
using limited datasets. Our study confirmed that, in prognostication, the medical image pre-trained
model has comparable or higher performance than the ImageNet pre-trained model, even though
the medical image pre-trained model was tuned with fewer than 1,000 images. Hence, medical
image pre-trained CNNs are remarkably valuable for medical imaging studies, providing
high-performance feature banks and working well in small-sample settings.
6.1.2.2 Limitations
As discussed above, one of the limitations of this study was its sample size. Due to the small
validation cohort (n=30), we were unable to compare the prognosis performance of the ImgRes
and LungRes models in terms of AUC. Although we found a significant difference in the risk
score analysis, the ROC curve comparison between LungRes and ImgRes was not significant.
Further research with a larger dataset is required to investigate the performance of these transfer
learning models.
Another limitation of this study is the lack of feature fusion. It has been shown that radiomics
and transfer learning features can be fused to achieve better performance (Lao et al.,
2017). Further investigation is needed to find an optimal way of fusing radiomics and transfer
learning features.
6.1.3 Implications
Compared to radiomics features, whose formulas are manually defined, transfer learning features
are target-specific, which leads to high performance in prognosis models. In
image-related tasks, the ImageNet pre-trained model has been very popular. However, the unique
nature of medical imaging requires a separately well-trained transferable model. Hence, follow-up
research should focus on developing a medical image pre-trained model, which will benefit future
prognosis studies.
Additionally, as more studies adopt transfer learning methods, feature dimensionality has
expanded at a very fast pace. Investigating the relationship between radiomics and
transfer learning features and finding the optimal way to reduce features are urgent issues for
subsequent studies.
6.2: Study 2
6.2.1 Discussion
Radiomics is a rapidly evolving field of study. In the past decade, feature size has expanded from
less than one hundred to a few thousand (van Griethuysen et al., 2017). With the addition of
transfer learning features, feature dimensions will continue to grow (D. George et al., 2017). As
the number of features increases, the new features are expected to contain extra information that was
not available in previous feature banks. However, rapidly growing feature banks worsen the multiple
comparison and multicollinearity problems that already exist in the current radiomics pipeline.
Further studies are required to investigate the relationship between engineered PyRadiomics
features and transfer learning features. Will transfer learning feature extractors produce features
similar to PyRadiomics, or will the two feature extractors capture entirely different information?
Answers to these questions will contribute to this evolving field in its transition period. Furthermore,
it was hypothesized that feature fusion would improve prognosis performance since it adds
more information. However, identifying the optimal way of fusing features is still required.
In this study, we extracted features using the PyRadiomics, the ImgRes and the LungRes feature
banks, investigated their correlations, proposed a risk score-based fusion method, and compared
their performances with those of other feature fusion methods. We found that the risk score-
based fusion method provides the best prognosis performance. It has been shown that building
multiple models improves classification accuracy, and our results confirmed this in the
context of medical imaging (P. Yang et al., n.d.).
6.2.1.1 Correlation between radiomics and transfer learning features
Multicollinearity is a major limitation in radiomics studies (Parmar, Grossmann, et al., 2015;
Yucheng Zhang et al., 2017). In this study, we highlighted that, on average, PyRadiomics features
are highly correlated with one another. Several feature pairs have correlation coefficients
higher than 0.7, indicating strong correlations. For example, "gradient_glcm_ClusterTendency"
and "gradient_firstorder_variance" have a correlation coefficient of 0.998. These high
correlations may explain the weak performance of the PyRadiomics feature bank in prognosis tasks
without proper preprocessing. Comparatively, the ImgRes- and LungRes-generated features have
lower average correlation coefficients.
In addition, we calculated the correlations between these three feature banks. We showed that
most transfer learning features and PyRadiomics features had no or weak associations with each
other, suggesting that feature fusion may provide additional information to the prognosis model
and improve its performance. However, a few feature pairs did show strong associations. We
noticed that first-order radiomics features generally have higher correlations with transfer
learning features. This result suggests that some transfer learning features are able to capture
information similar to that of first-order PyRadiomics features: through backpropagation, the
CNN learned the benefit of capturing first-order information and updated its kernels
accordingly.
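The cross-bank analysis described above reduces to computing a correlation coefficient for every pair of features drawn from two banks. A minimal sketch with NumPy, using random matrices in place of the actual feature banks (the patient and feature counts here are illustrative):

```python
import numpy as np

def cross_bank_correlation(bank_a: np.ndarray, bank_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every feature in bank_a (n x p)
    and every feature in bank_b (n x q); returns a p x q matrix."""
    a = (bank_a - bank_a.mean(axis=0)) / bank_a.std(axis=0)
    b = (bank_b - bank_b.mean(axis=0)) / bank_b.std(axis=0)
    return a.T @ b / a.shape[0]

rng = np.random.default_rng(0)
pyrad = rng.normal(size=(98, 5))      # e.g. 98 patients, 5 PyRadiomics features
transfer = rng.normal(size=(98, 3))   # 3 transfer learning features
corr = cross_bank_correlation(pyrad, transfer)
print(corr.shape)  # (5, 3), entries in [-1, 1]
```

Strong off-diagonal entries in this matrix would flag transfer learning features that duplicate engineered ones, such as the first-order features noted above.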
By investigating the correlations among the three feature banks, we now have a better
understanding of their associations. PyRadiomics features and transfer learning features should
be considered complementary rather than replacements for each other. Although the application of
transfer learning will become increasingly common in medical imaging studies, engineered
radiomics features remain valuable and should not be discarded.
6.2.1.2 Prognosis performance for different fusion methods
In previous radiomics studies, it has been shown that feature fusion can improve a model's
performance (Parmar, Grossmann, et al., 2015; Yucheng Zhang et al., 2017). Inspired by
multi-stream CNNs, we proposed a new risk score-based fusion method and compared its performance
in PDAC prognostication to that of other feature fusion methods. This model achieved the highest
AUC in the prognosis task (AUC = 0.83).
It should be noted that, given the fast-growing nature of feature banks, supervised feature
selection methods may fail to provide informative guidance due to the multiple comparison
problem. In our study, the three feature banks generated more than two thousand features.
Testing these features against the clinical outcome individually, there is a greater than 99%
chance of obtaining at least one false positive. Additionally, the multicollinearity issue is
worsened after supervised feature selection: if one feature is significant, then similar
features are more likely to be significant as well. Consequently, supervised feature selection
yields a selected feature space with a high number of correlated and false positive features,
hampering the model's performance.
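The >99% figure follows directly from the family-wise error rate under the (simplifying) assumption of independent tests: with m tests at significance level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m. A quick check for the feature counts in this study:

```python
alpha = 0.05
m = 2000  # roughly the number of features tested individually

# Family-wise error rate assuming independent tests:
fwer = 1 - (1 - alpha) ** m
print(f"P(at least one false positive) = {fwer:.6f}")  # effectively 1.0

# A Bonferroni correction keeps the family-wise rate near alpha:
alpha_bonf = alpha / m
fwer_corrected = 1 - (1 - alpha_bonf) ** m
print(f"After Bonferroni: {fwer_corrected:.4f}")  # ~0.0488
```

With m = 2000 the uncorrected rate is indistinguishable from 1, which is why univariate screening of a large feature bank is so prone to false discoveries.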
As an alternative feature selection method, Boruta has performed well in other domains. However,
in this study, we found that the Boruta feature selection method delivered the worst result. One
reason is the sample size: because the sample size was small, the Boruta algorithm lacked the
statistical power to differentiate between random and meaningful features. As a result, only two
features were selected by Boruta even when the cut-off was varied from 0.05 to 0.1. Since these
two features failed to explain much variance in patients' outcomes, the Boruta-guided model
yielded the worst performance among the four fusion methods. In a large-sample setting,
Boruta-guided feature reduction may achieve acceptable results.
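Boruta's core mechanism is to compare each real feature's importance against shuffled "shadow" copies of the features. A single-iteration sketch of that idea using scikit-learn (the real algorithm iterates this comparison and applies statistical tests; the toy data below is illustrative, with signal planted in the first two features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 68 training samples (as in this study), 20 features,
# only the first 2 actually carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(68, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=68) > 0).astype(int)

# Boruta's trick: append shuffled "shadow" copies of every feature,
# then keep only real features whose importance beats the best shadow.
shadow = rng.permuted(X, axis=0)      # shuffle each column independently
X_aug = np.hstack([X, shadow])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[:20]
shadow_max = rf.feature_importances_[20:].max()
selected = np.where(real_imp > shadow_max)[0]
print(selected)  # with small n, often only a few features survive
```

With so few samples, importance estimates are noisy and the shadow threshold becomes hard to beat, which is consistent with Boruta retaining only two features in this study.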
It has been shown that, given a large feature size, unsupervised feature fusion methods provide
the best performance (Yucheng Zhang et al., 2017). Our research confirmed these previous
findings. The PCA-based feature fusion method achieved performance similar to the models
discussed in Study 1. It is interesting to note that, when testing the prognosis performance of
each feature bank, LungRes had an AUC of 0.74, while the PCA-based feature fusion model had a
similar AUC (AUC = 0.72). This result indicates that the PCA-based feature fusion model was able
to collect and fuse useful information into a small number of components, reducing the
computation time and the overall complexity of the model.
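PCA-based fusion as described above can be sketched in a few lines with scikit-learn. The bank sizes below are placeholders (random matrices stand in for the real feature banks), and the component count is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical feature banks for 98 patients (sizes are illustrative).
pyradiomics = rng.normal(size=(98, 1300))
imgres = rng.normal(size=(98, 512))
lungres = rng.normal(size=(98, 512))

# Unsupervised fusion: concatenate the banks, standardize each feature,
# then project onto a small number of principal components.
fused = np.hstack([pyradiomics, imgres, lungres])   # (98, 2324)
scaled = StandardScaler().fit_transform(fused)
components = PCA(n_components=10, random_state=0).fit_transform(scaled)
print(components.shape)  # (98, 10): a compact fused representation
```

Because PCA never looks at the outcome, this fusion step sidesteps the multiple comparison problem entirely; a classifier is then trained on the handful of components rather than on thousands of raw features.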
In the end, the risk score-based fusion method provided the best overall performance in PDAC
prognosis. Since the risk scores were generated by random forests, they are non-linear mappings
of the original feature space and hence provide a better fit for complex patterns. In future
medical imaging studies, as more transfer learning-based feature banks are established, risk
score-based fusion will play an important role.
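The risk-score fusion idea resembles stacking: one model per feature bank produces a per-patient score, and a small final model is fit on the scores. A minimal sketch, assuming random forests as the bank-level models and out-of-fold probabilities as risk scores to avoid optimistic in-sample scores (the data, bank sizes, and labels below are toy placeholders, not the thesis pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n = 98
banks = {name: rng.normal(size=(n, p))
         for name, p in [("pyradiomics", 100), ("imgres", 64), ("lungres", 64)]}
y = rng.integers(0, 2, size=n)  # binary outcome (toy labels)

# One random forest per feature bank; out-of-fold predicted probabilities
# serve as that bank's risk score for each patient.
risk_scores = np.column_stack([
    cross_val_predict(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, method="predict_proba")[:, 1]
    for X in banks.values()
])

# The final prognosis model is fit on the three risk scores only.
final_model = LogisticRegression().fit(risk_scores, y)
print(risk_scores.shape)  # (98, 3)
```

Collapsing each bank to a single score before fusion keeps the final model's input dimension tiny regardless of how large the individual feature banks grow.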
6.2.2 Strengths and limitations
6.2.2.1 Strengths
We were one of the first groups to investigate the relationships between transfer learning and
radiomics features for PDAC in CT images. The correlation mapping is valuable for future medical
imaging-based transfer learning studies. We identified that transfer learning features can
resemble certain first-order radiomics features, which depict the shape and distribution of
pixel intensities. These correlations may become the foundation for fusing radiomics and
transfer learning features.
Furthermore, we compared the prognosis performance of the proposed fusion method with three
other existing feature reduction methods for PDAC prognosis. The best-performing model achieved
an AUC of 0.83, currently the highest reported for this task, outperforming other prognostic
biomarkers including image markers and CA19-9. Moreover, this AUC was achieved in an independent
validation cohort, avoiding the common circular reasoning problem.
Finally, the high performance of the risk score-based fusion method demonstrated its potential
for cancer prognosis. As more researchers recognize the importance of transfer learning, more
transfer learning feature banks will be developed in the medical imaging field. Instead of
selecting features from those feature banks through supervised feature reduction, fusing
features using model-generated risk scores may provide better performance.
6.2.2.2 Limitations
The main limitation of this study is the sample size. Given the small sample size (total: 98;
training data: 68), the Boruta method could not distinguish meaningful features, resulting in a
lower-performing model. Testing these methods on a larger dataset would provide stronger
evidence against the null hypothesis.
Secondly, although the risk score-based method achieved the best performance, its interpretation
is challenging. Compared to supervised feature selection methods, where the contributing
formulas can be identified, the risk score comes from a non-linear combination of features;
building a model on top of other models can be considered a black box. Further investigation is
required to address this issue.
Finally, in this study, we used a binary outcome (survival vs. death). For cancers with poor
prognosis, a binary prognosis may not be meaningful. A model that provides survival probability
at any given time may be more practical and translational.
6.2.3 Implications
We have shown that transfer learning features and radiomics features have a complementary
relationship. Although transfer learning features may achieve higher performance in specific
tasks, radiomics studies are still valuable: compared to the results using transfer learning
features only, the fusion method had better performance, confirming the additional value of
radiomics features.
As more studies in the medical imaging field adopt transfer learning methods, an increasing
number of feature banks will become available. Our results suggest that the proposed risk
score-based feature fusion method may become a standard protocol in the deep radiomics analytic
pipeline. Additionally, given the typical large-p, small-n datasets in radiomics studies,
caution should be taken when applying supervised feature selection approaches.
6.3: Study 3
6.3.1 Discussion
As a statistical method, survival analysis is commonly used in clinical research to depict
patients' survival patterns and identify potential risk factors (Lao et al., 2017). The Cox
Proportional Hazards (CPH) model, a semiparametric model, is often used in translational
research. However, the CPH model assumes that features make a linear contribution to the risk,
oversimplifying the relationship between biomarkers and outcomes. In this study, we used a
modified loss function in a CNN and achieved a higher concordance index compared to that of
radiomics markers.
It has been shown that a modified loss function in deep learning architectures yields a stronger
model than traditional CPH (Gensheimer & Narasimhan, n.d.; Katzman et al., 2016). Non-linear
activation layers in current deep learning architectures provide a non-linear mapping from input
to output, enabling the CNN-Survival architecture to model non-linear survival patterns.
However, in previous studies, the proposed deep learning-based survival models took vectors as
inputs, such as age, gender, and other clinical factors. In this study, through transfer
learning, we were able to train a deep learning-based survival model (CNN-Survival) directly on
CT images. Kernels in this prognosis model were optimized for extracting features related to
overall survival. Hence, the CNN-Survival model outperforms the traditional radiomics model in
terms of concordance index.
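The modified loss used by Cox-style deep survival models (e.g., Katzman et al., 2016) is the negative log partial likelihood evaluated on the network's risk outputs. A minimal NumPy sketch of that loss, with a sanity check on toy data (in practice this would be wrapped as a custom Keras loss; ties in event times are ignored here for simplicity):

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood for predicted risk scores.
    risk:  (n,) model outputs (log hazard ratios)
    time:  (n,) follow-up times
    event: (n,) 1 = death observed, 0 = censored
    """
    order = np.argsort(-time)                 # sort by descending follow-up time
    risk, event = risk[order], event[order].astype(bool)
    # log(sum of exp(risk)) over the risk set at each event time:
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.mean(risk[event] - log_risk_set[event])

# Sanity check: a risk ordering that matches the deaths (earlier death =
# higher risk) should give a lower loss than the reversed ordering.
time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 1])
good = neg_log_partial_likelihood(np.array([3.0, 2.0, 1.0, 0.0]), time, event)
bad = neg_log_partial_likelihood(np.array([0.0, 1.0, 2.0, 3.0]), time, event)
print(good < bad)  # True
```

Because the loss only compares risk scores within risk sets, it uses both event and duration information, which is exactly why survival data (not just a binary label) is needed to train such a model.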
Training a six-layer CNN-Survival model requires tuning more than one million parameters, which
is not feasible in most small-sample studies. However, using transfer learning, the top
convolution layers can be trained on another dataset with a larger sample size. Compared to
other transfer learning studies, which require medical images and a binary outcome, pre-training
a CNN-Survival requires not only the binary outcome but also duration information, which is
costly to collect. In this study, we used a popular Non-Small Cell Lung Cancer dataset
containing CT images and survival data for 422 patients. In the future, if a larger dataset
(sample size > 1,000) becomes available, the CNN architecture can be made deeper than the
six-layer CNN used in this study. The pre-trained model would then be able to extract more
informative features, further improving prognosis performance.
Future radiomics studies will evolve from manually defined features to a fusion of pre-defined
features and transfer learning features. Additionally, prognosis models will progress from
binary classifiers to hazard probability models. Consequently, new survival modelling methods
are required to handle these complex tasks, and the CNN-Survival model is a step toward this
goal. With larger pre-training datasets and validation cohorts, the proposed model has the
potential to become a standardized component of future deep radiomics analytics pipelines.
6.3.2 Strengths and limitations
6.3.2.1 Strengths
We were the first group to use a modified loss function and transfer learning to train a
CNN-based survival model on PDAC CT images. In an independent validation cohort, our
CNN-Survival achieved a better concordance index than traditional radiomics models. Our results
suggest that, instead of building binary prognosis models, training a CNN-based survival model
is also feasible in a small cohort with the help of transfer learning.
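The concordance index used to compare these models is Harrell's C: among all comparable patient pairs, the fraction where the higher-risk patient experienced the event earlier. A small, self-contained sketch (tied event times are ignored here; the toy inputs are illustrative):

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's C: fraction of comparable pairs where the higher-risk
    patient had the event earlier. Tied risks count as half-concordant."""
    concordant = 0.0
    permissible = 0
    for i in range(len(time)):
        if not event[i]:
            continue                 # pairs are anchored on observed events
        for j in range(len(time)):
            if time[j] > time[i]:    # patient j outlived patient i
                permissible += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / permissible

time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 0])       # last patient censored
print(concordance_index(np.array([3.0, 2.0, 1.0, 0.0]), time, event))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect risk ordering, which is why C, unlike AUC, is the natural metric for censored survival outcomes.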
In addition, the LungRes pre-trained CNN used in Study 1 and Study 2 was tuned using a binary
prognosis outcome (survival versus death). In this study, the pre-trained model was instead
optimized for survival probability at a given time. By doing so, deep radiomics features
extracted from this pre-trained CNN would be theoretically associated with patients' survival
patterns, offering more precise prognostic information for healthcare professionals.
6.3.2.2 Limitations
As discussed above, training a CNN-Survival requires not only the binary outcome but also
duration information, which is difficult to collect. The largest related open-source dataset we
could find contained 422 patients, collected by Aerts et al. (Aerts et al., 2014). To take full
advantage of these data, instead of taking only the largest ROI from each patient, we extracted
an ROI from every slice with expert annotations, gathering 5,000 ROIs from those 422 patients.
However, these ROIs are not independent, since many of them are adjacent to each other. In
addition, some of the ROIs are extremely small, making it difficult to extract useful features.
Although a large dropout rate was applied to control overfitting, the model started to overfit
after 20 epochs. A larger dataset (sample size > 1,000) would resolve this issue.
Chapter 7: Conclusions
Our studies showed that, for resectable Pancreatic Ductal Adenocarcinoma, transfer
learning-based deep radiomics features have the potential to provide more accurate
prognostication than conventional manually defined radiomics features.
Study 1 suggested that the performance of transfer learning features is associated with the
pre-training domain: LungRes, which was pre-trained on lung CT images, outperformed ImgRes,
which was pre-trained on ImageNet. This result indicates that, in future studies, feature
extractors pre-trained on medical images will play a more important role. In Study 2, by fusing
features from pre-trained feature extractors with the conventional PyRadiomics feature bank,
prognostication performance was further improved from 0.74 to 0.83 in terms of AUC, indicating
that transfer learning and conventional radiomic features carry different information from
medical images.
In the final study, by modifying the loss function in the CNN architecture, a CNN-Survival model
was trained. Taking CT scans as inputs and returning survival probabilities at given times,
CNN-Survival has the potential to provide more practical information for healthcare providers in
designing personalized treatment plans for resectable PDAC patients.
These three studies provide evidence that transfer learning approaches hold substantial
potential for the medical imaging field. Through transfer learning, more information can be
extracted from medical images, contributing to improved prognosis performance in our resectable
PDAC cohorts.
Chapter 8: Future directions
Our research provided evidence that CNNs pre-trained on medical images have the potential to
become standardized feature extractors. We have shown that CNNs pre-trained on medical images
may be more suitable for medical imaging tasks than CNNs pre-trained on natural images such as
ImageNet. In this work, we adopted two open-source lung CT databases with 888 and 422 patients.
If future studies adopt larger pre-training datasets with CT images from more than 1,000
patients, transfer learning approaches may achieve further improved performance. We anticipate
that, in the near future, such high-performance pre-trained CNNs will become standardized deep
radiomics feature extractors.
Secondly, current deep radiomics research mainly focuses on overall survival as the outcome.
However, diagnosis and prediction of treatment response for PDAC and other types of cancer are
also extremely valuable to patients and healthcare professionals. Further deep radiomics
research should not only focus on prognostication but also aim to solve other relevant research
questions, including early detection and the design of personalized treatment plans.
Finally, future deep radiomics studies should focus on visualization and interpretation. It is
critical to understand how deep radiomics features capture information, and which types of deep
radiomics features play an essential role in the context of resectable PDAC prognostication.
This will assist researchers and healthcare professionals in translating deep radiomics studies
into clinical practice.
References
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary
Reviews: Computational Statistics, 2(4), 433–459. https://doi.org/10.1002/wics.101
Abdullah, S. L. S., Hambali, H., & Jamil, N. (2012). An accurate thresholding-based
segmentation technique for natural images. In 2012 IEEE Symposium on Humanities,
Science and Engineering Research (pp. 919–922). IEEE.
https://doi.org/10.1109/SHUSER.2012.6269007
Adamska, A., Domenichini, A., & Falasca, M. (2017a). Pancreatic Ductal Adenocarcinoma:
Current and Evolving Therapies. International Journal of Molecular Sciences, 18(7).
https://doi.org/10.3390/ijms18071338
Adamska, A., Domenichini, A., & Falasca, M. (2017b). Pancreatic Ductal Adenocarcinoma:
Current and Evolving Therapies. International Journal of Molecular Sciences, 18(7).
https://doi.org/10.3390/ijms18071338
Aerts, H. J., Velazquez, E. R., Leijenaar, R. T., Parmar, C., Grossmann, P., Carvalho, S., …
Lambin, P. (2014). Decoding tumour phenotype by noninvasive imaging using a
quantitative radiomics approach. Nat Commun, 5, 4006.
https://doi.org/10.1038/ncomms5006
Afshar, P., Mohammadi, A., Plataniotis, K. N., Oikonomou, A., & Benali, H. (n.d.). From Hand-
Crafted to Deep Learning-based Cancer Radiomics: Challenges and Opportunities.
Retrieved from https://arxiv.org/pdf/1808.07954.pdf
Ahmad, N. A., Lewis, J. D., Ginsberg, G. G., Haller, D. G., Morris, J. B., Williams, N. N., …
Kochman, M. L. (2001). Long term survival after pancreatic resection for pancreatic
adenocarcinoma. The American Journal of Gastroenterology, 96(9), 2609–2615.
https://doi.org/10.1111/j.1572-0241.2001.04123.x
Alom, M. Z., Hasan, M., Yakopcic, C., Taha, T. M., & Asari, V. K. (2018). Recurrent Residual
Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image
Segmentation. Retrieved from http://arxiv.org/abs/1802.06955
André, T., de Gramont, A., Vernerey, D., Chibaudel, B., Bonnetain, F., Tijeras-Raballand, A., …
de Gramont, A. (2015). Adjuvant Fluorouracil, Leucovorin, and Oxaliplatin in Stage II to
III Colon Cancer: Updated 10-Year Survival and Outcomes According to BRAF Mutation
and Mismatch Repair Status of the MOSAIC Study. Journal of Clinical Oncology, 33(35),
4176–4187. https://doi.org/10.1200/JCO.2015.63.4238
Antony, J., McGuinness, K., Connor, N. E. O., & Moran, K. (2016). Quantifying radiographic
knee osteoarthritis severity using deep convolutional neural networks. Quantifying
Radiographic Knee Osteoarthritis Severity Using Deep Convolutional Neural Networks.
Retrieved from http://arxiv.org/abs/1609.02469
Anwar, S. M., Majid, M., Qayyum, A., Awais, M., Alnowami, M., & Khan, M. K. (2018).
Medical Image Analysis using Convolutional Neural Networks: A Review. Journal of
Medical Systems, 42(11), 226. https://doi.org/10.1007/s10916-018-1088-1
Armato, S. G., McLennan, G., Bidaut, L., McNitt-Gray, M. F., Meyer, C. R., Reeves, A. P., …
Clarke, L. P. (2011). The Lung Image Database Consortium (LIDC) and Image Database
Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans.
Medical Physics, 38(2), 915–931. https://doi.org/10.1118/1.3528204
Arnold, L. D., Patel, A. V., Yan, Y., Jacobs, E. J., Thun, M. J., Calle, E. E., & Colditz, G. A.
(2009). Are Racial Disparities in Pancreatic Cancer Explained by Smoking and
Overweight/Obesity? Cancer Epidemiology Biomarkers & Prevention, 18(9), 2397–2405.
https://doi.org/10.1158/1055-9965.EPI-09-0080
B, R. W. (2013). Advances in Neural Networks – ISNN 2013, 7951, 12–20.
https://doi.org/10.1007/978-3-642-39065-4
Bai, Y., Lin, Y., Tian, J., Shi, D., Cheng, J., Haacke, E. M., … Wang, M. (2016). Grading of
Gliomas by Using Monoexponential, Biexponential, and Stretched Exponential Diffusion-
weighted MR Imaging and Diffusion Kurtosis MR Imaging. Radiology, 278(2), 496–504.
https://doi.org/10.1148/radiol.2015142173
Ballehaninna, U. K., & Chamberlain, R. S. (2012). The clinical utility of serum CA 19-9 in the
diagnosis, prognosis and management of pancreatic adenocarcinoma: An evidence based
appraisal. Journal of Gastrointestinal Oncology, 3(2), 105–119.
https://doi.org/10.3978/j.issn.2078-6891.2011.021
Bartel, D. P. (2009). MicroRNAs: Target Recognition and Regulatory Functions. Cell, 136(2),
215–233. https://doi.org/10.1016/j.cell.2009.01.002
Becker, A. E., Hernandez, Y. G., Frucht, H., & Lucas, A. L. (2014). Pancreatic ductal
adenocarcinoma: Risk factors, screening, and early detection. World Journal of
Gastroenterology, 20(32), 11182–11198. https://doi.org/10.3748/wjg.v20.i32.11182
Benson, A. B., Venook, A. P., Al-Hawary, M. M., Cederquist, L., Chen, Y.-J., Ciombor, K.
K., … Freedman-Cass, D. A. (2018). NCCN Guidelines Insights: Colon Cancer, Version
2.2018. Journal of the National Comprehensive Cancer Network, 16(4), 359–369.
https://doi.org/10.6004/jnccn.2018.0021
Blagus, R., Lusa, L., Bishop, C., He, H., Garcia, E., Daskalaki, S., … Klaar, S. (2013). SMOTE
for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1), 106.
https://doi.org/10.1186/1471-2105-14-106
Bloomston, M., Frankel, W. L., Petrocca, F., Volinia, S., Alder, H., Hagan, J. P., … Croce, C. M.
(2007). MicroRNA Expression Patterns to Differentiate Pancreatic Adenocarcinoma From
Normal Pancreas and Chronic Pancreatitis. JAMA, 297(17), 1901.
https://doi.org/10.1001/jama.297.17.1901
Bootcov, M. R., Bauskin, A. R., Valenzuela, S. M., Moore, A. G., Bansal, M., He, X. Y., …
Breit, S. N. (1997). MIC-1, a novel macrophage inhibitory cytokine, is a divergent member
of the TGF- superfamily. Proceedings of the National Academy of Sciences, 94(21),
11514–11519. https://doi.org/10.1073/pnas.94.21.11514
Bosetti, C., Lucenteforte, E., Silverman, D. T., Petersen, G., Bracci, P. M., Ji, B. T., … La
Vecchia, C. (2012). Cigarette smoking and pancreatic cancer: an analysis from the
International Pancreatic Cancer Case-Control Consortium (Panc4). Annals of Oncology,
23(7), 1880–1888. https://doi.org/10.1093/annonc/mdr541
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1), 49–64.
https://doi.org/10.1023/A:1018046112532
Buckhaults, P., Rago, C., Vogelstein, B., St. Croix, B., Romans, K. E., Saha, S., … Kinzler, K.
W. (2001). Secreted and cell surface genes expressed in benign and malignant colorectal
tumors. Cancer Research, 61(19), 6996–7001.
Bünger, S., Laubert, T., Roblick, U. J., & Habermann, J. K. (2011). Serum biomarkers for
improved diagnostic of pancreatic cancer: a current overview. Journal of Cancer Research
and Clinical Oncology, 137(3), 375–389. https://doi.org/10.1007/s00432-010-0965-x
Capello, M., Lee, M., Wang, H., Babel, I., Katz, M. H., Fleming, J. B., … Hanash, S. M. (2015).
Carboxylesterase 2 as a Determinant of Response to Irinotecan and Neoadjuvant
FOLFIRINOX Therapy in Pancreatic Ductal Adenocarcinoma. JNCI: Journal of the
National Cancer Institute, 107(8). https://doi.org/10.1093/jnci/djv132
Caponi, S., Funel, N., Frampton, A. E., Mosca, F., Santarpia, L., Van der Velde, A. G., …
Giovannetti, E. (2013). The good, the bad and the ugly: a tale of miR-101, miR-21 and
miR-155 in pancreatic intraductal papillary mucinous neoplasms. Annals of Oncology,
24(3), 734–741. https://doi.org/10.1093/annonc/mds513
Chandrakumar, T., & Kathirvel, R. (n.d.). Classifying Diabetic Retinopathy using Deep Learning
Architecture. Retrieved from www.ijert.org
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
https://doi.org/10.1613/jair.953
Chen, S.-Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for multiple
comparisons. Journal of Thoracic Disease, 9(6), 1725–1729.
https://doi.org/10.21037/jtd.2017.05.34
Chen, X., Oshima, K., Schott, D., Wu, H., Hall, W., Song, Y., … Li, X. A. (2017). Assessment
of treatment response during chemoradiation therapy for pancreatic cancer based on
quantitative radiomic analysis of daily CTs : An exploratory study. PLoS ONE, 12(6), 1–14.
https://doi.org/10.1371/journal.pone.0178961
Chen, Y.-Z., Liu, D., Zhao, Y.-X., Wang, H.-T., Gao, Y., & Chen, Y. (2014). Diagnostic
Performance of Serum Macrophage Inhibitory Cytokine-1 in Pancreatic Cancer: A Meta-
Analysis and Meta-Regression Analysis. DNA and Cell Biology, 33(6), 370–377.
https://doi.org/10.1089/dna.2013.2237
Ching, T., Zhu, X., & Garmire, L. X. (2018). Cox-nnet: An artificial neural network method for
prognosis prediction of high-throughput omics data. PLoS Computational Biology, 14(4),
e1006076. https://doi.org/10.1371/journal.pcbi.1006076
Cho, J., Lee, K., Shin, E., Choy, G., & Do, S. (2016). How much data is needed to train a
medical image deep learning system to achieve necessary high accuracy? Retrieved from
https://arxiv.org/pdf/1511.06348.pdf
Chollet, F., & Others, A. (2015). Keras. Retrieved from https://keras.io
Christ, P. F., Elshaer, M. E. A., Ettlinger, F., Tatavarty, S., Bickel, M., Bilic, P., … Menze, B. H.
(2016). Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully
Convolutional Neural Networks and 3D Conditional Random Fields (pp. 415–423).
Springer, Cham. https://doi.org/10.1007/978-3-319-46723-8_48
Christ, P. F., Ettlinger, F., Grün, F., Ezzeldin, M., Elshaer, A., Lipková, J., … Menze, B. (n.d.).
Automatic Liver and Tumor Segmentation of CT and MRI Volumes Using Cascaded Fully
Convolutional Neural Networks. Retrieved from https://arxiv.org/pdf/1702.05970.pdf
Shie, C.-K., Chuang, C.-H., Chou, C.-N., Wu, M.-H., & Chang, E. Y. (n.d.). Transfer
Representation Learning for Medical Image Analysis. Retrieved from
http://infolab.stanford.edu/~echang/HTC_OM_Final.pdf
Cireşan, D. C., Giusti, A., Gambardella, L. M., & Schmidhuber, J. (n.d.). Deep Neural Networks
Segment Neuronal Membranes in Electron Microscopy Images. Retrieved from http://www.idsia.ch/
Clark, T., Zhang, J., Baig, S., Wong, A., Haider, M. A., & Khalvati, F. (2017). Fully automated
segmentation of prostate whole gland and transition zone in diffusion-weighted MRI using
convolutional neural networks. Journal of Medical Imaging, 4(04), 1.
https://doi.org/10.1117/1.JMI.4.4.041307
Conroy, T., Desseigne, F., Ychou, M., Bouché, O., Guimbaud, R., Bécouarn, Y., … Ducreux, M.
(2011). FOLFIRINOX versus Gemcitabine for Metastatic Pancreatic Cancer. New England
Journal of Medicine, 364(19), 1817–1825. https://doi.org/10.1056/NEJMoa1011923
Coroller, T. P., Grossmann, P., Hou, Y., Rios Velazquez, E., Leijenaar, R. T. H., Hermann,
G., … Aerts, H. J. W. L. (2015a). CT-based radiomic signature predicts distant metastasis in
lung adenocarcinoma. Radiotherapy and Oncology, 114(3), 345–350.
https://doi.org/10.1016/j.radonc.2015.02.015
Coroller, T. P., Grossmann, P., Hou, Y., Rios Velazquez, E., Leijenaar, R. T. H., Hermann,
G., … Aerts, H. J. W. L. (2015b). CT-based radiomic signature predicts distant metastasis
in lung adenocarcinoma. Radiotherapy and Oncology, 114(3), 345–350.
https://doi.org/10.1016/j.radonc.2015.02.015
Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society.
Series B (Methodological). WileyRoyal Statistical Society. https://doi.org/10.2307/2985181
Cozzi, L., Comito, T., Fogliata, A., Franzese, C., Franceschini, D., Bonifacio, C., … Scorsetti,
M. (2019). Computed tomography based radiomic signature as predictive of survival and
local control after stereotactic body radiation therapy in pancreatic carcinoma. PLOS ONE,
14(1), e0210758. https://doi.org/10.1371/journal.pone.0210758
Cui, Y., Song, J., Pollom, E., Alagappan, M., Shirato, H., Chang, D. T., … Li, R. (2016).
Quantitative Analysis of 18F-Fluorodeoxyglucose Positron Emission Tomography
Identifies Novel Prognostic Imaging Biomarkers in Locally Advanced Pancreatic Cancer
Patients Treated With Stereotactic Body Radiation Therapy. International Journal of
Radiation Oncology*Biology*Physics, 96(1), 102–109.
https://doi.org/10.1016/j.ijrobp.2016.04.034
Cunliffe, A., Armato, S. G., Castillo, R., Pham, N., Guerrero, T., & Al-Hallaq, H. A. (2015).
Lung Texture in Serial Thoracic Computed Tomography Scans: Correlation of Radiomics-
based Features With Radiation Therapy Dose and Radiation Pneumonitis Development.
International Journal of Radiation Oncology*Biology*Physics, 91(5), 1048–1056.
https://doi.org/10.1016/j.ijrobp.2014.11.030
De Fauw, J., Ledsam, J. R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., …
Ronneberger, O. (2018). Clinically applicable deep learning for diagnosis and referral in
retinal disease. Nature Medicine, 24(9), 1342–1350. https://doi.org/10.1038/s41591-018-
0107-6
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two
or more correlated receiver operating characteristic curves: a nonparametric approach.
Biometrics, 44(3), 837–845. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/3203132
Deng, J., Dong, W., Socher, R., Li, L.-J., Kai Li, & Li Fei-Fei. (2009). ImageNet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition (pp. 248–255). IEEE. https://doi.org/10.1109/CVPR.2009.5206848
Dietterich, T. G. (2000). Ensemble Methods in Machine Learning (pp. 1–15). Springer, Berlin,
Heidelberg. https://doi.org/10.1007/3-540-45014-9_1
Dillhoff, M., Liu, J., Frankel, W., Croce, C., & Bloomston, M. (2008). MicroRNA-21 is
Overexpressed in Pancreatic Cancer and a Potential Predictor of Survival. Journal of
Gastrointestinal Surgery, 12(12), 2171–2176. https://doi.org/10.1007/s11605-008-0584-x
Du, S. S., Wang, Y., Zhai, X., Balakrishnan, S., Salakhutdinov, R., & Singh, A. (n.d.). How
Many Samples are Needed to Estimate a Convolutional Neural Network? Retrieved from
https://papers.nips.cc/paper/7320-how-many-samples-are-needed-to-estimate-a-
convolutional-neural-network.pdf
Eibl, A. S. and G. (2015). Pancreatic Ductal Adenocarcinoma. Pancreapedia: The Exocrine
Pancreas Knowledge Base. https://doi.org/10.3998/PANC.2015.14
Eilaghi, A., Baig, S., Zhang, Y., Zhang, J., Karanicolas, P., Gallinger, S., … Haider, M. A.
(2017). CT texture features are associated with overall survival in pancreatic ductal
adenocarcinoma – a quantitative analysis, 1–7. https://doi.org/10.1186/s12880-017-0209-5
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017).
Dermatologist-level classification of skin cancer with deep neural networks. Nature,
542(7639), 115–118. https://doi.org/10.1038/nature21056
Etymologia: Bonferroni correction. (2015). Emerging Infectious Diseases, 21(2), 289.
https://doi.org/10.3201/EID2102.ET2102
Farrell, J. J., Elsaleh, H., Garcia, M., Lai, R., Ammar, A., Regine, W. F., … Mackey, J. R.
(2009). Human Equilibrative Nucleoside Transporter 1 Levels Predict Response to
Gemcitabine in Patients With Pancreatic Cancer. Gastroenterology, 136(1), 187–195.
https://doi.org/10.1053/j.gastro.2008.09.067
Fawcett, T. (2005). An introduction to ROC analysis. Pattern Recognition Letters.
https://doi.org/10.1016/j.patrec.2005.10.010
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., & Amorim Fernández-Delgado,
D. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification
Problems? Journal of Machine Learning Research, 15, 3133–3181.
Ferrone, C. R., Pieretti-Vanmarcke, R., Bloom, J. P., Zheng, H., Szymonifka, J., Wargo, J. A., …
Warshaw, A. L. (2012). Pancreatic ductal adenocarcinoma: long-term survival does not
equal cure. Surgery, 152(3 Suppl 1), S43-9. https://doi.org/10.1016/j.surg.2012.05.020
FOLFIRINOX versus Gemcitabine for Metastatic Pancreatic Cancer. (2011). New England
Journal of Medicine, 365(8), 768–769. https://doi.org/10.1056/NEJMc1107627
Foucher, E. D., Ghigo, C., Chouaib, S., Galon, J., Iovanna, J., & Olive, D. (2018). Pancreatic
Ductal Adenocarcinoma: A Strong Imbalance of Good and Bad Immunological Cops in the
Tumor Microenvironment. Frontiers in Immunology, 9, 1044.
https://doi.org/10.3389/fimmu.2018.01044
Fox, J., & Weisberg, S. (2011). Cox Proportional-Hazards Regression for Survival Data in R.
Retrieved from
https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Cox-Regression.pdf
Fujita, H., Ohuchida, K., Mizumoto, K., Itaba, S., Ito, T., Nakata, K., … Tanaka, M. (2010).
Gene Expression Levels as Predictive Markers of Outcome in Pancreatic Cancer after
Gemcitabine-Based Adjuvant Chemotherapy. Neoplasia, 12(10), 807–817.
https://doi.org/10.1593/neo.10458
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–
202. https://doi.org/10.1007/BF00344251
Gabryś, H. S., Buettner, F., Sterzing, F., Hauswald, H., & Bangert, M. (2018). Design and
Selection of Machine Learning Methods Using Radiomics and Dosiomics for Normal
Tissue Complication Probability Modeling of Xerostomia. Frontiers in Oncology, 8.
https://doi.org/10.3389/fonc.2018.00035
Ganeshan, B., Abaleke, S., Young, R. C. D., Chatwin, C. R., & Miles, K. A. (2010). Texture
analysis of non-small cell lung cancer on unenhanced computed tomography: Initial
evidence for a relationship with tumour glucose metabolism and stage. Cancer Imaging,
10(1), 137–143. https://doi.org/10.1102/1470-7330.2010.0021
Ganeshan, B., Panayiotou, E., Burnand, K., Dizdarevic, S., & Miles, K. (2012). Tumour
heterogeneity in non-small cell lung carcinoma assessed by CT texture analysis: A potential
marker of survival. European Radiology, 22(4), 796–802. https://doi.org/10.1007/s00330-
011-2319-8
Gao, X., Lin, S., & Wong, T. Y. (2015). Automatic Feature Learning to Grade Nuclear Cataracts
Based on Deep Learning. IEEE Transactions on Biomedical Engineering, 62(11), 2693–
2701. https://doi.org/10.1109/TBME.2015.2444389
Gensheimer, M. F., & Narasimhan, B. (n.d.). A Scalable Discrete-Time Survival Model for
Neural Networks. Retrieved from http://github.com/MGensheimer/nnet-survival
George, B., Seals, S., & Aban, I. (2014). Survival analysis and regression models. Journal of
Nuclear Cardiology : Official Publication of the American Society of Nuclear Cardiology,
21(4), 686–694. https://doi.org/10.1007/s12350-014-9908-2
George, D., Shen, H., & Huerta, E. A. (2017). Deep Transfer Learning: A new deep learning
glitch classification method for advanced LIGO. Retrieved from
http://arxiv.org/abs/1706.07446
Gillies, R. J., Kinahan, P. E., & Hricak, H. (2015). Radiomics: Images Are More than Pictures,
They Are Data. Radiology, 278(2), 563–577. https://doi.org/10.1148/radiol.2015151169
Gold, D. V., Karanjawala, Z., Modrak, D. E., Goldenberg, D. M., & Hruban, R. H. (2007).
PAM4-Reactive MUC1 Is a Biomarker for Early Pancreatic Adenocarcinoma. Clinical
Cancer Research, 13(24), 7380–7387. https://doi.org/10.1158/1078-0432.CCR-07-1488
Gold, David V., Gaedcke, J., Ghadimi, B. M., Goggins, M., Hruban, R. H., Liu, M., …
Goldenberg, D. M. (2013). PAM4 enzyme immunoassay alone and in combination with CA
19-9 for the detection of pancreatic adenocarcinoma. Cancer, 119(3), 522–528.
https://doi.org/10.1002/cncr.27762
Gold, David V., Lew, K., Maliniak, R., Hernandez, M., & Cardillo, T. (1994). Characterization
of monoclonal antibody PAM4 reactive with a pancreatic cancer mucin. International
Journal of Cancer, 57(2), 204–210. https://doi.org/10.1002/ijc.2910570213
Goonetilleke, K. S., & Siriwardena, A. K. (2007). Systematic review of carbohydrate antigen
(CA 19-9) as a biochemical marker in the diagnosis of pancreatic cancer. European Journal
of Surgical Oncology (EJSO), 33(3), 266–270. https://doi.org/10.1016/j.ejso.2006.10.004
Gourgou-Bourgade, S., Bascoul-Mollevi, C., Desseigne, F., Ychou, M., Bouché, O., Guimbaud,
R., … Conroy, T. (2013). Impact of FOLFIRINOX Compared With Gemcitabine on
Quality of Life in Patients With Metastatic Pancreatic Cancer: Results From the PRODIGE
4/ACCORD 11 Randomized Trial. Journal of Clinical Oncology, 31(1), 23–29.
https://doi.org/10.1200/JCO.2012.44.4869
Greenhalf, W., Ghaneh, P., Neoptolemos, J. P., Palmer, D. H., Cox, T. F., Lamb, R. F., …
Büchler, M. W. (2014). Pancreatic Cancer hENT1 Expression and Survival From
Gemcitabine in Patients From the ESPAC-3 Trial. JNCI: Journal of the National Cancer
Institute, 106(1). https://doi.org/10.1093/jnci/djt347
Kim, H. G., Choi, Y., & Ro, Y. M. (n.d.). Modality-bridge Transfer Learning for Medical
Image Classification. Retrieved from https://arxiv.org/ftp/arxiv/papers/1708/1708.03111.pdf
Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., … Webster, D.
R. (2016). Development and Validation of a Deep Learning Algorithm for Detection of
Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22), 2402.
https://doi.org/10.1001/jama.2016.17216
Haider, M. A., Vosough, A., Khalvati, F., Kiss, A., Ganeshan, B., & Bjarnason, G. A. (2017). CT
texture analysis: a potential tool for prediction of survival in patients with metastatic clear
cell carcinoma treated with sunitinib. Cancer Imaging, 17(1).
https://doi.org/10.1186/s40644-017-0106-8
Hambarde, P., Talbar, S. N., Sable, N., Mahajan, A., Chavan, S. S., & Thakur, M. (2019).
Radiomics for peripheral zone and intra-prostatic urethra segmentation in MR imaging.
Biomedical Signal Processing and Control, 51, 19–29.
https://doi.org/10.1016/J.BSPC.2019.01.024
Hawkins, S., Wang, H., Liu, Y., Garcia, A., Stringfield, O., Krewer, H., … Gillies, R. J. (2016).
Predicting Malignant Nodules from Screening CT Scans. Journal of Thoracic Oncology,
11(12), 2120–2128. https://doi.org/10.1016/j.jtho.2016.07.002
He, K., Girshick, R., & Dollár, P. (2018). Rethinking ImageNet Pre-training. Retrieved from
https://arxiv.org/abs/1811.08883
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
Retrieved from http://arxiv.org/abs/1512.03385
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
https://doi.org/10.1109/CVPR.2016.90
Hertel, L., Barth, E., Käster, T., & Martinetz, T. (2017). Deep Convolutional Neural Networks as
Generic Feature Extractors. Retrieved from http://arxiv.org/abs/1710.02286
Hong, T. H., & Park, I. Y. (2014). MicroRNA expression profiling of diagnostic needle aspirates
from surgical pancreatic cancer specimens. Annals of Surgical Treatment and Research,
87(6), 290. https://doi.org/10.4174/astr.2014.87.6.290
Horn, Z. C., Auret, L., McCoy, J. T., Aldrich, C., & Herbst, B. M. (2017). Performance of
Convolutional Neural Networks for Feature Extraction in Froth Flotation Sensing. IFAC-
PapersOnLine, 50(2), 13–18. https://doi.org/10.1016/J.IFACOL.2017.12.003
Horvat, N., Veeraraghavan, H., Khan, M., Blazic, I., Zheng, J., Capanu, M., … Petkovska, I.
(2018). MR Imaging of Rectal Cancer: Radiomics Analysis to Assess Treatment Response
after Neoadjuvant Therapy. Radiology, 287(3), 833–843.
https://doi.org/10.1148/radiol.2018172300
Hruban, R. H., Canto, M. I., Goggins, M., Schulick, R., & Klein, A. P. (2010). Update on
familial pancreatic cancer. Advances in Surgery, 44.
https://doi.org/10.1016/j.yasu.2010.05.011
Huang, Y.-Q., Liang, C.-H., He, L., Tian, J., Liang, C.-S., Chen, X., … Liu, Z.-Y. (2016).
Development and Validation of a Radiomics Nomogram for Preoperative Prediction of
Lymph Node Metastasis in Colorectal Cancer. Journal of Clinical Oncology : Official
Journal of the American Society of Clinical Oncology, 34(18), 2157–2164.
https://doi.org/10.1200/JCO.2015.65.9128
Huang, Y., Liu, Z., He, L., Chen, X., Pan, D., Ma, Z., … Liang, C. (2016). Radiomics Signature:
A Potential Biomarker for the Prediction of Disease-Free Survival in Early-Stage (I or II)
Non-Small Cell Lung Cancer. Radiology, 281(3), 947–957.
https://doi.org/10.1148/radiol.2016152234
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey
striate cortex. The Journal of Physiology, 195(1), 215–243.
https://doi.org/10.1113/jphysiol.1968.sp008455
Huxley, R., Ansary-Moghaddam, A., Berrington de González, A., Barzi, F., & Woodward, M.
(2005). Type-II diabetes and pancreatic cancer: A meta-analysis of 36 studies. British
Journal of Cancer, 92(11), 2076–2083. https://doi.org/10.1038/sj.bjc.6602619
Huynh, E., Coroller, T. P., Narayan, V., Agrawal, V., Hou, Y., Romano, J., … Aerts, H. J. W. L.
(2016). CT-based radiomic analysis of stereotactic body radiation therapy patients with lung
cancer. Radiotherapy and Oncology, 120(2), 258–266.
https://doi.org/10.1016/j.radonc.2016.05.024
Ilic, M., & Ilic, I. (2016). Epidemiology of pancreatic cancer. World Journal of
Gastroenterology, 22(44), 9694–9705. https://doi.org/10.3748/wjg.v22.i44.9694
Infante, J. R., Matsubayashi, H., Sato, N., Tonascia, J., Klein, A. P., Riall, T. A., … Goggins, M.
(2007). Peritumoral Fibroblast SPARC Expression and Patient Outcome With Resectable
Pancreatic Adenocarcinoma. Journal of Clinical Oncology, 25(3), 319–325.
https://doi.org/10.1200/JCO.2006.07.8824
Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., … Ng, A. Y. (n.d.).
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert
Comparison. Retrieved from www.aaai.org
Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., & Maier-Hein, K. H. (n.d.). Brain
Tumor Segmentation and Radiomics Survival Prediction: Contribution to the BRATS 2017
Challenge. Retrieved from https://arxiv.org/pdf/1802.10508.pdf
Itakura, H., Achrol, A. S., Mitchell, L. A., Loya, J. J., Liu, T., Westbroek, E. M., … Gevaert, O.
(2015). Magnetic resonance image features identify glioblastoma phenotypic subtypes with
distinct molecular pathway activities. Science Translational Medicine, 7(303), 303ra138-
303ra138. https://doi.org/10.1126/scitranslmed.aaa7582
Jiang, X.-T., Tao, H.-Q., & Zou, S.-C. (2004). Detection of serum tumor markers in the
diagnosis and treatment of patients with pancreatic cancer. Hepatobiliary and Pancreatic
Diseases International, 3(3), 464–468.
Junfeng, D., & Yunyang, Y. (2012). The Fast Medical Image Segmentation of Target Region
Based on Improved FM Algorithm. Procedia Engineering, 29, 48–52.
https://doi.org/10.1016/J.PROENG.2011.12.666
Kamisawa, T., Wood, L. D., Itoi, T., & Takaori, K. (2016). Pancreatic cancer. The Lancet,
388(10039), 73–85. https://doi.org/10.1016/S0140-6736(16)00141-0
Kattan, M. W., Hess, K. R., & Beck, J. R. (1998). Experiments to Determine Whether Recursive
Partitioning (CART) or an Artificial Neural Network Overcomes Theoretical Limitations of
Cox Proportional Hazards Regression. Computers and Biomedical Research, 31(5), 363–
373. https://doi.org/10.1006/CBMR.1998.1488
Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T., & Kluger, Y. (2016). DeepSurv:
Personalized Treatment Recommender System Using A Cox Proportional Hazards Deep
Neural Network. https://doi.org/10.1186/s12874-018-0482-1
Kaur, R., & Kaur, J. (2014). Current Methods in Medical Image Segmentation: A Review.
International Conference on Computer Communication and Systems - ICCCS 2014.
Keek, S. A., Leijenaar, R. T., Jochems, A., & Woodruff, H. C. (2018). A review on radiomics
and the future of theranostics for patient selection in precision medicine. The British
Journal of Radiology, 91(1091), 20170926. https://doi.org/10.1259/bjr.20170926
Khalvati, F., Zhang, Y., Baig, S., Lobo-Mueller, E. M., Karanicolas, P., Gallinger, S., & Haider,
M. A. (2019). Prognostic Value of CT Radiomic Features in Resectable Pancreatic Ductal
Adenocarcinoma. Scientific Reports, 9(1), 5449. https://doi.org/10.1038/s41598-019-41728-
7
Khalvati, F., Zhang, Y., Wong, A., & Haider, M. A. (2019). Radiomics. In Encyclopedia of
Biomedical Engineering (Vol. 2, pp. 597–603).
https://doi.org/10.1016/B978-0-12-801238-3.99964-1
Kickingereder, P., Neuberger, U., Bonekamp, D., Piechotta, P. L., Götz, M., Wick, A., …
Bendszus, M. (2018). Radiomic subtyping improves disease stratification beyond key
molecular, clinical, and standard imaging characteristics in patients with glioblastoma.
Neuro-Oncology, 20(6), 848–857. https://doi.org/10.1093/neuonc/nox188
Kim, E., Corte-Real, M., & Baloch, Z. (2016). A deep semantic mobile application for thyroid
cytopathology. In J. Zhang & T. S. Cook (Eds.), Proceedings of SPIE (Vol. 9789, p. 97890A).
International Society for Optics and Photonics. https://doi.org/10.1117/12.2216468
Kishikawa, T. (2015). Circulating RNAs as new biomarkers for detecting pancreatic cancer.
World Journal of Gastroenterology, 21(28), 8527. https://doi.org/10.3748/wjg.v21.i28.8527
Klawikowski, S., Christian, J., Schott, D., Zhang, M., & Li, X. (2016). Development of a CT-
Radiomics Based Early Response Prediction Model During Delivery of Chemoradiation
Therapy for Pancreatic Cancer. Medical Physics, 43(6), 3350–3350.
https://doi.org/10.1118/1.4955675
Kooi, T., Litjens, G., van Ginneken, B., Gubern-Mérida, A., Sánchez, C. I., Mann, R., …
Karssemeijer, N. (2017). Large scale deep learning for computer aided detection of
mammographic lesions. Medical Image Analysis, 35, 303–312. Retrieved from
https://linkinghub.elsevier.com/retrieve/pii/S1361841516301244
Koopmann, J. (2006). Serum Markers in Patients with Resectable Pancreatic Adenocarcinoma:
Macrophage Inhibitory Cytokine 1 versus CA19-9. Clinical Cancer Research, 12(2), 442–
446. https://doi.org/10.1158/1078-0432.CCR-05-0564
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep
Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25
(pp. 1097–1105). Retrieved from http://code.google.com/p/cuda-convnet/
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of
Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05
Kumar, D., Shafiee, M. J., Chung, A. G., Khalvati, F., Haider, M. A., & Wong, A. (2015).
Discovery Radiomics for Pathologically-Proven Computed Tomography Lung Cancer
Prediction. Retrieved from http://arxiv.org/abs/1509.00117
Kumar, V., Gu, Y., Basu, S., Berglund, A., Eschrich, S. A., Schabath, M. B., … Gillies, R. J.
(2013). Radiomics: The Process and the Challenges. Magnetic Resonance Imaging, 30(9),
1234–1248. https://doi.org/10.1016/j.mri.2012.06.010
Kung, J. T. Y., Colognori, D., & Lee, J. T. (2013). Long Noncoding RNAs: Past, Present, and
Future. Genetics, 193(3), 651–669. https://doi.org/10.1534/genetics.112.146704
Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of
Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11
Lagos-Quintana, M. (2001). Identification of Novel Genes Coding for Small Expressed RNAs.
Science, 294(5543), 853–858. https://doi.org/10.1126/science.1064921
Lai, Z., & Deng, H. (2018). Medical Image Classification Based on Deep Features Extracted by
Deep Model and Statistic Feature Fusion with Multilayer Perceptron. Computational
Intelligence and Neuroscience, 2018, 2061516. https://doi.org/10.1155/2018/2061516
Lakhani, P., & Sundaram, B. (2017). Deep Learning at Chest Radiography: Automated
Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks.
Radiology, 284(2), 574–582. https://doi.org/10.1148/radiol.2017162326
Lambin, P., Leijenaar, R. T. H., Deist, T. M., Peerlings, J., de Jong, E. E. C., van Timmeren,
J., … Walsh, S. (2017a). Radiomics: the bridge between medical imaging and personalized
medicine. Nature Reviews Clinical Oncology, 14(12), 749–762.
https://doi.org/10.1038/nrclinonc.2017.141
Lambin, P., Leijenaar, R. T. H., Deist, T. M., Peerlings, J., de Jong, E. E. C., van Timmeren,
J., … Walsh, S. (2017b). Radiomics: the bridge between medical imaging and personalized
medicine. Nature Reviews Clinical Oncology, 14(12), 749–762.
https://doi.org/10.1038/nrclinonc.2017.141
Lambin, P., Rios-Velazquez, E., Leijenaar, R., Carvalho, S., van Stiphout, R. G. P. M., Granton,
P., … Aerts, H. J. W. L. (2012). Radiomics: Extracting more information from medical
images using advanced feature analysis. European Journal of Cancer, 48(4), 441–446.
https://doi.org/10.1016/j.ejca.2011.11.036
Lao, J., Chen, Y., Li, Z.-C., Li, Q., Zhang, J., Liu, J., & Zhai, G. (2017). A Deep Learning-Based
Radiomics Model for Prediction of Survival in Glioblastoma Multiforme. Scientific Reports,
7(1), 10353. https://doi.org/10.1038/s41598-017-10649-8
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
https://doi.org/10.1038/nature14539
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel,
L. D. (1990). Handwritten Digit Recognition with a Back-Propagation Network. Retrieved
from https://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
https://doi.org/10.1109/5.726791
Leo, C. S., Lim, C. C. T., & Suneetha, V. (2009). An Automated Segmentation Algorithm for
Medical Images. In 13th International Conference on Biomedical Engineering (pp. 109–
111). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-
92841-6_27
Li, H., Zhu, Y., Burnside, E. S., Drukker, K., Hoadley, K. A., Fan, C., … Giger, M. L. (2016).
MR Imaging Radiomics Signatures for Predicting the Risk of Breast Cancer Recurrence as
Given by Research Versions of MammaPrint, Oncotype DX, and PAM50 Gene Assays.
Radiology, 281(2), 382–391. https://doi.org/10.1148/radiol.2016152110
Li, Yiming, Liu, X., Qian, Z., Sun, Z., Xu, K., Wang, K., … Jiang, T. (2018). Genotype
prediction of ATRX mutation in lower-grade gliomas using an MRI radiomics signature.
European Radiology, 28(7), 2960–2968. https://doi.org/10.1007/s00330-017-5267-0
Li, Yiming, Qian, Z., Xu, K., Wang, K., Fan, X., Li, S., … Wang, Y. (2018). MRI features
predict p53 status in lower-grade gliomas via a machine-learning approach. NeuroImage:
Clinical, 17, 306–311. https://doi.org/10.1016/j.nicl.2017.10.030
Li, Yuexiang, & Shen, L. (2018). Skin Lesion Analysis towards Melanoma Detection Using
Deep Learning Network. Sensors, 18(2), 556. https://doi.org/10.3390/s18020556
Liao, Q., Zhao, Y.-P., Yang, Y.-C., Li, L.-J., Long, X., & Han, S.-M. (2007). Combined
detection of serum tumor markers for differential diagnosis of solid lesions located at the
pancreatic head. Hepatobiliary and Pancreatic Diseases International, 6(6), 641–645.
Link, A., Becker, V., Goel, A., Wex, T., & Malfertheiner, P. (2012). Feasibility of Fecal
MicroRNAs as Novel Biomarkers for Pancreatic Cancer. PLoS ONE, 7(8), e42933.
https://doi.org/10.1371/journal.pone.0042933
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., … Sánchez,
C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis,
42, 60–88. https://doi.org/10.1016/j.media.2017.07.005
Liu, B., Wei, Y., Zhang, Y., Yang, Q., & Kong, H. (2017). Deep Neural Networks for High
Dimension, Low Sample Size Data. Retrieved from
https://www.ijcai.org/proceedings/2017/0318.pdf
Liu, D., Chang, C.-H., Gold, D. V., & Goldenberg, D. M. (2015). Identification of PAM4
(clivatuzumab)-reactive epitope on MUC5AC: A promising biomarker and therapeutic
target for pancreatic cancer. Oncotarget, 6(6). https://doi.org/10.18632/oncotarget.2760
Liu, H., Li, B., Lv, X., & Huang, Y. (2017). Image Retrieval Using Fused Deep Convolutional
Features. Procedia Computer Science, 107, 749–754.
https://doi.org/10.1016/j.procs.2017.03.159
Liu, Y., Balagurunathan, Y., Atwater, T., Antic, S., Li, Q., Walker, R. C., … Gillies, R. J.
(2017). Radiological Image Traits Predictive of Cancer Status in Pulmonary Nodules.
Clinical Cancer Research, 23(6), 1442–1449. https://doi.org/10.1158/1078-0432.CCR-15-
3102
Lo, S.-C. B., Lou, S.-L. A., Lin, J.-S., Freedman, M. T., Chien, M. V., & Mun, S. K.
(1995). Artificial convolution neural network techniques and applications for lung nodule
detection. IEEE Transactions on Medical Imaging, 14(4), 711–718.
https://doi.org/10.1109/42.476112
Loosen, S. H., Neumann, U. P., Trautwein, C., Roderburg, C., & Luedde, T. (2017). Current and
future biomarkers for pancreatic adenocarcinoma. Tumor Biology, 39(6),
101042831769223. https://doi.org/10.1177/1010428317692231
Louvet, C., Labianca, R., Hammel, P., Lledo, G., Zampino, M. G., André, T., … de Gramont, A.
(2005). Gemcitabine in Combination With Oxaliplatin Compared With Gemcitabine Alone
in Locally Advanced or Metastatic Pancreatic Cancer: Results of a GERCOR and GISCAD
Phase III Trial. Journal of Clinical Oncology, 23(15), 3509–3516.
https://doi.org/10.1200/JCO.2005.06.023
Luo, G., Jin, K., Guo, M., Cheng, H., Liu, Z., Xiao, Z., … Yu, X. (2017). Patients with normal-
range CA19-9 levels represent a distinct subgroup of pancreatic cancer patients. Oncology
Letters, 13(2), 881. https://doi.org/10.3892/OL.2016.5501
Luo, J., Xiao, L., Wu, C., Zheng, Y., & Zhao, N. (2013). The Incidence and Survival Rate of
Population-Based Pancreatic Cancer Patients: Shanghai Cancer Registry 2004-2009. PLoS
ONE, 8(10), e76052. https://doi.org/10.1371/journal.pone.0076052
Lynch, S. M., Vrieling, A., Lubin, J. H., Kraft, P., Mendelsohn, J. B., Hartge, P., … Stolzenberg-
Solomon, R. Z. (2009). Cigarette Smoking and Pancreatic Cancer: A Pooled Analysis From
the Pancreatic Cancer Cohort Consortium. American Journal of Epidemiology, 170(4), 403–
413. https://doi.org/10.1093/aje/kwp134
Maas, M., Nelemans, P. J., Valentini, V., Das, P., Rödel, C., Kuo, L.-J., … Beets, G. L. (2010).
Long-term outcome in patients with a pathological complete response after chemoradiation
for rectal cancer: a pooled analysis of individual patient data. The Lancet Oncology, 11(9),
835–844. https://doi.org/10.1016/S1470-2045(10)70172-8
Mangai, U., Samanta, S., Das, S., & Chowdhury, P. (2010). A Survey of Decision Fusion and
Feature Fusion Strategies for Pattern Classification. IETE Technical Review, 27(4), 293.
https://doi.org/10.4103/0256-4602.64604
Marechal, R., Mackey, J. R., Lai, R., Demetter, P., Peeters, M., Polus, M., … Van Laethem, J.-L.
(2009). Human Equilibrative Nucleoside Transporter 1 and Human Concentrative
Nucleoside Transporter 3 Predict Survival after Adjuvant Gemcitabine Therapy in Resected
Pancreatic Adenocarcinoma. Clinical Cancer Research, 15(8), 2913–2919.
https://doi.org/10.1158/1078-0432.CCR-08-2080
Mariani, L., Coradini, D., Biganzoli, E., Boracchi, P., Marubini, E., Pilotti, S., … Rilke, F.
(1997). Prognostic factors for metachronous contralateral breast cancer: a comparison of the
linear Cox regression model and its artificial neural network extension. Breast Cancer
Research and Treatment, 44(2), 167–178. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/9232275
Gamer, M., Lemon, J., & Fellows, I. (2015). Package "irr." Retrieved from
http://www.r-project.org
Mazurowski, M. A. (2015). Radiogenomics: What It Is and Why It Is Important. Journal of the
American College of Radiology, 12(8), 862–866. https://doi.org/10.1016/j.jacr.2015.04.019
McGuigan, A., Kelly, P., Turkington, R. C., Jones, C., Coleman, H. G., & McCain, R. S. (2018).
Pancreatic cancer: A review of clinical diagnosis, epidemiology, treatment and outcomes.
World Journal of Gastroenterology, 24(43), 4846–4861.
https://doi.org/10.3748/wjg.v24.i43.4846
Memba, R., Duggan, S. N., Ni Chonchubhair, H. M., Griffin, O. M., Bashir, Y., O’Connor, D.
B., … Conlon, K. C. (2017). The potential role of gut microbiota in pancreatic disease: A
systematic review. Pancreatology, 17(6), 867–874.
https://doi.org/10.1016/j.pan.2017.09.002
Men, K., Dai, J., & Li, Y. (2017). Automatic segmentation of the clinical target volume and
organs at risk in the planning CT for rectal cancer using deep dilated convolutional neural
networks. Medical Physics, 44(12), 6377–6389. https://doi.org/10.1002/mp.12602
Menegola, A., Fornaciali, M., Pires, R., Avila, S., & Valle, E. (2016). Towards Automated
Melanoma Screening: Exploring Transfer Learning Schemes. Retrieved from
https://lasagne.readthedocs.io/en/latest/
Meng, Y., Zhang, Y., Dong, D., Li, C., Liang, X., Zhang, C., … Zhang, H. (2018). Novel
radiomic signature as a prognostic biomarker for locally advanced rectal cancer. Journal of
Magnetic Resonance Imaging, 48(3), 605–614. https://doi.org/10.1002/jmri.25968
Midha, S., Chawla, S., & Garg, P. K. (2016). Modifiable and non-modifiable risk factors for
pancreatic cancer: A review. Cancer Letters, 381(1), 269–277.
https://doi.org/10.1016/j.canlet.2016.07.022
Morin, O. (2018). A Deep Look Into the Future of Quantitative Imaging in Oncology: A
Statement of Working Principles and Proposal for Change. International Journal of
Radiation Oncology*Biology*Physics, 102(4), 1074–1082.
https://doi.org/10.1016/J.IJROBP.2018.08.032
Mukaka, M. M. (2012). Statistics corner: A guide to appropriate use of correlation coefficient in
medical research. Malawi Medical Journal : The Journal of Medical Association of Malawi,
24(3), 69–71. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/23638278
Nguyen, K., Haytmyradov, M., Mostafavi, H., Patel, R., Surucu, M., Block, A., … Roeske, J. C.
(2018). Evaluation of Radiomics to Predict the Accuracy of Markerless Motion Tracking of
Lung Tumors: A Preliminary Study. Frontiers in Oncology, 8, 292.
https://doi.org/10.3389/fonc.2018.00292
Nie, K., Shi, L., Chen, Q., Hu, X., Jabbour, S. K., Yue, N., … Sun, X. (2016). Rectal Cancer:
Assessment of Neoadjuvant Chemoradiation Outcome based on Radiomics of
Multiparametric MRI. Clinical Cancer Research, 22(21), 5256–5264.
https://doi.org/10.1158/1078-0432.CCR-15-2997
Nikolov, S., Blackwell, S., Mendes, R., De Fauw, J., Meyer, C., Hughes, C., … Ronneberger, O.
(2018). Deep learning to achieve clinically applicable segmentation of head and neck
anatomy for radiotherapy. Retrieved from http://arxiv.org/abs/1809.04430
Oda, M., Shimizu, N., Oda, H., Hayashi, Y., Kitasaka, T., Fujiwara, M., … Roth, H. R. (2018).
Towards dense volumetric pancreas segmentation in CT using 3D fully convolutional
networks. In E. D. Angelini & B. A. Landman (Eds.), Medical Imaging 2018: Image
Processing (Vol. 10574, p. 10). SPIE. https://doi.org/10.1117/12.2293499
Oda, M., Shimizu, N., Roth, H. R., Karasawa, K., Kitasaka, T., Misawa, K., … Mori, K.
(2018). 3D FCN Feature Driven Regression Forest-Based Pancreas Localization and
Segmentation. Retrieved from https://arxiv.org/pdf/1806.03019.pdf
Oikonomou, A., Khalvati, F., et al. (2018). Radiomics analysis at PET/CT contributes to
prognosis of recurrence and survival in lung cancer treated with stereotactic body
radiotherapy. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-22357-y
Oken, M. M., Creech, R. H., Tormey, D. C., Horton, J., Davis, T. E., McFadden, E. T., &
Carbone, P. P. (1982). Toxicity and response criteria of the Eastern Cooperative Oncology
Group. American Journal of Clinical Oncology, 5(6), 649–655. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/7165009
Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., … Rueckert, D.
(n.d.). Attention U-Net: Learning Where to Look for the Pancreas. Retrieved from
https://arxiv.org/pdf/1804.03999.pdf
Owens, C. A., Peterson, C. B., Tang, C., Koay, E. J., Yu, W., Mackin, D. S., … Yang, J. (2018).
Lung tumor segmentation methods: Impact on the uncertainty of radiomics features for non-
small cell lung cancer. PloS One, 13(10), e0205003.
https://doi.org/10.1371/journal.pone.0205003
Pan, S. J., & Yang, Q. (2009). A Survey on Transfer Learning.
https://doi.org/10.1109/TKDE.2009.191
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22(10), 1345–1359. https://doi.org/10.1109/TKDE.2009.191
Papp, L., Pötsch, N., Grahovac, M., Schmidbauer, V., Woehrer, A., Preusser, M., …
Traub-Weidinger, T. (2018). Glioma Survival Prediction with Combined Analysis of In Vivo
11C-MET PET Features, Ex Vivo Features, and Patient Features by Supervised Machine
Learning. Journal of Nuclear Medicine, 59(6), 892–899.
https://doi.org/10.2967/jnumed.117.202267
Parekh, V., & Jacobs, M. A. (2016). Radiomics: a new application from established techniques.
Expert Review of Precision Medicine and Drug Development, 1(2), 207–226.
https://doi.org/10.1080/23808993.2016.1164013
Parmar, C., Grossmann, P., Bussink, J., Lambin, P., & Aerts, H. J. W. L. (2015). Machine
Learning methods for Quantitative Radiomic Biomarkers. Scientific Reports, 5, 13087.
https://doi.org/10.1038/srep13087
Parmar, C., Leijenaar, R. T. H., Grossmann, P., Rios Velazquez, E., Bussink, J., Rietveld, D., …
Aerts, H. J. W. L. (2015). Radiomic feature clusters and Prognostic Signatures specific for
Lung and Head & Neck cancer. Scientific Reports, 5(1), 1–10.
https://doi.org/10.1038/srep11044
Peixoto, R. D., Speers, C., McGahan, C. E., Renouf, D. J., Schaeffer, D. F., & Kennecke, H. F.
(2015). Prognostic factors and sites of metastasis in unresectable locally advanced
pancreatic cancer. Cancer Medicine, 4(8), 1171–1177. https://doi.org/10.1002/cam4.459
Pérez-Beteta, J., Molina-García, D., Ortiz-Alhambra, J. A., Fernández-Romero, A., Luque, B.,
Arregui, E., … Pérez-García, V. M. (2018). Tumor Surface Regularity at MR Imaging
Predicts Survival and Response to Surgery in Patients with Glioblastoma. Radiology,
288(1), 218–225. https://doi.org/10.1148/radiol.2018171051
Perkins, G. L., Slater, E. D., Sanders, G. K., & Prichard, J. G. (2003). Serum tumor markers.
American Family Physician, 68(6), 1075–1082.
Permuth-Wey, J., & Egan, K. M. (2009). Family history is a significant risk factor for pancreatic
cancer: Results from a systematic review and meta-analysis. Familial Cancer, 8(2), 109–
117. https://doi.org/10.1007/s10689-008-9214-8
Pernick, N. L., Sarkar, F. H., Philip, P. A., Arlauskas, P., Shields, A. F., Vaitkevicius, V. K., …
Adsay, N. V. (2003). Clinicopathologic Analysis of Pancreatic Adenocarcinoma in African
Americans and Caucasians. Pancreas, 26(1), 28–32. https://doi.org/10.1097/00006676-
200301000-00006
Pratt, H., Coenen, F., Broadbent, D. M., Harding, S. P., & Zheng, Y. (2016). Convolutional
Neural Networks for Diabetic Retinopathy. Procedia Computer Science, 90, 200–205.
https://doi.org/10.1016/J.PROCS.2016.07.014
Ravishankar, H., Sudhakar, P., Venkataramani, R., Thiruvenkadam, S., Annangi, P., Babu, N., &
Vaidya, V. (2016). Understanding the Mechanisms of Deep Transfer Learning for Medical
Images. https://doi.org/10.1007/978-3-319-46976-8_20
Razzak, M. I., Naz, S., & Zaib, A. (n.d.). Deep Learning for Medical Image Processing:
Overview, Challenges and Future. Retrieved from https://arxiv.org/pdf/1704.06825.pdf
Ren, J., Tian, J., Yuan, Y., Dong, D., Li, X., Shi, Y., & Tao, X. (2018). Magnetic resonance
imaging based radiomics signature for the preoperative discrimination of stage I-II and III-
IV head and neck squamous cell carcinoma. European Journal of Radiology, 106, 1–6.
https://doi.org/10.1016/j.ejrad.2018.07.002
Rohrmann, S., Linseisen, J., Vrieling, A., Boffetta, P., Stolzenberg-Solomon, R. Z., Lowenfels,
A. B., … Bueno-de-Mesquita, H. B. (2009). Ethanol intake and the risk of pancreatic cancer
in the European prospective investigation into cancer and nutrition (EPIC). Cancer Causes
& Control, 20(5), 785–794. https://doi.org/10.1007/s10552-008-9293-8
Rokach, L. (2005). Ensemble Methods for Classifiers. In Data Mining and Knowledge Discovery
Handbook (pp. 957–980). New York: Springer-Verlag. https://doi.org/10.1007/0-387-
25465-X_45
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical
Image Segmentation. https://doi.org/10.1007/978-3-319-24574-4_28
Rosenfeld, N., Aharonov, R., Meiri, E., Rosenwald, S., Spector, Y., Zepeniuk, M., … Barshack,
I. (2008). MicroRNAs accurately identify cancer tissue origin. Nature Biotechnology, 26(4),
462–469. https://doi.org/10.1038/nbt1392
Roth, H. R., Oda, H., Hayashi, Y., Oda, M., Shimizu, N., Fujiwara, M., … (2017). Hierarchical
3D fully convolutional networks for multi-organ segmentation. Retrieved from
http://lmb.informatik.uni-freiburg.de/resources/opensource/unet.en.html
Sadakari, Y., Ohtsuka, T., Ohuchida, K., Tsutsumi, K., Takahata, S., Nakamura, M., … Tanaka,
M. (2010). MicroRNA expression analyses in preoperative pancreatic juice samples of
pancreatic ductal adenocarcinoma. Journal of the Pancreas, 11(6), 587–592.
Sanduleanu, S., Woodruff, H. C., de Jong, E. E. C., van Timmeren, J. E., Jochems, A., Dubois,
L., & Lambin, P. (2018, June 1). Tracking tumor biology with radiomics: A systematic
review utilizing a radiomics quality score. Radiotherapy and Oncology. Elsevier.
https://doi.org/10.1016/j.radonc.2018.03.033
Sanghera, P., Wong, D. W. Y., McConkey, C. C., Geh, J. I., & Hartley, A. (2008).
Chemoradiotherapy for Rectal Cancer: An Updated Analysis of Factors Affecting
Pathological Response. Clinical Oncology, 20(2), 176–183.
https://doi.org/10.1016/j.clon.2007.11.013
Sargent, D. J. (2001). Comparison of artificial neural networks with other statistical approaches:
results from medical data sets. Cancer, 91(8 Suppl), 1636–1642. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/11309761
Satake, K., Kanazawa, G., Kho, I., Chung, Y.-S., & Umeyama, K. (1985). Evaluation of Serum
Pancreatic Enzymes, Carbohydrate Antigen 19-9, and Carcinoembryonic Antigen in
Various Pancreatic Diseases. The American Journal of Gastroenterology, 80(8), 630–636.
https://doi.org/10.1111/j.1572-0241.1985.tb02191.x
Schmid, M., Wright, M. N., & Ziegler, A. (2016). On the use of Harrell’s C for clinical risk
prediction via random survival forests.
Schmidhuber, J. (2014). Deep Learning in Neural Networks: An Overview. Retrieved from
http://www.idsia.ch/~juergen/DeepLearning8Oct2014.tex
Schultz, N. A., Werner, J., Willenbrock, H., Roslind, A., Giese, N., Horn, T., … Johansen, J. S.
(2012). MicroRNA expression profiles associated with pancreatic adenocarcinoma and
ampullary adenocarcinoma. Modern Pathology, 25(12), 1609–1622.
https://doi.org/10.1038/modpathol.2012.122
Sebastiani, V. (2006). Immunohistochemical and Genetic Evaluation of Deoxycytidine Kinase in
Pancreatic Cancer: Relationship to Molecular Mechanisms of Gemcitabine Resistance and
Survival. Clinical Cancer Research, 12(8), 2492–2497. https://doi.org/10.1158/1078-
0432.CCR-05-2655
Sedgwick, P. (2012). Pearson’s correlation coefficient. BMJ, 345(jul04 1), e4483–e4483.
https://doi.org/10.1136/bmj.e4483
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2016). Grad-
CAM: Visual Explanations from Deep Networks via Gradient-based Localization.
Retrieved from http://arxiv.org/abs/1610.02391
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-
CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017
IEEE International Conference on Computer Vision (ICCV) (pp. 618–626). IEEE.
https://doi.org/10.1109/ICCV.2017.74
Sharma, N., Ray, A., Shukla, K., Sharma, S., Pradhan, S., Srivastva, A., & Aggarwal, L. (2010).
Automated medical image segmentation techniques. Journal of Medical Physics, 35(1), 3.
https://doi.org/10.4103/0971-6203.58777
Siegel, R. L., Miller, K. D., & Jemal, A. (2015). Cancer statistics, 2015. CA: A Cancer Journal
for Clinicians, 65(1), 5–29. https://doi.org/10.3322/caac.21254
Siegel, R. L., Miller, K. D., Jemal, A., Rahib, L., Smith, B. D., Aizenberg, R., … Smith-Warner,
S. A. (2009). Trends in pancreatic adenocarcinoma incidence and mortality in the United
States in the last four decades: A SEER-based study. Cancer Epidemiology Biomarkers and
Prevention, 18(1), 742–746. https://doi.org/10.1097/00006676-200301000-00006
Silverman, D. T., Hoover, R. N., Brown, L. M., Swanson, G. M., Schiffman, M., Greenberg, R.
S., … Fraumeni, J. F. (2003). Why Do Black Americans Have a Higher Risk of Pancreatic
Cancer than White Americans? Epidemiology, 14(1), 45–54.
https://doi.org/10.1097/00001648-200301000-00013
Sørensen, J., Klee, M., Palshof, T., & Hansen, H. (1993). Performance status assessment in
cancer patients. An inter-observer variability study. British Journal of Cancer, 67(4), 773–
775. https://doi.org/10.1038/bjc.1993.140
Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia Medica, 24(1), 12–
18. https://doi.org/10.11613/BM.2014.003
Spratlin, J. (2004). The Absence of Human Equilibrative Nucleoside Transporter 1 Is Associated
with Reduced Survival in Patients With Gemcitabine-Treated Pancreas Adenocarcinoma.
Clinical Cancer Research, 10(20), 6956–6961. https://doi.org/10.1158/1078-0432.CCR-04-
0224
Stark, A. P., Sacks, G. D., Rochefort, M. M., Donahue, T. R., Reber, H. A., Tomlinson, J. S., …
Hines, O. J. (2016). Long-term survival in patients with pancreatic ductal adenocarcinoma.
Surgery, 159(6), 1520–1527. https://doi.org/10.1016/j.surg.2015.12.024
Steinberg, W. (1990). The clinical utility of the CA 19-9 tumor-associated antigen. American
Journal of Gastroenterology, 85(4), 350–355.
Stevens, R. J., Roddam, A. W., & Beral, V. (2007). Pancreatic cancer in type 1 and young-onset
diabetes: Systematic review and meta-analysis. British Journal of Cancer, 96(3), 507–509.
https://doi.org/10.1038/sj.bjc.6603571
Sun, Q.-S., Zeng, S.-G., Liu, Y., Heng, P.-A., & Xia, D.-S. (2005). A new method of feature
fusion and its application in image recognition. Pattern Recognition, 38(12), 2437–2448.
https://doi.org/10.1016/J.PATCOG.2004.12.013
Szafranska, A. E., Davison, T. S., John, J., Cannon, T., Sipos, B., Maghnouj, A., … Hahn, S. A.
(2007). MicroRNA expression alterations are linked to tumorigenesis and non-neoplastic
processes in pancreatic ductal adenocarcinoma. Oncogene, 26(30), 4442–4452.
https://doi.org/10.1038/sj.onc.1210228
Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T., Kendall, C. B., Gotway, M. B., & Liang,
J. (2016). Convolutional Neural Networks for Medical Image Analysis: Full Training or
Fine Tuning? IEEE Transactions on Medical Imaging, 35(5), 1299–1312.
https://doi.org/10.1109/TMI.2016.2535302
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., & Liu, C. (2018). A Survey on Deep Transfer
Learning. Retrieved from http://arxiv.org/abs/1808.01974
Therneau, T. M. (2018). Package “survival.” Retrieved from
https://github.com/therneau/survival
Thomaz, R. L., Carneiro, P. C., & Patrocinio, A. C. (2017). Feature extraction using
convolutional neural network for classifying breast density in mammographic images. In S.
G. Armato & N. A. Petrick (Eds.) (Vol. 10134, p. 101342M). International Society for
Optics and Photonics. https://doi.org/10.1117/12.2254633
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in
Medicine, 16(4), 385–395. https://doi.org/10.1002/(SICI)1097-
0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature
ranking and solutions. Bioinformatics, 27(14), 1986–1994.
https://doi.org/10.1093/bioinformatics/btr300
Torrey, L., & Shavlik, J. (n.d.). Transfer Learning. Retrieved from
http://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf
Traverso, A., Wee, L., Dekker, A., & Gillies, R. (2018). Repeatability and Reproducibility of
Radiomic Features: A Systematic Review. International Journal of Radiation
Oncology*Biology*Physics, 102(4), 1143–1158.
https://doi.org/10.1016/J.IJROBP.2018.05.053
Urruticoechea, A., Alemany, R., Balart, J., Villanueva, A., Viñals, F., & Capellá, G. (2010).
Recent advances in cancer therapy: an overview. Current Pharmaceutical Design, 16(1), 3–
10. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/20214614
van Griethuysen, J. J. M., Fedorov, A., Parmar, C., Hosny, A., Aucoin, N., Narayan, V., …
Aerts, H. J. W. L. (2017). Computational Radiomics System to Decode the Radiographic
Phenotype. Cancer Research, 77(21), e104–e107. https://doi.org/10.1158/0008-5472.CAN-
17-0339
van Rossum, P. S. N., Fried, D. V., Zhang, L., Hofstetter, W. L., van Vulpen, M., Meijer, G.
J., … Lin, S. H. (2016). The Incremental Value of Subjective and Quantitative Assessment
of 18F-FDG PET for the Prediction of Pathologic Complete Response to Preoperative
Chemoradiotherapy in Esophageal Cancer. Journal of Nuclear Medicine, 57(5), 691–700.
https://doi.org/10.2967/jnumed.115.163766
Von Rosen, A., Linder, S., Harmenberg, U., & Pegert, S. (1993). Serum levels of CA 19-9 and
CA 50 in relation to lewis blood cell status in patients with malignant and benign pancreatic
disease. Pancreas, 8(2), 160–165.
Waddell, N., Pajic, M., Patch, A.-M., Chang, D. K., Kassahn, K. S., Bailey, P., … Grimmond, S.
M. (2015). Whole genomes redefine the mutational landscape of pancreatic cancer. Nature,
518(7540), 495–501. https://doi.org/10.1038/nature14169
Wahi, M. M., Shah, N., Schrock, C. E., Rosemurgy, A. S., & Goldin, S. B. (2009). Reproductive
Factors and Risk of Pancreatic Cancer in Women: A Review of the Literature. Annals of
Epidemiology, 19(2), 103–111. https://doi.org/10.1016/j.annepidem.2008.11.003
Wang, C.-S., Lin, K.-H., Chen, S.-L., Chan, Y.-F., & Hsueh, S. (2004). Overexpression of
SPARC gene in human gastric carcinoma and its clinic–pathologic significance. British
Journal of Cancer, 91(11), 1924–1930. https://doi.org/10.1038/sj.bjc.6602213
Wang, H., Guo, X.-H., Jia, Z.-W., Li, H.-K., Liang, Z.-G., Li, K.-C., & He, Q. (2010). Multilevel
binomial logistic prediction model for malignant pulmonary nodules based on texture
features of CT image. European Journal of Radiology, 74(1), 124–129.
https://doi.org/10.1016/j.ejrad.2009.01.024
Wang, X., Li, Y., Tian, H., Qi, J., Li, M., Fu, C., … Zhang, W. (2014). Macrophage inhibitory
cytokine 1 (MIC-1/GDF15) as a novel diagnostic serum biomarker in pancreatic ductal
adenocarcinoma. BMC Cancer, 14(1), 578. https://doi.org/10.1186/1471-2407-14-578
Watkins, G., Douglas-Jones, A., Bryce, R., E Mansel, R., & Jiang, W. G. (2005). Increased
levels of SPARC (osteonectin) in human breast cancer tissues and its association with
clinical outcomes. Prostaglandins, Leukotrienes and Essential Fatty Acids, 72(4), 267–272.
https://doi.org/10.1016/j.plefa.2004.12.003
Wolpin, B. M., Chan, A. T., Hartge, P., Chanock, S. J., Kraft, P., Hunter, D. J., … Fuchs, C. S.
(2009). ABO Blood Group and the Risk of Pancreatic Cancer. JNCI Journal of the National
Cancer Institute, 101(6), 424–431. https://doi.org/10.1093/jnci/djp020
Wolpin, Brian M., Kraft, P., Gross, M., Helzlsouer, K., Bueno-de-Mesquita, H. B., Steplowski,
E., … Fuchs, C. S. (2010). Pancreatic Cancer Risk and ABO Blood Group Alleles: Results
from the Pancreatic Cancer Cohort Consortium. Cancer Research, 70(3), 1015–1023.
https://doi.org/10.1158/0008-5472.CAN-09-2993
Wong, M. C. S., Jiang, J. Y., Liang, M., Fang, Y., Yeung, M. S., & Sung, J. J. Y. (2017). Global
temporal patterns of pancreatic cancer and association with socioeconomic development.
Scientific Reports, 7(1), 3165. https://doi.org/10.1038/s41598-017-02997-2
Wood, H. E., Gupta, S., Kang, J. Y., Quinn, M. J., Maxwell, J. D., Mudan, S., & Majeed, A.
(2006). Pancreatic cancer in England and Wales 1975–2000: patterns and trends in
incidence, survival and mortality. Alimentary Pharmacology and Therapeutics,
23(8), 1205–1214. https://doi.org/10.1111/j.1365-2036.2006.02860.x
Wu, J., Aguilera, T., Shultz, D., Gudur, M., Rubin, D. L., Loo, B. W., … Li, R. (2016). Early-
Stage Non–Small Cell Lung Cancer: Quantitative Imaging Characteristics of 18F
Fluorodeoxyglucose PET/CT Allow Prediction of Distant Metastasis. Radiology, 281(1),
270–278. https://doi.org/10.1148/radiol.2016151829
Wu, X., Lu, X. H., Xu, T., Qian, J. M., Zhao, P., Guo, X. Z., … Jiang, W. J. (2006).
Evaluation of the diagnostic value of serum tumor markers, and fecal k-ras and p53 gene
mutations for pancreatic cancer. Chinese Journal of Digestive Diseases, 7(3), 170–174.
https://doi.org/10.1111/j.1443-9573.2006.00263.x
Xi, Y., Guo, F., Xu, Z., Li, C., Wei, W., Tian, P., … Yin, H. (2018). Radiomics signature: A
potential biomarker for the prediction of MGMT promoter methylation in glioblastoma.
Journal of Magnetic Resonance Imaging, 47(5), 1380–1387.
https://doi.org/10.1002/jmri.25860
Xiang, A., Lapuerta, P., Ryutov, A., Buckley, J., & Azen, S. (2000). Comparison of the
performance of neural network methods and Cox regression for censored survival data.
Computational Statistics & Data Analysis, 34(2), 243–257. https://doi.org/10.1016/S0167-
9473(99)00098-5
Yamada, R., Mizuno, S., Uchida, K., Yoneda, M., Kanayama, K., Inoue, H., … Isaji, S. (2016).
Human Equilibrative Nucleoside Transporter 1 Expression in Endoscopic Ultrasonography-
Guided Fine-Needle Aspiration Biopsy Samples Is a Strong Predictor of Clinical Response
and Survival in the Patients With Pancreatic Ductal Adenocarcinoma Undergoing
Gemcitabine-Based Chemoradiotherapy. Pancreas, 45(5), 761–771.
https://doi.org/10.1097/MPA.0000000000000597
Yamashita, K., Upadhay, S., Mimori, K., Inoue, H., & Mori, M. (2003). Clinical significance of
secreted protein acidic and rich in cystein in esophageal carcinoma and its relation to
carcinoma progression. Cancer, 97(10), 2412–2419. https://doi.org/10.1002/cncr.11368
Yamashita, R., Nishio, M., Do, R. K. G., & Togashi, K. (2018). Convolutional neural networks:
an overview and application in radiology. Insights into Imaging, 9(4), 611–629.
https://doi.org/10.1007/s13244-018-0639-9
Yang, J.-Y., Sun, Y.-W., Liu, D.-J., Zhang, J.-F., Li, J., & Hua, R. (2014). MicroRNAs in stool
samples as potential screening biomarkers for pancreatic ductal adenocarcinoma cancer.
American Journal of Cancer Research, 4(6), 663–673.
Yang, L., Dong, D., Fang, M., Zhu, Y., Zang, Y., Liu, Z., … Tian, J. (2018). Can CT-based
radiomics signature predict KRAS/NRAS/BRAF mutations in colorectal cancer? European
Radiology, 28(5), 2058–2067. https://doi.org/10.1007/s00330-017-5146-8
Yang, P., Yang, Y. H., Zhou, B. B., & Zomaya, A. Y. (n.d.). A review of ensemble methods in
bioinformatics: Including stability of feature selection and ensemble feature selection
methods (updated on 28 Sep. 2016). Retrieved from
http://www.maths.usyd.edu.au/u/pengyi/publication/EnsembleBioinformatics-v6.pdf
Yasaka, K., Akai, H., Abe, O., & Kiryu, S. (2018). Deep Learning with Convolutional Neural
Network for Differentiation of Liver Masses at Dynamic Contrast-enhanced CT: A
Preliminary Study. Radiology, 286(3), 887–896. https://doi.org/10.1148/radiol.2017170706
Yip, S. S. F., & Aerts, H. J. W. L. (2016). Applications and limitations of radiomics. Physics in
Medicine and Biology, 61(13), R150-66. https://doi.org/10.1088/0031-9155/61/13/R150
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep
neural networks? Retrieved from http://arxiv.org/abs/1411.1792
Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks.
https://doi.org/10.1007/978-3-319-10590-1_53
Zhang, B., Tian, J., Dong, D., Gu, D., Dong, Y., Zhang, L., … Zhang, S. (2017a). Radiomics
Features of Multiparametric MRI as Novel Prognostic Factors in Advanced Nasopharyngeal
Carcinoma. Clinical Cancer Research, 23(15), 4259–4269. https://doi.org/10.1158/1078-
0432.CCR-16-2910
Zhang, B., Tian, J., Dong, D., Gu, D., Dong, Y., Zhang, L., … Zhang, S. (2017b). Radiomics
Features of Multiparametric MRI as Novel Prognostic Factors in Advanced Nasopharyngeal
Carcinoma. Clinical Cancer Research, 23(15), 4259–4269. https://doi.org/10.1158/1078-
0432.CCR-16-2910
Zhang, Junjie, Baig, S., Wong, A., Haider, M. A., & Khalvati, F. (2016). A Local ROI-specific
Atlas-based Segmentation of Prostate Gland and Transitional Zone in Diffusion MRI.
Journal of Computational Vision and Imaging Systems.
Zhang, Y, Yang, J., Li, H., Wu, Y., Zhang, H., & Chen, W. (2015). Tumor markers CA19-9,
CA242 and CEA in the diagnosis of pancreatic cancer: A meta-analysis. International
Journal of Clinical and Experimental Medicine, 8(7), 11683–11691.
Zhang, Yucheng, Oikonomou, A., Wong, A., Haider, M. A., & Khalvati, F. (2017). Radiomics-
based Prognosis Analysis for Non-Small Cell Lung Cancer. Scientific Reports,
7, 46349. https://doi.org/10.1038/srep46349
Zhao, B., Tan, Y., Tsai, W.-Y., Qi, J., Xie, C., Lu, L., & Schwartz, L. H. (2016). Reproducibility
of radiomics for deciphering tumor phenotype with imaging. Scientific Reports, 6, 23428.
https://doi.org/10.1038/srep23428
Zhao, X., Wu, Y., Song, G., Li, Z., Zhang, Y., & Fan, Y. (2018). A deep learning model
integrating FCNNs and CRFs for brain tumor segmentation. Medical Image Analysis, 43,
98–111. https://doi.org/10.1016/j.media.2017.10.002
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning Deep Features
for Discriminative Localization. Retrieved from http://cnnlocalization.csail.mit.edu
Zhou, H., Dong, D., Chen, B., Fang, M., Cheng, Y., Gan, Y., … Tian, J. (2018). Diagnosis of
Distant Metastasis of Lung Cancer: Based on Clinical and Radiomic Features. Translational
Oncology, 11(1), 31–36. https://doi.org/10.1016/j.tranon.2017.10.010
Zhou, Z., Chen, L., Sher, D., Zhang, Q., Shah, J., Pham, N.-L., … Wang, J. (2018). Predicting
Lymph Node Metastasis in Head and Neck Cancer by Combining Many-objective
Radiomics and 3-dimensional Convolutional Neural Network through Evidential Reasoning.
Retrieved from http://arxiv.org/abs/1805.07021
Zwanenburg, A., Leger, S., Vallières, M., & Löck, S. (2016). Image biomarker standardisation
initiative. Retrieved from http://arxiv.org/abs/1612.07003
Appendix
Table A1: List of significant PyRadiomics features for PDAC prognosis
Feature HR 95% CI p value
wavelet.LHL_glcm_Contrast 1.607 1.214 ~ 2.128 0.001
wavelet.HLH_glszm_HighGrayLevelZoneEmphasis 0.619 0.465 ~ 0.826 0.001
wavelet.HLH_glszm_LowGrayLevelZoneEmphasis 1.614 1.211 ~ 2.152 0.001
wavelet.LHL_glcm_DifferenceVariance 1.506 1.176 ~ 1.929 0.001
wavelet.LHH_firstorder_Variance 1.55 1.179 ~ 2.038 0.002
gradient_gldm_SmallDependenceEmphasis 1.536 1.172 ~ 2.012 0.002
wavelet.LHL_glcm_SumSquares 1.505 1.161 ~ 1.951 0.002
gradient_glszm_ZonePercentage 1.482 1.153 ~ 1.905 0.002
wavelet.LHH_firstorder_Minimum 0.669 0.517 ~ 0.865 0.002
wavelet.LHL_firstorder_Variance 1.493 1.154 ~ 1.932 0.002
wavelet.LHL_gldm_GrayLevelVariance 1.492 1.153 ~ 1.932 0.002
wavelet.LHH_firstorder_RootMeanSquared 1.563 1.168 ~ 2.091 0.003
wavelet.LHL_glszm_SizeZoneNonUniformityNormalized 1.598 1.171 ~ 2.181 0.003
wavelet.LHH_firstorder_MeanAbsoluteDeviation 1.56 1.161 ~ 2.097 0.003
wavelet.LHL_firstorder_MeanAbsoluteDeviation 1.563 1.16 ~ 2.106 0.003
wavelet.LHL_glrlm_GrayLevelVariance 1.456 1.132 ~ 1.873 0.003
wavelet.LHL_firstorder_RootMeanSquared 1.53 1.148 ~ 2.039 0.004
wavelet.LHH_firstorder_90Percentile 1.515 1.142 ~ 2.008 0.004
gradient_glcm_DifferenceAverage 1.502 1.138 ~ 1.981 0.004
wavelet.LHL_firstorder_90Percentile 1.527 1.144 ~ 2.037 0.004
wavelet.LHL_glcm_DifferenceAverage 1.523 1.137 ~ 2.038 0.005
wavelet.LHL_glszm_SmallAreaEmphasis 1.587 1.149 ~ 2.192 0.005
gradient_ngtdm_Contrast 1.413 1.11 ~ 1.8 0.005
wavelet.LHL_gldm_SmallDependenceEmphasis 1.527 1.135 ~ 2.054 0.005
gradient_gldm_SmallDependenceLowGrayLevelEmphasis 1.447 1.114 ~ 1.879 0.006
wavelet.LLL_glcm_DifferenceAverage 1.495 1.122 ~ 1.992 0.006
wavelet.LHL_ngtdm_Complexity 1.461 1.115 ~ 1.915 0.006
gradient_glcm_JointAverage 1.499 1.123 ~ 2.003 0.006
gradient_glcm_SumAverage 1.499 1.123 ~ 2.003 0.006
wavelet.HLH_glszm_SmallAreaHighGrayLevelEmphasis 0.639 0.464 ~ 0.88 0.006
wavelet.LHL_glcm_ClusterTendency 1.411 1.101 ~ 1.809 0.007
gradient_firstorder_Mean 1.484 1.116 ~ 1.975 0.007
wavelet.LHL_firstorder_RobustMeanAbsoluteDeviation 1.484 1.115 ~ 1.975 0.007
wavelet.LHL_firstorder_10Percentile 0.672 0.504 ~ 0.897 0.007
wavelet.LHH_firstorder_RobustMeanAbsoluteDeviation 1.476 1.111 ~ 1.961 0.007
wavelet.LHL_glcm_DifferenceEntropy 1.514 1.119 ~ 2.048 0.007
wavelet.LHL_glszm_ZonePercentage 1.5 1.116 ~ 2.017 0.007
gradient_glcm_DifferenceEntropy 1.526 1.12 ~ 2.078 0.007
wavelet.HLH_firstorder_MeanAbsoluteDeviation 1.444 1.104 ~ 1.889 0.007
original_glcm_Contrast 1.46 1.106 ~ 1.926 0.007
gradient_glrlm_RunLengthNonUniformityNormalized 1.504 1.114 ~ 2.031 0.008
wavelet.LHL_firstorder_Entropy 1.515 1.116 ~ 2.058 0.008
wavelet.LHL_glcm_SumEntropy 1.51 1.114 ~ 2.047 0.008
original_glcm_DifferenceAverage 1.465 1.104 ~ 1.946 0.008
wavelet.LHL_firstorder_InterquartileRange 1.459 1.101 ~ 1.934 0.009
wavelet.HLL_firstorder_MeanAbsoluteDeviation 1.432 1.094 ~ 1.876 0.009
wavelet.LHH_firstorder_InterquartileRange 1.453 1.097 ~ 1.926 0.009
wavelet.LHL_glcm_JointEntropy 1.488 1.101 ~ 2.011 0.01
wavelet.HHL_glcm_Contrast 1.405 1.086 ~ 1.817 0.01
gradient_glrlm_ShortRunEmphasis 1.533 1.108 ~ 2.121 0.01
wavelet.LHL_glszm_GrayLevelNonUniformityNormalized 0.658 0.478 ~ 0.905 0.01
wavelet.HHL_glcm_SumSquares 1.39 1.08 ~ 1.79 0.011
gradient_glcm_Idm 0.682 0.508 ~ 0.915 0.011
squareroot_glcm_Contrast 1.474 1.094 ~ 1.988 0.011
original_glcm_SumSquares 1.445 1.088 ~ 1.919 0.011
original_glcm_DifferenceEntropy 1.458 1.09 ~ 1.949 0.011
wavelet.HHL_firstorder_RootMeanSquared 1.42 1.082 ~ 1.862 0.011
gradient_glcm_Id 0.683 0.508 ~ 0.918 0.011
wavelet.LHL_glrlm_RunEntropy 1.475 1.091 ~ 1.994 0.011
wavelet.HHL_firstorder_Variance 1.367 1.072 ~ 1.743 0.012
original_glcm_Idm 0.689 0.516 ~ 0.921 0.012
wavelet.LHL_glcm_Id 0.68 0.504 ~ 0.919 0.012
wavelet.HLL_glcm_Imc2 1.433 1.081 ~ 1.898 0.012
wavelet.HHL_glcm_ClusterTendency 1.37 1.071 ~ 1.752 0.012
wavelet.HHL_glcm_DifferenceVariance 1.348 1.067 ~ 1.703 0.012
original_glcm_Id 0.69 0.516 ~ 0.924 0.013
wavelet.HHL_gldm_GrayLevelVariance 1.364 1.069 ~ 1.741 0.013
wavelet.HLL_glcm_DifferenceAverage 1.43 1.079 ~ 1.895 0.013
wavelet.LHH_firstorder_10Percentile 0.7 0.528 ~ 0.927 0.013
wavelet.LHL_glcm_Idm 0.684 0.507 ~ 0.923 0.013
squareroot_glcm_SumSquares 1.459 1.082 ~ 1.966 0.013
original_glcm_JointEntropy 1.459 1.082 ~ 1.967 0.013
original_glrlm_RunLengthNonUniformityNormalized 1.447 1.08 ~ 1.938 0.013
gradient_glszm_SmallAreaEmphasis 1.408 1.072 ~ 1.849 0.014
wavelet.HHL_gldm_SmallDependenceEmphasis 1.425 1.075 ~ 1.89 0.014
wavelet.HHL_glrlm_GrayLevelVariance 1.345 1.061 ~ 1.706 0.014
gradient_glrlm_RunPercentage 1.448 1.077 ~ 1.946 0.014
wavelet.LHL_glrlm_RunLengthNonUniformityNormalized 1.468 1.08 ~ 1.997 0.014
gradient_firstorder_Entropy 1.458 1.078 ~ 1.973 0.014
gradient_firstorder_90Percentile 1.429 1.073 ~ 1.902 0.014
wavelet.LLL_gldm_DependenceNonUniformityNormalized 1.455 1.077 ~ 1.967 0.015
wavelet.LHH_firstorder_Range 1.406 1.069 ~ 1.848 0.015
gradient_glcm_JointEntropy 1.445 1.075 ~ 1.943 0.015
original_firstorder_MeanAbsoluteDeviation 1.469 1.078 ~ 2.001 0.015
wavelet.HHL_firstorder_MeanAbsoluteDeviation 1.398 1.067 ~ 1.831 0.015
wavelet.HHH_firstorder_RootMeanSquared 1.388 1.065 ~ 1.808 0.015
gradient_firstorder_MeanAbsoluteDeviation 1.349 1.06 ~ 1.718 0.015
wavelet.HHH_firstorder_MeanAbsoluteDeviation 1.398 1.066 ~ 1.832 0.015
wavelet.LHL_glrlm_GrayLevelNonUniformityNormalized 0.663 0.476 ~ 0.925 0.015
original_glrlm_ShortRunEmphasis 1.459 1.075 ~ 1.98 0.015
wavelet.LHL_gldm_DependenceEntropy 1.485 1.078 ~ 2.045 0.016
original_glcm_DifferenceVariance 1.372 1.062 ~ 1.773 0.016
wavelet.HHL_glszm_ZonePercentage 1.411 1.067 ~ 1.865 0.016
gradient_glcm_Autocorrelation 1.321 1.053 ~ 1.657 0.016
gradient_gldm_LowGrayLevelEmphasis 0.701 0.524 ~ 0.937 0.016
original_gldm_GrayLevelVariance 1.395 1.063 ~ 1.83 0.016
squareroot_glcm_DifferenceAverage 1.442 1.07 ~ 1.945 0.016
wavelet.HHH_firstorder_10Percentile 0.723 0.554 ~ 0.942 0.016
wavelet.LLL_gldm_SmallDependenceEmphasis 1.405 1.064 ~ 1.854 0.016
wavelet.HLL_firstorder_90Percentile 1.427 1.066 ~ 1.908 0.017
wavelet.LLL_glrlm_RunLengthNonUniformityNormalized 1.445 1.069 ~ 1.953 0.017
wavelet.LHL_firstorder_Uniformity 0.677 0.492 ~ 0.932 0.017
wavelet.LHL_glcm_Imc2 1.414 1.064 ~ 1.879 0.017
squareroot_ngtdm_Complexity 1.405 1.062 ~ 1.857 0.017
wavelet.LHL_glrlm_RunPercentage 1.461 1.069 ~ 1.996 0.017
original_glszm_ZonePercentage 1.405 1.062 ~ 1.86 0.017
original_firstorder_Variance 1.39 1.059 ~ 1.822 0.017
wavelet.HHL_glcm_DifferenceAverage 1.382 1.058 ~ 1.805 0.017
gradient_gldm_LargeDependenceEmphasis 0.697 0.518 ~ 0.939 0.018
wavelet.HLH_firstorder_RootMeanSquared 1.306 1.047 ~ 1.628 0.018
gradient_gldm_LargeDependenceLowGrayLevelEmphasis 0.699 0.52 ~ 0.94 0.018
wavelet.LLL_glcm_Id 0.697 0.516 ~ 0.94 0.018
gradient_glcm_SumEntropy 1.439 1.064 ~ 1.945 0.018
wavelet.HLH_firstorder_90Percentile 1.419 1.061 ~ 1.896 0.018
wavelet.LLL_glszm_ZonePercentage 1.415 1.061 ~ 1.887 0.018
squareroot_glcm_ClusterProminence 1.379 1.056 ~ 1.8 0.018
wavelet.HLL_firstorder_RootMeanSquared 1.31 1.047 ~ 1.64 0.018
wavelet.LHL_glrlm_ShortRunEmphasis 1.488 1.07 ~ 2.071 0.018
original_glrlm_RunPercentage 1.427 1.062 ~ 1.917 0.018
squareroot_glcm_DifferenceVariance 1.427 1.061 ~ 1.917 0.018
wavelet.LLL_glcm_DifferenceEntropy 1.432 1.062 ~ 1.932 0.019
gradient_firstorder_Median 1.411 1.059 ~ 1.88 0.019
wavelet.HHL_glrlm_RunVariance 0.692 0.509 ~ 0.941 0.019
wavelet.LLL_glcm_Idm 0.7 0.519 ~ 0.943 0.019
original_gldm_SmallDependenceEmphasis 1.395 1.056 ~ 1.844 0.019
wavelet.HHL_firstorder_RobustMeanAbsoluteDeviation 1.368 1.052 ~ 1.778 0.019
squareroot_gldm_SmallDependenceHighGrayLevelEmphasis 1.448 1.062 ~ 1.975 0.019
original_firstorder_Entropy 1.444 1.061 ~ 1.964 0.019
squareroot_gldm_GrayLevelVariance 1.428 1.059 ~ 1.927 0.02
squareroot_firstorder_Variance 1.425 1.058 ~ 1.92 0.02
wavelet.HHL_glcm_Id 0.718 0.544 ~ 0.949 0.02
wavelet.HHL_glcm_Idm 0.719 0.544 ~ 0.949 0.02
squareroot_glszm_SizeZoneNonUniformityNormalized 1.44 1.059 ~ 1.957 0.02
wavelet.HHH_firstorder_Variance 1.318 1.044 ~ 1.665 0.02
wavelet.HHL_glrlm_RunPercentage 1.404 1.054 ~ 1.872 0.02
wavelet.HLH_firstorder_RobustMeanAbsoluteDeviation 1.399 1.053 ~ 1.859 0.021
wavelet.LLL_glrlm_ShortRunEmphasis 1.437 1.057 ~ 1.954 0.021
original_glszm_LargeAreaLowGrayLevelEmphasis 0.612 0.404 ~ 0.928 0.021
wavelet.HHH_firstorder_90Percentile 1.367 1.049 ~ 1.781 0.021
wavelet.LHL_ngtdm_Contrast 1.358 1.047 ~ 1.762 0.021
wavelet.HHL_firstorder_InterquartileRange 1.357 1.047 ~ 1.761 0.021
wavelet.HHL_glcm_SumEntropy 1.391 1.05 ~ 1.844 0.022
wavelet.HHL_firstorder_90Percentile 1.362 1.046 ~ 1.773 0.022
wavelet.HHL_glcm_DifferenceEntropy 1.392 1.049 ~ 1.847 0.022
gradient_firstorder_10Percentile 1.392 1.048 ~ 1.848 0.022
original_gldm_LargeDependenceEmphasis 0.704 0.52 ~ 0.951 0.022
wavelet.LHL_glcm_InverseVariance 0.715 0.536 ~ 0.954 0.022
gradient_glszm_LargeAreaLowGrayLevelEmphasis 0.635 0.43 ~ 0.938 0.023
wavelet.LLL_glszm_SizeZoneNonUniformityNormalized 1.364 1.045 ~ 1.78 0.023
wavelet.LLL_glszm_SmallAreaEmphasis 1.387 1.047 ~ 1.837 0.023
wavelet.HLH_firstorder_InterquartileRange 1.391 1.047 ~ 1.848 0.023
wavelet.HHL_firstorder_10Percentile 0.734 0.562 ~ 0.958 0.023
squareroot_glszm_SmallAreaEmphasis 1.452 1.053 ~ 2.001 0.023
gradient_glcm_MaximumProbability 0.715 0.536 ~ 0.954 0.023
wavelet.HHH_firstorder_RobustMeanAbsoluteDeviation 1.356 1.043 ~ 1.764 0.023
wavelet.HHL_glcm_MaximumProbability 0.723 0.547 ~ 0.956 0.023
wavelet.HHL_glcm_JointEntropy 1.385 1.046 ~ 1.835 0.023
gradient_glszm_LargeAreaEmphasis 0.638 0.432 ~ 0.94 0.023
squareroot_glrlm_GrayLevelVariance 1.411 1.048 ~ 1.9 0.023
wavelet.HHL_gldm_LargeDependenceEmphasis 0.716 0.536 ~ 0.956 0.023
wavelet.LHL_gldm_DependenceNonUniformityNormalized 1.37 1.043 ~ 1.8 0.024
gradient_glcm_InverseVariance 1.408 1.047 ~ 1.893 0.024
gradient_glszm_ZoneVariance 0.642 0.438 ~ 0.943 0.024
wavelet.HHL_glrlm_LongRunEmphasis 0.705 0.52 ~ 0.954 0.024
wavelet.HHL_glrlm_RunLengthNonUniformityNormalized 1.379 1.044 ~ 1.823 0.024
wavelet.HHL_firstorder_Entropy 1.384 1.044 ~ 1.835 0.024
wavelet.HHL_firstorder_Uniformity 0.72 0.542 ~ 0.957 0.024
gradient_glszm_SizeZoneNonUniformityNormalized 1.346 1.04 ~ 1.741 0.024
squareroot_gldm_SmallDependenceEmphasis 1.419 1.047 ~ 1.923 0.024
gradient_firstorder_Uniformity 0.707 0.523 ~ 0.955 0.024
squareroot_glcm_ClusterTendency 1.392 1.044 ~ 1.855 0.024
squareroot_gldm_DependenceNonUniformityNormalized 1.398 1.045 ~ 1.87 0.024
original_glcm_ClusterTendency 1.389 1.044 ~ 1.849 0.024
wavelet.LHL_glszm_GrayLevelVariance 1.338 1.039 ~ 1.723 0.024
gradient_glcm_JointEnergy 0.705 0.52 ~ 0.956 0.024
gradient_firstorder_RootMeanSquared 1.33 1.038 ~ 1.706 0.024
wavelet.HHH_firstorder_InterquartileRange 1.349 1.039 ~ 1.751 0.025
wavelet.HLL_firstorder_Entropy 1.401 1.044 ~ 1.88 0.025
wavelet.HHL_glcm_JointEnergy 0.715 0.533 ~ 0.959 0.025
wavelet.LLL_glcm_InverseVariance 0.721 0.542 ~ 0.96 0.025
gradient_gldm_DependenceEntropy 1.437 1.046 ~ 1.975 0.025
wavelet.HHL_glrlm_ShortRunEmphasis 1.394 1.042 ~ 1.866 0.025
wavelet.LHL_gldm_SmallDependenceHighGrayLevelEmphasis 1.374 1.04 ~ 1.815 0.026
wavelet.HHL_gldm_DependenceEntropy 1.387 1.04 ~ 1.848 0.026
wavelet.LHL_gldm_LargeDependenceEmphasis 0.693 0.502 ~ 0.957 0.026
wavelet.LLL_glrlm_RunPercentage 1.418 1.042 ~ 1.928 0.026
gradient_glszm_LargeAreaHighGrayLevelEmphasis 0.653 0.448 ~ 0.951 0.026
wavelet.HLL_firstorder_RobustMeanAbsoluteDeviation 1.382 1.039 ~ 1.839 0.026
wavelet.HLL_gldm_SmallDependenceEmphasis 1.376 1.036 ~ 1.828 0.028
original_firstorder_RobustMeanAbsoluteDeviation 1.399 1.038 ~ 1.886 0.028
wavelet.HLL_glcm_SumEntropy 1.394 1.037 ~ 1.873 0.028
wavelet.HHL_glrlm_GrayLevelNonUniformityNormalized 0.718 0.534 ~ 0.964 0.028
gradient_firstorder_RobustMeanAbsoluteDeviation 1.37 1.035 ~ 1.814 0.028
squareroot_glszm_ZonePercentage 1.412 1.038 ~ 1.922 0.028
original_glcm_SumEntropy 1.41 1.037 ~ 1.916 0.028
wavelet.HLL_glcm_DifferenceEntropy 1.386 1.035 ~ 1.856 0.028
original_glcm_JointEnergy 0.703 0.513 ~ 0.964 0.028
gradient_glrlm_ShortRunLowGrayLevelEmphasis 1.411 1.036 ~ 1.921 0.029
wavelet.LLL_ngtdm_Contrast 1.257 1.024 ~ 1.543 0.029
original_glrlm_LongRunLowGrayLevelEmphasis 0.713 0.525 ~ 0.967 0.03
wavelet.HLL_ngtdm_Busyness 0.685 0.487 ~ 0.964 0.03
squareroot_glcm_InverseVariance 0.714 0.526 ~ 0.968 0.03
original_gldm_DependenceNonUniformityNormalized 1.346 1.029 ~ 1.762 0.03
original_ngtdm_Contrast 1.331 1.027 ~ 1.725 0.031
squareroot_glcm_Id 0.708 0.517 ~ 0.97 0.031
wavelet.HLL_firstorder_10Percentile 0.741 0.564 ~ 0.974 0.032
original_firstorder_Uniformity 0.711 0.521 ~ 0.971 0.032
wavelet.HLH_firstorder_10Percentile 0.738 0.559 ~ 0.974 0.032
wavelet.LHL_glcm_JointEnergy 0.683 0.481 ~ 0.969 0.033
gradient_gldm_HighGrayLevelEmphasis 1.278 1.02 ~ 1.601 0.033
original_gldm_LargeDependenceLowGrayLevelEmphasis 0.715 0.526 ~ 0.973 0.033
original_gldm_SmallDependenceHighGrayLevelEmphasis 1.335 1.023 ~ 1.742 0.033
wavelet.HLL_glcm_JointEntropy 1.37 1.024 ~ 1.832 0.034
wavelet.HLL_glszm_GrayLevelNonUniformityNormalized 0.727 0.541 ~ 0.977 0.035
squareroot_glcm_Idm 0.71 0.517 ~ 0.976 0.035
wavelet.HLL_glcm_Id 0.732 0.547 ~ 0.978 0.035
original_glrlm_GrayLevelNonUniformityNormalized 0.714 0.521 ~ 0.977 0.035
wavelet.LLL_glcm_JointEntropy 1.385 1.023 ~ 1.874 0.035
wavelet.HLL_firstorder_InterquartileRange 1.355 1.021 ~ 1.799 0.036
wavelet.HLL_glcm_Idm 0.734 0.55 ~ 0.98 0.036
gradient_glrlm_GrayLevelNonUniformityNormalized 0.731 0.545 ~ 0.98 0.036
gradient_gldm_DependenceNonUniformity 0.711 0.517 ~ 0.978 0.036
gradient_firstorder_InterquartileRange 1.349 1.019 ~ 1.785 0.036
squareroot_glcm_DifferenceEntropy 1.386 1.02 ~ 1.885 0.037
original_firstorder_InterquartileRange 1.37 1.019 ~ 1.843 0.037
squareroot_firstorder_MeanAbsoluteDeviation 1.38 1.019 ~ 1.869 0.038
wavelet.LLL_glcm_SumSquares 1.28 1.014 ~ 1.616 0.038
original_glszm_LargeAreaEmphasis 0.669 0.458 ~ 0.978 0.038
wavelet.HLL_glszm_SmallAreaEmphasis 1.387 1.018 ~ 1.89 0.038
wavelet.HLL_glszm_ZonePercentage 1.35 1.016 ~ 1.793 0.038
original_glrlm_LongRunEmphasis 0.707 0.509 ~ 0.982 0.038
wavelet.HLL_glrlm_GrayLevelNonUniformityNormalized 0.725 0.535 ~ 0.983 0.038
wavelet.LLL_gldm_LargeDependenceEmphasis 0.714 0.519 ~ 0.982 0.039
original_glszm_ZoneVariance 0.67 0.459 ~ 0.979 0.039
wavelet.HHL_ngtdm_Complexity 1.302 1.013 ~ 1.674 0.039
wavelet.LHL_glszm_SmallAreaHighGrayLevelEmphasis 1.341 1.014 ~ 1.773 0.04
logarithm_ngtdm_Complexity 1.374 1.015 ~ 1.862 0.04
wavelet.HLL_glrlm_RunPercentage 1.36 1.014 ~ 1.824 0.04
wavelet.HLL_firstorder_Uniformity 0.734 0.546 ~ 0.987 0.041
logarithm_glcm_ClusterProminence 1.376 1.013 ~ 1.87 0.041
wavelet.LHL_glrlm_LongRunEmphasis 0.681 0.471 ~ 0.985 0.041
wavelet.LHL_glrlm_ShortRunHighGrayLevelEmphasis 1.344 1.011 ~ 1.786 0.042
wavelet.LLL_glszm_LargeAreaLowGrayLevelEmphasis 0.689 0.481 ~ 0.987 0.042
wavelet.HLL_glrlm_RunLengthNonUniformityNormalized 1.354 1.011 ~ 1.815 0.042
squareroot_glszm_SmallAreaHighGrayLevelEmphasis 1.379 1.011 ~ 1.882 0.042
gradient_glrlm_LowGrayLevelRunEmphasis 0.733 0.542 ~ 0.99 0.043
wavelet.LLL_glcm_ClusterTendency 1.288 1.008 ~ 1.647 0.043
wavelet.LLL_glcm_Contrast 1.254 1.007 ~ 1.562 0.043
wavelet.LHL_gldm_GrayLevelNonUniformity 0.736 0.546 ~ 0.991 0.043
wavelet.LLL_firstorder_Variance 1.274 1.007 ~ 1.612 0.044
wavelet.LLL_gldm_GrayLevelVariance 1.274 1.007 ~ 1.611 0.044
squareroot_glrlm_RunLengthNonUniformityNormalized 1.394 1.009 ~ 1.925 0.044
wavelet.LHL_glcm_MaximumProbability 0.718 0.519 ~ 0.991 0.044
logarithm_glszm_SizeZoneNonUniformityNormalized 1.355 1.008 ~ 1.823 0.044
wavelet.LHL_glrlm_RunVariance 0.693 0.485 ~ 0.991 0.045
wavelet.HLL_gldm_LargeDependenceEmphasis 0.738 0.549 ~ 0.993 0.045
squareroot_glszm_GrayLevelVariance 1.339 1.006 ~ 1.782 0.045
original_glcm_MaximumProbability 0.725 0.529 ~ 0.993 0.045
logarithm_glszm_SmallAreaEmphasis 1.371 1.006 ~ 1.869 0.046
squareroot_gldm_GrayLevelNonUniformity 0.739 0.549 ~ 0.994 0.046
original_glrlm_GrayLevelVariance 1.295 1.005 ~ 1.669 0.046
wavelet.HLL_glszm_SizeZoneNonUniformityNormalized 1.337 1.005 ~ 1.777 0.046
wavelet.LLL_firstorder_MeanAbsoluteDeviation 1.35 1.006 ~ 1.813 0.046
wavelet.HLL_glrlm_ShortRunEmphasis 1.358 1.004 ~ 1.837 0.047
squareroot_firstorder_RobustMeanAbsoluteDeviation 1.346 1.004 ~ 1.806 0.047
wavelet.LHH_firstorder_TotalEnergy 1.313 1.003 ~ 1.719 0.048
squareroot_firstorder_90Percentile 1.334 1.003 ~ 1.775 0.048
wavelet.LHL_firstorder_TotalEnergy 1.317 1.002 ~ 1.731 0.048
gradient_glcm_Contrast 1.251 1.002 ~ 1.563 0.048
wavelet.LLL_gldm_SmallDependenceHighGrayLevelEmphasis 1.332 1.002 ~ 1.77 0.048
wavelet.HHL_glcm_InverseVariance 0.773 0.598 ~ 0.999 0.049
squareroot_glrlm_RunPercentage 1.399 1.001 ~ 1.954 0.049
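Each row above reports, for one radiomic feature, the hazard ratio (HR), its 95% confidence interval, and the p-value from a univariate Cox proportional hazards model. As a minimal sketch of how such values are obtained, the snippet below fits a one-covariate Cox model by Newton-Raphson on the partial likelihood and derives the Wald statistics; the function name `univariate_cox` is illustrative, the sketch assumes no (or negligible) tied event times, and in practice a validated library such as lifelines' `CoxPHFitter` would be used instead.

```python
import math
import numpy as np

def univariate_cox(x, time, event, n_iter=25):
    """Fit a one-covariate Cox model by Newton-Raphson and return
    (HR, 95% CI lower, 95% CI upper, Wald p-value).

    Assumes event times are untied (tie handling is omitted for brevity).
    """
    order = np.argsort(-time)            # sort by descending time so each
    x, time, event = x[order], time[order], event[order]  # risk set is a prefix
    beta = 0.0
    for _ in range(n_iter):
        w = np.exp(beta * x)             # per-subject risk weights exp(beta * x)
        s0 = np.cumsum(w)                # sums over the risk set {j : t_j >= t_i}
        s1 = np.cumsum(w * x)
        s2 = np.cumsum(w * x * x)
        d = event.astype(bool)           # contributions only at event times
        U = np.sum(x[d] - s1[d] / s0[d])             # score (first derivative)
        I = np.sum(s2[d] / s0[d] - (s1[d] / s0[d]) ** 2)  # observed information
        beta += U / I                    # Newton step toward the MLE
    se = 1.0 / math.sqrt(I)              # standard error from the information
    z = beta / se                        # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return (math.exp(beta),
            math.exp(beta - 1.96 * se),  # 95% CI on the hazard-ratio scale
            math.exp(beta + 1.96 * se),
            p)
```

On a toy cohort where subjects with the higher feature value tend to fail earlier, the fitted HR exceeds 1 and sits inside its confidence interval, mirroring the layout of the rows above (HR, CI lower ~ upper, p).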