Melanoma Detection using deep learning methods
Joana Barbosa Miranda de Garcia Baleiras
Instituto Superior Técnico, Lisbon, Portugal
June 2019
ABSTRACT
Melanoma stands out as the deadliest skin cancer due to its
propensity to metastasize. Early detection is crucial to deliver
successful treatment. Computer-Aided Diagnosis (CAD)
systems have been developed to provide an automated
melanoma detection. This paper proposes a new CAD system
to distinguish between melanoma and benign skin lesions. The
novelty lies in the simultaneous use and comparison of three
different Convolutional Neural Networks (CNN), namely
AlexNet, VGG and ResNet. These CNN were trained using
transfer learning and data augmentation. The image dataset
was extracted from the ISIC 2017 Challenge. The original
dataset contains 2000 training images (1626 benign lesions
and 374 melanomas), 150 validation images (120 benign
lesions and 30 melanomas) and 600 test images (483 benign
lesions and 117 melanomas). The training set was then
augmented and contained 1626 benign examples and 1870
malignant examples. For the test set of 600 images, ResNet reached
a sensitivity of 79%, a specificity of 60%, a balanced accuracy
of 69% and an area under the receiver operating characteristic
curve of 77%. Hence, the proposed system demonstrates the
potential of CNN in CAD systems for automated melanoma
detection.
Keywords — Melanoma, CAD system, automated
melanoma detection, CNN, AlexNet, VGG, ResNet, transfer
learning, data augmentation, ISIC Challenge
1 INTRODUCTION
Skin cancer is a serious public health problem [1]. The IARC1
estimated 1.3 million new skin cancer cases and 125 thousand
skin cancer deaths worldwide in 2018 alone [2].
Among the various types of skin cancer, melanoma stands
out as the deadliest of all. Globally, melanoma accounts for
75% of skin cancer deaths [3,4], although it only accounts for
1% of all skin cancers [5].
Melanoma can be treated with a simple excision if it is
detected in its early stage (melanoma in situ or stage I of
melanoma) [3,5,6]. Otherwise, with signs of metastasis spread
to other parts of the body, only 20% of people survive the first
five years after being diagnosed, according to the American
Cancer Society (ACS) [7]. Therefore, the early detection of
melanoma is vital.
Clinically, the diagnosis of skin lesions is performed by
visual examination, with or without the help of a dermatoscope,
an instrument that allows for detailed visualisation of diagnostic
structures that are not detectable by the human eye [8]. When a
skin lesion is ambiguous, dermatologists may call for a
histological analysis.
1 The International Agency for Research on Cancer (IARC) is
the World Health Organization's agency specialized in
carcinogenic issues.
However, the clinical method is subjective since it depends
on the doctor’s experience and his/her visual acuity. Therefore,
there has been an endeavour to develop automated systems for
the diagnosis of skin lesions, called Computer-Aided
Diagnosis (CAD) systems [4,9,10,11,12,13]. Generally, a
CAD system comprises four main steps: “image pre-
processing”, “lesion segmentation”, “feature extraction (and
selection)” and “classification”. The last step can be performed
with one or a few Machine Learning (ML) methods. Common
ML methods implemented in CAD systems to classify skin
lesions are Artificial Neural Networks (ANN), Support Vector
Machines (SVM) and Decision Trees (DT) [14,15,16].
In recent years, a special type of deep ANN (called Deep
Learning, DL), the “convolutional neural networks” (CNN),
has emerged as the favourite method to detect and classify
objects due to their success triggered by the annual ImageNet
Large-Scale Visual Recognition Challenge (ILSVRC) [17]. In
a CAD system, a CNN can work either as the method in one of
the four steps (e.g., as the classifier in the classification step)
or as an all-in-one, end-to-end approach covering all steps.
However, a huge image dataset is crucial to produce reliable
CNN results. For this reason, only a few works on CNN-based
CAD systems for melanoma detection exist. Section 2.2
presents those works.
The first public database of melanoma images became
available merely in 2016 with the International Skin Imaging
Collaboration (ISIC) Challenge [18].
The principal motivation of this work is to take advantage
of the available ISIC database and to develop a CNN-based
CAD system for melanoma diagnosis. This work tests and
compares three different types of CNN: AlexNet, VGG and
ResNet [19,20,21]. The objective of this CAD system is to
classify skin lesions as melanoma or non-melanoma and to
promote the use of CNN for the early diagnosis of melanoma.
This paper is organised as follows. Section 2 addresses the
melanoma disease, and presents the clinical and automatic
detection methods in use. Section 3 overviews the CNN
structure, in general, and the specific architecture of the three
particular CNN used in this work. Section 4 describes the
proposed melanoma detection system. Section 5 discusses
the experimental results. Finally, Section 6 concludes and
suggests possible future work avenues.
2 MELANOMA DETECTION
2.1 The Melanoma
Cancer is caused by abnormal and unexpected development of
human cells [22]. Skin cancer, in particular, develops from
abnormal epidermal or dermal cells, the epidermis and the dermis
being the two principal layers of the skin [3]. Basal cell
carcinoma, squamous cell carcinoma and melanoma are the
three most common skin cancers [23].
Skin lesions, or moles/nevi, fall into two categories:
melanocytic and non-melanocytic. Pigmentary lesions
(agglomerates of melanocytic cells) represent the former. The
latter includes all types of skin lesions that are not formed by
melanocytic cells, such as seborrheic keratosis, vascular nevi
and dermatofibromas. Each skin lesion is described by a set of
dermatoscopic structures (e.g., asymmetry, (ir)regular
dots/globules, and (a)typical pigment network) and sometimes
malignant and benign lesions share a few of them.
Melanoma belongs to the melanocytic category. This tumour is
triggered by a malignant growth of melanocytic cells, which
are located in the epidermis [24] and are responsible for
melanin production, i.e., the pigment that gives colour to the
skin and hair [3]. Until an abnormal development occurs, a
melanocytic lesion is benign. Otherwise, with signs of
malignancy, it is called a "malignant melanoma" (or
"melanoma" for simplicity) and, in advanced stages, the
patient may not survive [3].
There are a few clinical methods to detect melanoma, such
as the ABCD Rule [25] and the Seven-Point Checklist [26].
(The interested reader may refer to Section 2.4 in [27].) Since
clinical methods are subjective and risk missing melanoma
in its early stage, automatic algorithms (CAD
systems) for melanoma detection have recently been developed
to assist with the clinical diagnosis. Section 2.2 briefly reviews
these previous CAD system works on melanoma diagnosis.
2.2 Automatic Methods – CAD systems
The difficulty of correct visual examination of
cutaneous lesions by inexperienced dermatologists [12] and
the need to accelerate the melanoma detection led researchers
to develop computer-aided diagnosis (CAD) systems for
automated melanoma detection. The main purposes of these
systems are to help dermatologists in their clinical evaluation
of dermatoscopic lesions [28] and to speed-up the screening
process, by allowing a pre-selection of suspicious lesions and
a faster detection. Hence, these systems may increase the
diagnostic accuracy.
2.2.1 The four steps of classic CAD systems
CAD systems proceed in four consecutive steps or blocks:
pre-processing, segmentation, feature extraction and selection,
and classification [9,29]. The last step relies on ML algorithms
(such as SVM, ANN, logistic regression, k-nearest neighbours
and decision trees) and the first three steps rely on image
processing methods (such as Gaussian and median filters,
colour constancy and texture techniques).
In the Pre-processing step, image acquisition imperfections
(such as hairs, air bubbles and acquisition symbols) are
eliminated from each dermatoscopic image and the contrast
between the skin lesion and the surrounding skin is enhanced
[11].
The Segmentation block splits the dermatoscopic image
into two regions: one region associated with the skin lesion and
the other region associated with healthy skin [11]. The border
between these two regions is labelled the “contour of the
lesion”.
The third step focuses on the extraction of a set of features
that allows distinguishing melanoma from non-melanoma
lesions. It is important to mention that the term feature may
have two distinct meanings. On the one hand, it can represent
a dermatoscopic structure. On the other, it can be an image
characteristic, such as colour, texture, and geometry (e.g., size,
area, and perimeter) [30]. A CAD system can extract both
clinical and image features. Generally, following the feature
extraction step, a Feature Selection comes into action to
eliminate redundant features or features without meaningful
information about melanoma detection.
Finally, the Classification block predicts the skin lesion
diagnosis. There are two types of outputs: binary diagnosis
(two classes: benign and malignant) and n-ary diagnosis (n
classes). The more accurate the set of extracted features, the
more correct the classification [11]. The accuracy of the
extracted feature set measures how well that set distinguishes
across skin lesions.
2.2.2 The importance of a large image dataset
A large enough database of real dermatoscopic images is a
crucial requirement to produce reliable automatic detections of
melanoma. Thus, image databases are required, and each
image must be segmented and annotated (i.e. labelled with the
diagnosis of the image lesion) by experienced dermatologists
[28]. When a database contains only a few images, it is
common to augment training data artificially by rotating,
flipping or producing non-linear distortions of the original
images, for example [1]. These data augmentation (DA)
procedures tend to improve the diagnostic accuracy of the
system [31].
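As a rough illustration of such augmentation (a sketch, not the authors' implementation), simple flips and 90° rotations can be generated from an image stored as a 2-D list of pixel values:

```python
# Sketch of simple data augmentation on a tiny 2-D "image"
# (a list of rows); real systems operate on full RGB arrays.

def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the order of the rows."""
    return img[::-1]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

image = [[1, 2],
         [3, 4]]

augmented = [image,
             flip_horizontal(image),
             flip_vertical(image),
             rotate_90(image)]
print(len(augmented))  # 4 variants of the original image
```

Each transformation preserves the lesion's diagnosis label, which is what makes these geometric distortions safe augmentations.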
2.2.3 CNN-based CAD systems: literature review
In the past few years, DL has been applied to several
domains of study, namely natural language processing [32],
image processing (such as in object detection [33]) and medical
image analysis [34]. In the latter field, CNN have been widely
used as an automatic method because CNN can learn to extract
and classify digital images. For melanoma detection, CNN
may learn by themselves which features perform better to
distinguish melanoma from non-melanoma lesions; thus, this
method can accomplish an automated melanoma diagnosis
[1,13,35], whose diagnostic accuracy compares to the
dermatologist's diagnostic accuracy [13]. Hence, CNN-based
CAD systems may help dermatologists to diagnose melanoma
in its early stage of the clinical setting [13]. However, their
results are still not totally reliable. In addition, there are only a
few works using CNN in CAD systems due to the lack of
public databases with a sufficient number of dermatoscopic
images. A large public database for melanoma
detection appeared only in 2016, associated with the ISIC
Challenge, which was launched that year with the double
purpose of providing the community with the largest set of
publicly available high-quality dermatoscopic images [18] and
of promoting an annual competition to develop automated
melanoma diagnosis algorithms. The following paragraphs
review particular studies that make the case for CNN-based
CAD diagnostic systems.
Codella et al. proposed a system that became a new state-
of-the-art in 2017 [1]. The authors combined machine learning
techniques with neural networks to develop a CAD system that
performs three steps: lesion segmentation, feature extraction
and classification. The dataset was augmented. The
classification results reach 64.5% for average precision and
83.8% for the area under the receiver operating characteristic
curve (ROC-AUC, explained in Section 5.2).
Esteva et al. published a remarkable study by applying deep
neural networks to melanoma diagnosis [13]. The authors
employed a single CNN-based CAD system to perform three
classification tasks: differentiating keratinocyte
carcinomas from benign seborrheic keratoses, distinguishing
melanoma from non-melanoma using only clinical images, and
distinguishing melanoma from non-melanoma using
dermatoscopic images [13]. For each task, the performance of
the CNN exceeded the average performance of 21 certified
dermatologists. In the first task, this CNN reached a ROC-AUC
of 96%; in the second, it achieved 94%; and in the last one, it
attained 91%. However, the dataset for each task was different:
135 images, 130 clinical images (33 melanomas and 97 benign
nevi), and 111 dermatoscopic images (71 melanomas and 40
benign nevi), respectively.
Haenssle et al. presented in 2018 a study in the Annals of
Oncology (a medical journal of oncology) [34] that placed the
CNN diagnostic accuracy above that of a group of 58
dermatologists, similarly to Esteva's work.
The classification task only required distinguishing between
melanoma and non-melanoma in a set of 100 dermatoscopic
images. The dermatologists were given two opportunities
to perform the clinical diagnosis, called level I and level II
diagnoses. In level I, they had to provide the diagnosis and
prescribe what the patient should do. Level II diagnoses were
conducted four weeks later; the same doctors had to classify
the same lesions and perform the same task, but this time
they were given more information about the patient as well as
the chance to visualise each lesion more closely. Even when
compared to level II results, the CNN performed better. While
in level I dermatologists correctly identified 86.6% of all
melanomas (i.e., a sensitivity, SE, of 86.6%) and 71.3% of all
benign nevi (i.e., a specificity, SP, of 71.3%), and in level II
they achieved an SE of 88.9% and an SP of 75.7%, the CNN
detected 95% of melanomas. Hence, the CNN outperformed
dermatologists at melanoma detection.
CNN-based CAD systems are thus a very promising
method for melanoma diagnosis. Their main objectives are to
help general practitioners to detect melanoma as early as
possible [13,35] and to avoid unnecessary excisions [12].
3 CONVOLUTIONAL NEURAL NETWORKS
3.1 Main Concepts
The proposed solution in this work uses neural networks and
CNN concepts. The explanation of the envisaged solution in
Section 4 will benefit from a brief introduction to these notions.
CNN are neural networks that apply convolution operations
to the input images. Neural networks consist of a set of neuron
layers. The concept of “artificial neurons” was introduced by
McCulloch and Pitts in 1943 [36]. A few other models of
neural networks were proposed in the following years. The
multilayer perceptron (MLP) stands out as it is the basis of
many neural networks models in current days. It consists of
several layers of neurons [37,38,39]. There are three types of
layers: input layer, hidden layer and output layer. An artificial
neuron involves two steps: i) a linear combination of inputs; ii)
a non-linear function is applied to activate the neuron so that
the signal can proceed to the next neuron. In CNN, the most
frequent activation function is the Rectified Linear Unit
(ReLU) [19].
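The two steps of an artificial neuron can be sketched in a few lines (a minimal illustration, not any network's actual implementation):

```python
# Sketch of a single artificial neuron: a linear combination of
# inputs followed by a non-linear activation (here, ReLU).

def relu(x):
    """Rectified Linear Unit: passes positive values, zeroes the rest."""
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """Step i) weighted sum of the inputs; step ii) non-linear activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

print(neuron([1.0, -2.0], [0.5, 0.25], 0.1))  # 0.5 - 0.5 + 0.1 = 0.1
```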
The training phase in a feed-forward neural network, such
as a CNN, is done in a supervised way, which consists of
computing the error (called “cost function”) between the
desired and the network outputs for a training set. The purpose
of the training step is to find a local minimum of the cost
function [40]. This minimisation is usually performed by the
gradient-descent algorithm, which updates the weights (the
connections between neurons) at each training iteration. This algorithm
works in two phases: forward propagation and
backpropagation. In the first one, the input signal is passed
from the input to the output layers through hidden layers, and
the error between the desired and the network outputs is
computed. The second phase features the calculation of the
gradient, which is normally performed by the Backpropagation
algorithm [ClassesofNeuralNet1998]. One way to accelerate
the training phase is to use more efficient gradient optimization
algorithms, such as Adam optimizer [41].
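The core weight update behind gradient descent can be illustrated on a toy one-dimensional cost function (a sketch only; Adam adds per-weight adaptive step sizes on top of this basic rule):

```python
# Sketch of the basic gradient-descent update used during training:
# w <- w - lr * dE/dw, minimising a toy cost E(w) = (w - 3)^2.

def gradient(w):
    """Analytic derivative of E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
for _ in range(100):           # 100 "training iterations"
    w -= lr * gradient(w)      # step in the direction of steepest descent

print(round(w, 3))  # converges towards the minimum at w = 3
```

In a real network the gradient of the cost with respect to every weight is obtained by the Backpropagation algorithm rather than analytically as here.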
There are a few weight initialisation techniques [42], such
as transfer learning. This technique consists of reusing the
weights of an already trained network for a different, yet
similar, problem [43].
The training dataset must be large enough so that the
network does not memorise the training examples; otherwise,
overfitting will occur. If the dataset is small, a few
regularisation techniques may help to prevent overfitting [44],
like weight decay and dropout [45].
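Dropout can be sketched as follows (an illustrative "inverted dropout" variant, assuming activations are held in a plain list; the rescaling by 1/(1−p) keeps the expected activation magnitude unchanged):

```python
# Sketch of (inverted) dropout regularisation: during training each
# activation is zeroed with probability p, and the survivors are
# rescaled so the expected output magnitude stays the same.

import random

def dropout(activations, p, rng):
    """Zero each activation with probability p; scale the rest by 1/(1-p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed, for reproducibility
acts = [0.5, 1.0, -0.2, 0.8]
print(dropout(acts, p=0.5, rng=rng))  # roughly half the units are zeroed
```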
The classification or generalisation phase consists of
applying the trained model to classify a different dataset (test
set). If the network correctly classifies the majority of test
examples, then it has learned well during the training phase
[46].
3.2 CNN Architecture
Given an image dataset, CNN can learn by themselves the most
significant features to extract from each image. In their first
hidden layers, CNN typically extract low-level features, such
as edges, corners and curves, and then extract more and more
high-level features, such as an object or pieces of objects
[47,48].
Figure 1 depicts a CNN example2. Compared with an MLP,
the fully-connected layers are replaced by convolutional
layers in a CNN. In convolutional layers, there are 3D
convolution operations between the output of a previous layer
and a set of kernels/filters [48]. The result of a convolution is
called a “feature map”. After a convolutional layer, in which
neurons are activated through an activation function (e.g.
ReLU), there is a pooling layer. This layer is responsible for
reducing the spatial dimensions of the receiving signal by
splitting the input map into non-overlapping cells and
replacing each cell by a number (e.g., average number or
maximum [49]). After a sequence of convolutional and pooling
layers, there are a few fully-connected layers. While the former
focus on extracting the relevant features, the role of fully-
connected layers is the classification. The neurons of these
layers combine the extracted features and convert them into
class scores. This conversion is done by the softmax function,
instead of the ReLU function.
Figure 1 – A simple CNN architecture.
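Two of the operations described above, max pooling over non-overlapping cells and the softmax conversion of class scores into probabilities, can be sketched directly (illustrative toy code, not a full CNN):

```python
import math

def max_pool_2x2(fmap):
    """Replace each non-overlapping 2x2 cell of a feature map by its maximum."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool_2x2(fmap))       # [[4, 2], [2, 8]]
print(softmax([2.0, 1.0]))      # higher score -> higher probability
```

Note how pooling halves each spatial dimension while keeping the strongest responses, and how softmax outputs are directly interpretable as class probabilities.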
CNN rose to prominence thanks to the
ImageNet Large-Scale Visual Recognition Challenge
(ILSVRC) in 2012 [17] and to the development of Graphics
Processing Units (GPUs). That challenge focuses on
algorithms for object detection and image classification and
offers a huge public image dataset with 1000 object categories.
Until 2012, the winners of this contest presented hand-crafted
feature methods. Since that year, the winning algorithms have
been CNN and, hence, nowadays they are widely used in
several domains of study. AlexNet, VGG and ResNet are three
CNN presented in ILSVRC at the 2012, 2014 and 2015
editions [19,20,21], respectively, and each of them is used in
this work to build a CNN-based CAD system.
2 This figure was inspired by two other figures taken from:
www.adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-
Guide-To-Understanding-Convolutional-Neural-Networks and
www.medium.com/RaghavPrabhu/understanding-of-convolutional-
neural-network-cnn-deep-learning-99760835f148
3.3 Three particular CNN: AlexNet, VGG and
ResNet
This paper's proposed detection system uses those three
CNN, so the following paragraphs briefly describe each of
them.
AlexNet [19] was the most promising CNN at the time,
winning with a top-5 error rate of 15.3% (the top-5 error rate
is the fraction of test images for which the correct label is not
among the five labels with the highest predicted probabilities).
It contains five convolutional (conv) and three fully-connected
(fc) layers. Three of the conv layers are followed by a pooling
layer. The activation function is ReLU, except in the last fc
layer, which uses the softmax function.
VGG [20] was not the winner in the 2014 edition, but this
CNN highlighted the importance of deepness in a CNN and
achieved a top-5 error rate of 7.3% with the deepest
configuration. VGG is a deep neural network and has a few
possible configurations. VGG-16 is one of them and contains
13 conv and three fc layers. The activation function is ReLU,
except in the last fc layer that uses the softmax function.
ResNet [21] was the winner in 2015, attaining a top-5
error rate of 3.57% with its deepest configuration. Although
this CNN is much deeper than VGG, it has lower complexity
[21] thanks to the introduction of residual blocks, whose
shortcut connections allow the signal to skip some layers.
There are also a few possible ResNet configurations.
ResNet-50 is one of them, with 49 conv layers and one fc
layer. The activation function is ReLU, except in the last fc
layer, which uses the softmax function.
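The idea behind a residual block can be sketched with scalar "layers" (a toy illustration only; real blocks operate on tensors with convolutional weights):

```python
# Sketch of the idea behind a ResNet residual block: the block learns a
# residual function F(x), and its input is added back through a "skip
# connection", so the output is relu(F(x) + x).

def relu(x):
    return max(0.0, x)

def residual_block(x, w1, w2):
    """Two toy 'layers' F(x) = w2 * relu(w1 * x), plus the skip connection."""
    fx = w2 * relu(w1 * x)
    return relu(fx + x)        # identity shortcut added before the activation

# With near-zero weights the block behaves like the identity mapping,
# which is part of what makes very deep networks easier to train.
print(residual_block(2.0, w1=0.0, w2=0.0))  # 2.0
```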
4 THE PROPOSED MELANOMA
DETECTION SYSTEM
4.1 Problem Formulation
The purpose of this paper is to present a CAD system for
melanoma detection. The idea is to distinguish melanoma
from two benign skin lesion types: melanocytic nevi and
seborrheic keratoses. The latter are non-melanocytic. The two
benign skin lesions are herein mentioned as “non-melanoma”,
because the system only identifies lesions as melanoma and
non-melanoma. It is not designed to differentiate between the
two types of benign lesions. Given a dermatoscopic image, the
system classifies it as a melanoma or as a non-melanoma.
Thus, this is a binary classification problem, in which the
training and test sets correspond to dermatoscopic images of
skin lesions.
The problem under study faces three difficulties. First, the
skin lesions, that the system has to differentiate, share a few
morphological features. Second, images in the database were
obtained with different image acquisition methods. Hence,
there are differences in colour and luminance, as
well as imperfections in some images (e.g., hairs and strange
objects) associated with the quality of the acquisition method
that may impair the results. These differences may lead to a
misunderstanding of the relevant features to extract. Third, the
training set of the image database is an imbalanced set because
the number of benign examples is much higher than the
number of malignant examples. Therefore, it is essential to
develop a pre-processing step to mitigate these difficulties.
The image database was downloaded from the 2017 edition
of the International Skin Imaging Collaboration (ISIC)
challenge [18], as explained in Section 5.1.
4.2 System Architecture
Figure 2 presents the system built for automated melanoma
diagnosis.
Figure 2 – The proposed CAD system.
For each input dermatoscopic image, the system outputs a
decision: melanoma or non-melanoma. This system was
implemented for each CNN used. The following steps must be
run before that decision is made. First, the full dataset is
pre-processed so that each image contains the skin lesion only.
Second, the training phase is executed. The pre-processed
training set is artificially augmented and, then, the CNN is
trained to distinguish between melanoma and non-melanoma.
The performance of the CNN is evaluated by two metrics using
the validation set. At the end of the training, the best
configuration of the CNN is selected. Then, in the
classification phase, the system classifies each pre-processed
test image as melanoma or non-melanoma, using the best
configuration.
In the pre-processing block, all images are converted into
the CNN’s image format and then normalized. The pre-
processing modifications are:
Segmentation: this transformation calculates the binary
mask of each dermatoscopic image, but all images of
the database already contain a segmentation mask.
Thus, this phase was not implemented.
Cropping: each image is cropped around the skin mole.
First, the bounding box for each image is calculated
from the segmentation mask. Then, the crop is done
using the dimensions of the bounding box.
Resizing: the input image of each CNN has 224x224
pixels. Therefore, each image is resized to that size.
Normalization: In each CNN there is a specific required
normalization method. In AlexNet the average intensity
calculated from all images in the training set is
subtracted at each pixel. Then, each pixel is divided by
the standard deviation. In VGG and ResNet, images are
mean-subtracted, i.e., the average intensity of each
colour channel is subtracted at each pixel.
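Two of these pre-processing steps, cropping via a bounding box derived from the segmentation mask and mean subtraction, can be sketched on a grayscale toy image (a sketch under the assumption of 2-D lists; the real system works on RGB images and per-channel means):

```python
# Sketch of two pre-processing steps on a grayscale "image" (2-D list):
# cropping with a bounding box computed from the binary segmentation
# mask, and mean subtraction (the VGG/ResNet-style normalisation).

def bounding_box(mask):
    """Smallest (row0, row1, col0, col1) box enclosing the mask's 1-pixels."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop(img, box):
    r0, r1, c0, c1 = box
    return [row[c0:c1 + 1] for row in img[r0:r1 + 1]]

def mean_subtract(img):
    """Subtract the average intensity from every pixel."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [[p - mean for p in row] for row in img]

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
img = [[9, 9, 9, 9],
       [9, 2, 4, 9],
       [9, 6, 8, 9],
       [9, 9, 9, 9]]
lesion = crop(img, bounding_box(mask))   # [[2, 4], [6, 8]]
print(mean_subtract(lesion))             # [[-3.0, -1.0], [1.0, 3.0]]
```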
Data augmentation is frequently used for imbalanced
datasets because it improves the classifier's performance
and helps to prevent overfitting [50]. Hence, since the training
set is imbalanced, the set of malignant lesion images was
augmented by rotations and vertical and horizontal flips.
Additionally, a weight value was assigned to each lesion type,
taking into account the total number of images in the new
training set and the total number of images of the
corresponding type ((1) for benign lesions and (2) for
malignant lesions).
weight_benign = (# images in augmented training set) /
                (# benign images in augmented training set)      (1)

weight_malign = (# images in augmented training set) /
                (# malignant images in augmented training set)   (2)
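Equations (1) and (2) amount to an inverse-frequency weighting, which can be computed directly (a sketch, using the class counts reported in the abstract for the augmented training set):

```python
# Sketch of the class-weight computation in equations (1) and (2):
# each class weight is the total size of the augmented training set
# divided by the number of examples of that class.

def class_weights(n_benign, n_malignant):
    total = n_benign + n_malignant
    return {"benign": total / n_benign, "malignant": total / n_malignant}

# Figures from the paper's augmented training set.
weights = class_weights(n_benign=1626, n_malignant=1870)
print(weights)  # the rarer class receives the larger weight
```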
The classification of skin lesions is performed three times
using one different CNN at a time: AlexNet [19], VGG [20],
ResNet [21]. The reasons for using these and not other
networks are their high-quality classification results in the
ILSVRC competition (presented in Section 3.3) and the
possibility of training them with low-cost computing equipment.
However, the CNN models implemented here are a
modification of the original architectures. In AlexNet and
VGG, the last three fully-connected layers (with 4096, 4096
and 1000 units, respectively) are replaced with one fully-
connected layer containing two neurons only (since the
problem under study only needs two classes). The original
ResNet only contains one fully-connected layer of 1000 units.
Thus, this layer is replaced with another of the same type, but
with just two neurons. For each network, dermatoscopic
images have fixed dimensions and the dropout technique is
applied during the training phase.
The training block is extremely important and determines
how well each CNN performs in the classification step. This
block is performed in the same way for each CNN, except for
two training parameters: number of epochs and number of
subsets of the validation set. The network weights are
initialised by transfer learning because the training set is
small. In the backpropagation step, the
gradient is defined by the Adam optimizer and the calculation
of the gradient is implemented with the Backpropagation
algorithm, in which the cost function is the cross-entropy (3)
since this is the most common cost function used for
classification problems [44].
E(y, p) = − Σ_i y_i log(p_i)                                     (3)

where i indexes the classes (here, 1 or 2, since the classifier is
binary), y_i is a binary indicator of whether class i is the
correct classification for the example, and p_i is the predicted
probability of class i.
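For one example with a one-hot label, the cross-entropy of equation (3) reduces to the negative log of the probability assigned to the true class (an illustrative sketch):

```python
# Sketch of the cross-entropy cost in equation (3) for one example:
# E(y, p) = -sum_i y_i * log(p_i), with y a one-hot label vector
# and p the predicted probability vector.

import math

def cross_entropy(y, p):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

# A confident, correct prediction yields a small loss...
print(round(cross_entropy([1, 0], [0.9, 0.1]), 4))   # 0.1054
# ...while a confident, wrong prediction is penalised heavily.
print(round(cross_entropy([0, 1], [0.9, 0.1]), 4))   # 2.3026
```

This asymmetry is what drives the network's weights towards assigning high probability to the correct class.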
Since using all training examples in each iteration (batch
mode) is computationally slow, the training set is divided into
smaller sets (mini-batch mode). The dimension of each subset
is defined by the batch size. The best configuration of each
CNN is chosen with respect to one of two metrics measured on
the validation set: loss and balanced error. The first metric is
given by the cost function used. The second metric
corresponds to 1 − balanced accuracy, where the balanced
accuracy is the average accuracy over the two classes [51].
5 EXPERIMENTAL RESULTS
5.1. Image Database
The image database belongs to the 2017 edition of the ISIC
Challenge [18]. The dataset contains 2000 training images, 150
validation images and 600 test images. The training set consists
of 374 melanoma images and 1626 non-melanoma images
(of which 254 are seborrheic keratoses).
After data augmentation, the total number of malignant lesion
images is 1870. The validation set is split into 120 benign lesions and
30 melanoma lesions. Finally, the test set is decomposed into
483 benign lesions and 117 melanomas.
For each lesion image, a binary mask (i.e., the segmentation
image) is provided. These masks are essential in this work to
obtain the dimensions of the bounding boxes used in the
cropping step of the pre-processing phase. The masks have the
same dimensions as the associated lesion image. A specialised
doctor performed the segmentation of this set, either manually
or through semi-automated methods.
Besides the original skin lesion images, the dataset
contains the clinical diagnosis of each image, which indicates
whether the lesion is benign (melanocytic nevus or seborrheic
keratosis) or malignant. But the proposed system only
discriminates melanoma from the two types of benign
lesions.
5.2. Performance Metrics
There are four types of results of a binary classifier: true
positive (TP), true negative (TN), false positive (FP), and false
negative (FN).
TP: a positive example is correctly classified as
positive;
TN: a negative example is correctly classified as
negative;
FP: a negative example is classified as positive;
FN: a positive example is classified as negative.
For binary classifiers, there are a few performance
measures, such as sensitivity, specificity, area under the
Receiver Operating Characteristic (ROC) curve and balanced
accuracy (BACC) [52].
Sensitivity (SE) [52], or true positive rate, is the proportion
of correct classifications of positive cases taking into account
all positive cases.
Specificity (SP) [52], also known as a true negative rate,
quantifies the proportion of correct classifications of negative
cases taking into account the entire negative population.
The ROC curve [53] relates graphically the SE and the
probability of false alarm (corresponding to 1 − SP) of the
classification test. The area under the ROC curve (ROC-AUC)
represents the classifier's capacity to distinguish between
classes. The bigger the ROC-AUC value, the better the
classifier. Thus, if ROC-AUC = 1, the model distinguishes
perfectly between the positive and the negative class, whereas
ROC-AUC = 0.5 means the model has no discriminative
capacity: its decisions are equivalent to random guessing.
The balanced accuracy (BACC) expresses the average
accuracy obtained for each of the classes [51]. It is used when
the dataset is imbalanced towards one of the classes.
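The metrics above follow directly from the four outcome counts. As a sketch, the counts below are chosen to be consistent with the test-set sizes (117 melanomas, 483 benign lesions) and ResNet's reported SE and SP; they are illustrative, not taken from the paper:

```python
# Sketch of the performance metrics defined above, computed from the
# four outcome counts of a binary classifier.

def metrics(tp, tn, fp, fn):
    se = tp / (tp + fn)          # sensitivity: fraction of positives found
    sp = tn / (tn + fp)          # specificity: fraction of negatives found
    bacc = (se + sp) / 2.0       # balanced accuracy: average of the two
    return se, sp, bacc

# Illustrative counts consistent with SE = 79% and SP = 60%
# on 117 melanomas and 483 benign lesions.
se, sp, bacc = metrics(tp=92, tn=290, fp=193, fn=25)
print(round(se, 2), round(sp, 2), round(bacc, 2))  # 0.79 0.6 0.69
```

Note that BACC, unlike plain accuracy, is not inflated by the large benign majority of the test set.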
5.3. Description of Experimental Results
5.3.1. CNN Training
Each CNN has its own architecture and training convergence
rate. Therefore, the number of epochs for each CNN is
different: 1000 epochs for AlexNet, 600 for VGG and 500 for
ResNet. The CNN weights were previously trained with the
ImageNet database. In the proposed system, they are initialised
by transfer learning. Each CNN is trained with mini-batch
mode with batches of 46 images.
5.3.2. Hyperparameter Selection
Dropout and learning rate are the two hyperparameters
evaluated for each CNN in the proposed system. Additionally,
the data augmentation technique is analysed only for AlexNet.
For dropout, four probability values are tested: 0%, 30%,
50% and 70%. For the learning rate, a fixed rate of 1x10-4 and
a non-fixed rate (whose value changes every 100 epochs) are
tested.
The best set of hyperparameters is selected based on the
performance of each CNN on the validation set, measured by
two selection metrics: the loss and the balanced error. For
each metric, the network’s performance is evaluated by SE, SP
and BACC. The best set of hyperparameters (i.e., the best
configuration) of each CNN is then selected from the metric
that presents the best performance results.
AlexNet
Figure 3 shows the convergence curves for the best
configuration of AlexNet: dropout=30%, a non-fixed learning
rate and data augmentation.
The classification results of AlexNet with the data
augmentation technique are clearly better than the results
obtained without it. The best configuration with data
augmentation reached SE=70%, SP=75%, BACC=73%, while
without data augmentation the CNN attained SE=20%,
SP=98%, BACC=59%. Thus, AlexNet without data
augmentation is not able to distinguish melanoma from
non-melanoma; moreover, the CNN barely converges and
suffers from overfitting.
In general, the results with a non-fixed learning rate are
better than those with a fixed learning rate: the CNN
converges, overfitting is prevented and the oscillations are
smoother. Thus, the classification of AlexNet is more precise.
Figure 3 – Best configuration of AlexNet on the validation set, given by the
balanced error metric (on the left) and the loss metric (on the right).
Dropout accelerates the convergence of the CNN and also
helps to prevent overfitting. Under a fixed learning rate, the
higher the dropout probability, the better the CNN’s SE. Under
a non-fixed learning rate, the effect of dropout on the
classification results is not significant.
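As an aside, geometric data augmentation of the kind applied to the malign training examples can be sketched in plain Python on a toy image represented as nested lists (an illustration only; the exact set of transforms used in this work is not restated here, and flips/rotations are assumed as typical examples):

```python
def hflip(img):
    """Horizontal flip of an image given as rows of pixels."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]

def augment(img):
    """Generate flipped and rotated copies of one lesion image."""
    r1 = rot90(img)
    r2 = rot90(r1)
    r3 = rot90(r2)
    return [hflip(img), r1, r2, r3]
```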
VGG
Figure 4 shows the convergence curves for the best
configuration of VGG: dropout=70% and a non-fixed learning
rate, with SE=87%, SP=73% and BACC=80%.
The best performance results of VGG are reached by
varying the learning rate, rather than using a fixed value. The
non-fixed learning rate reduces the amplitude of oscillation
between epochs, yields lower values for the convergence
curves and also prevents overfitting.
The VGG configurations with dropout achieved a faster
convergence rate and fewer oscillations (especially in the
loss curve) than the experiments without dropout.
ResNet
Figure 5 shows the convergence curves for the best
configuration of ResNet: dropout=50% and a non-fixed
learning rate, with SE=80%, SP=78% and BACC=79%.
Comparing the performance results of ResNet with a varying
learning rate against those obtained with a fixed learning rate,
the former are slightly better. Varying this hyperparameter
prevents overfitting and accelerates the CNN’s convergence.
The best configurations of ResNet were obtained with
dropout (30% and 50%). Comparing the ResNet experiments
without dropout with those using dropout shows that dropout
makes the network converge faster. Between the two best
configurations, the network with a dropout probability of 50%
oscillates less than the one with 30%, and its convergence rate
is higher.
5.4. Performance Evaluation
After selecting the best configuration for each CNN, each
network is tested using the test set. Then, the CNN that best
suits the classification between melanoma and non-melanoma
is selected.
The classification results for each CNN are described
Tables 1, 2 and 3. Figure 6 illustrates the ROC curve in the test
set for each network.
ResNet is the best CNN for melanoma detection. It presents
the best classification results, SE=79%, SP=60%, BACC=70%
and ROC-AUC=77% (Table 3). For the malign test set, this
network identified 93 lesions as malign and misclassified the
remaining 24 lesions as benign. VGG reached SE=78%,
SP=59%, BACC=68% and ROC-AUC=75% (Table 2).
AlexNet is the poorest of the three: it attained SE=56%,
SP=70%, BACC=63% and ROC-AUC=68% (Table 1), and its
capacity to distinguish melanomas from non-melanomas (SE)
lies below 60%.
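As a sanity check, the reported sensitivities are consistent with the stated counts of malign test lesions (93 of 117 detected by ResNet; 66 of 117 by AlexNet, as noted in the conclusion):

```python
# Sensitivity = correctly classified malign lesions / all malign lesions.
resnet_se = 93 / 117    # ResNet: 93 of 117 melanomas detected
alexnet_se = 66 / 117   # AlexNet: 66 of 117 melanomas detected
print(round(resnet_se, 2), round(alexnet_se, 2))  # 0.79 0.56
```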
Figure 4 – Best configuration of VGG on the validation set, given by the
balanced error metric (on the left) and the loss metric (on the right).
Figure 5 – Best configuration of ResNet on the validation set, given by the
balanced error metric (on the left) and the loss metric (on the right).
Figure 6 – ROC Curves of AlexNet, VGG and ResNet, respectively.
Table 1 – Classification results of the best AlexNet configuration.
Table 2 – Classification results of the best VGG configuration.
Table 3 – Classification results of the best ResNet configuration.
Figures 7 and 8 show three images of malign lesions and
three of benign lesions (after the application of the bounding
box for each image), respectively: on the left side, there are
lesions correctly classified by ResNet, on the right side there
are the misclassifications performed by ResNet.
6 CONCLUSION AND FUTURE WORK
This section summarizes the main achievements and highlights
the main conclusions of this work. Also, it presents some
suggestions for future work.
The main purpose of this work was to develop a system for
automatic melanoma detection based on convolutional neural
networks (CNNs). The proposed system consisted of two
principal phases: a training phase and a classification phase.
In both, the following three CNNs were used, since they
require only low-cost computing equipment: AlexNet, VGG
and ResNet.
The performance of each CNN was discussed and the best
CNN was selected. The classification results were categorised
by four well-known performance measures: sensitivity (SE),
specificity (SP), balanced accuracy (BACC) and area under the
ROC curve (ROC-AUC).
ResNet reached the best classification results, attaining
SE=79%, SP=60%, BACC=70% and ROC-AUC=77%. This
CNN has the most complex architecture, yet it needs fewer
epochs to converge than VGG and AlexNet. VGG was the
network that required the most computational memory and,
consequently, the most training time. AlexNet’s architecture is
the simplest one; it is therefore the most suitable CNN when
computational memory and time are very restricted. However,
this network needs a higher number of epochs to converge and
its classification results are weaker: from a set of 117 malign
lesions, it classified 66 correctly and misclassified the
remaining 51 as benign lesions.
The selection metrics evaluated on the validation set to
choose the best configuration of each CNN were the loss and
the balanced error. Although the comparison of these two
metrics was not an object of study in this work, the balanced
error proved more robust and consistent. The best loss results
occurred mostly at the end of the training phase; hence, those
results might be inflated by overfitting, since this phenomenon
occurs in the last training epochs.
The image database was limited and imbalanced, since
there were very few images of malignant skin lesions. For these
two reasons, the data augmentation technique was applied to
the training set of malignant lesions.
The results obtained in this work show that it is in fact
possible to automatically distinguish melanoma from
non-melanoma skin lesions. However, there is still room for
better results; some suggestions are therefore presented as
future work.
First, it would be worthwhile to implement the proposed
system using one or more GPUs in order to accelerate the
CNNs’ training, which is time-consuming when tuning the
CNN hyperparameters. In addition, it would be very interesting
either to use more complex CNNs with better classification
results in image analysis problems or to modify CNNs
implemented in previous studies of automatic melanoma
detection.
It would also be interesting to explore other ways to enhance
the hyperparameter optimisation. Tuning other
hyperparameters could be beneficial, since it might boost the
sensitivity of the CNN and thus its melanoma detection
performance. The activation function, the batch size and the
weight initialisation are a few examples.
On the other hand, it would be useful to evaluate the impact
on the classification results of using three classes rather than
two, since the database still contained a relevant number of
seborrheic keratoses.
It would also be very useful to explore image databases that
contain a huge number of different dermatologic images and
are preferably balanced.
Finally, it would be interesting to study the subject of
selection metrics in depth: on the one hand, to understand
which selection metric is the most suitable for choosing the
best CNN configuration; on the other, to explore whether other
selection metrics exist.
Figure 7 – Melanoma examples: on the left, correctly classified; on the right,
misclassified as benign lesions.
Figure 8 – Non-melanoma examples: on the left, correctly classified; on the
right, misclassified as malign lesions.
7 REFERENCES
[1] N. C. F. Codella, Q.-B. Nguyen, S. Pankanti, D. A. Gutman,
B. Helba, A. C. Halpern, and J. R. Smith, “Deep Learning
ensembles for melanoma recognition in dermoscopy images,”
IBM Journal of Research and Development, vol. 61, no. 4/5,
pp. 5:1–5:15, jul 2017.
[2] International Agency for Research on Cancer (IARC). (2018)
Incidence, mortality and prevalence by cancer site (world).
[Online]. Available: https://gco.iarc.fr/today/data/factsheets/populations/840-united-
states-of-america-fact-sheets.pdf
[3] K. Korotkov, and R. Garcia, “Computerized analysis of
pigmented skin lesions: A review,” Artificial Intelligence in
Medicine, vol. 52, no. 2, pp. 69–90, oct 2012.
[4] Y. Li and L. Shen, “Skin Lesion Analysis towards
Melanoma Detection Using Deep Learning Network,” arXiv
preprint arXiv:1703.00577, vol. abs/1703.00577, feb 2018.
[5] S. Pathan, K. G. Prabhu, and P. C. Siddalingaswamy,
“Techniques and algorithms for computer aided diagnosis of
pigmented skin lesions – a review,” Biomedical Signal
Processing and Control, vol. 39, pp. 237–262, jan 2018.
[6] A. Victor and M. Ghalib, “Automatic detection and
classification of skin cancer,” International Journal of
Intelligent Engineering and Systems, vol. 10, no. 3, pp. 444-
451, jun 2017.
[7] American Cancer Society, Cancer Facts & Figures 2018, 2018. [Online].
Available: https://www.cancer.org/content/dam/cancer-
org/research/cancer-facts-and-statistics/annual-cancer-facts-and-
figures/2018/cancer-facts-and-figures-2018.pdf
[8] M. E. Vestergaard, P. Macaskill, P. E. Holt, and S. W.
Menzies, “Dermoscopy compared with naked eye examination
for the diagnosis of primary melanoma: a meta-analysis of
studies performed in a clinical setting,” British Journal of
Dermatology, pp. 669-676, jun 2018.
[9] M. E. Celebi, Hassan A. Kingravi, B. Uddin, H. Iyatomi,
Y. A. Aslandogan, W. V. Stoecker, R. H. Moss, “A
methodological approach to the classification of dermoscopy
images,” Computerized Medical Imaging and Graphics, vol.
31, no. 6, pp. 362-373, sep 2017.
[10] J. F. Alcon, C. Ciuhu, W. Kate, A. Heinrich, N.
Uzunbajakava, G. Krekels, D. Siem, and G. Haan, “Automatic
Imaging System with Decision Support for Inspection of
Pigmented Skin Lesions and Melanoma Diagnosis,” IEEE Journal
of Selected Topics in Signal Processing, Institute of Electrical and
Electronics Engineers (IEEE), vol. 3, no. 1, pp. 14-25, feb 2009.
[11] M. E. Celebi, T. Mendonça, and J. S. Marques,
“Dermoscopy Image Analysis,” Taylor & Francis, 2015.
[12] J. Jaworek-Korjakowska, and P. Kleczek “Automatic
Classification of Specific Melanocytic Lesions using Artificial
Intelligence,” BioMed Research International, Hindawi
Limited, pp. 1-17, jun 2016.
[13] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H.
M. Blau, and S. Thrun, “Dermatologist-level classification of skin
cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp.
115-118, jan 2017.
[14] I. Maglogiannis, and C. N. Doukas, ”Overview of
Advanced Computer Vision Systems for Skin Lesions
Characterization,” IEEE Transactions on Information
Technology in Biomedicine, Institute of Electrical and
Electronics Engineers (IEEE), vol. 13, pp. 721-733, sep 2009.
[15] S. Dreiseitl, L. Ohno-Machado, H. Kittler, S. Vinterbo, H.
Billhardt, and M. Binder, “A Comparison of Machine Learning
Methods for the Diagnosis of Pigmented Skin Lesions,”
Journal of Biomedical Informatics, Elsevier BV, vol. 34, no. 1,
pp. 28-36, feb 2001.
[16] A. Masood, and A. A. Al-Jumaily, “Computer Aided
Diagnostic Support System for Skin Cancer: A Review of
Techniques and Algorithms,” International Journal of
Biomedical Imaging, pp. 1-22, 2013.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C.
Berg, and L. Fei-Fei, “ImageNet Large Scale Visual
Recognition Challenge,” arXiv preprint arXiv:1409.0575, vol.
abs/1409.0575, 2014.
[18] N. C. F. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A.
Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H.
Kittler, and A. Halpern, “Skin lesion analysis toward melanoma
detection: A challenge at the international symposium on
biomedical imaging (ISBI) 2016, hosted by the international skin
imaging collaboration (ISIC),” arXiv preprint arXiv:1605.01397,
vol. abs/1605.01397, 2016
[19] A.Krizhevsky, I. Sutskever, and G.E. Hinton, “ImageNet
Classification with Deep Convolutional Neural Networks,” in
Advances in Neural Information Processing Systems 25, F.
Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.
Curran Associates, Inc., 2012, pp. 1097–1105, 2012, ILSVRC
2012 winner.
[20] K. Simonyan, and A. Zisserman, “Very Deep Convolutional
Networks for Large-Scale Image Recognition,” arXiv preprint
arXiv:1409.1556, vol. abs/1409.1556, 2014, ILSVRC 2014.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep
Residual Learning for Image Recognition,” arXiv preprint
arXiv:1512.03385, vol. abs/1512.03385, 2015, ILSVRC 2015
winner.
[22] D. Trichopoulos, F. P. Li, and D. J. Hunter, “What causes
cancer?,” Scientific American, vol. 275, no. 3, pp. 80–87, sep
1996.
[23] O. Abuzaghleh, B. D. Barkana, and M. Faezipour,
“Noninvasive real-time automated skin lesion analysis
system for melanoma early detection and prevention,” IEEE
Journal of Translational Engineering in Health and Medicine,
vol. 3, pp. 1–12, 2015.
[24] Dermoscopy Tutorial. DS ME S.r.l. [Online]. Available: http://www.dermoscopy.org
[25] M.-L. Bafounta, A. Beauchet, P. Aegerter, and P. Saiag,
“Is dermoscopy (epiluminescence microscopy) useful for the
diagnosis of melanoma?,” Archives of Dermatology, vol. 137,
no. 10, oct 2001.
[26] G. Argenziano, G. Fabbrocini, P. Carli, V. D. Giorgi, E.
Sammarco, and M. Delfino, “Epiluminescence microscopy for
the diagnosis of doubtful melanocytic skin lesions,” Archives
of Dermatology, vol. 134, pp. 1563–1570, dec 1998.
[27] J. Baleiras, “Detecção de Melanomas usando Métodos de
Aprendizagem Profunda,” MSC Thesis, Instituto Superior
Técnico, pp. 11–14, jun 2019.
[28] T. Mendonca, P. M. Ferreira, J. S. Marques, A. R. S.
Marcal, and J. Rozeira, “PH2 - a dermoscopic image database
for research and benchmarking,” in 2013 35th Annual
International Conference of the IEEE Engineering in Medicine
and Biology Society (EMBC), IEEE, jul 2013, pp. 5437–5440.
[29] G. D. Leo, A. Paolillo, P. Sommella, and G. Fabbrocini,
“Automatic diagnosis of melanoma: A software system based
on the 7-point check-list,” in 2010 43rd Hawaii International
Conference on System Sciences, IEEE, jan 2010, pp. 1-10.
[30] A. Masood and A. A. Al-Jumaily, “Computer aided
diagnostic support system for skin cancer: A review of
techniques and algorithms,” International Journal of
Biomedical Imaging, pp. 1–22, 2013.
[31] O. Ronneberger, P. Fischer, and T. Brox, “U-net:
Convolutional networks for biomedical image segmentation,”
Computer Science Department and BIOSS Centre for
Biological Signalling Studies, University of Freiburg,
Germany, 2015.
[32] R. Muñoz and A. Montoyo, “Advances on natural
language processing,” Data & Knowledge Engineering, vol.
61, pp. 403–405, jun 2007.
[33] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov,
“Scalable object detection using deep neural networks,” in
2014 IEEE Conference on Computer Vision and Pattern
Recognition, IEEE, jun 2014, pp. 2155-2162.
[34] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi,
M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I.
Sánchez, “A survey on deep learning in medical image analysis,”
Medical Image Analysis, vol. 42, pp. 60–88, dec 2017.
[35] H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T.
Buhl, A. Blum, A. Kalloo, A. B. H. Hassen, L. Thomas, A. Enk,
and L. Uhlmann, “Man against machine: diagnostic performance
of a deep learning convolutional neural network for dermoscopic
melanoma recognition in comparison to 58 dermatologists,”
Annals of Oncology, Oxford University Press (OUP) vol. 29, pp.
1836-1842, may 2018.
[36] W. S. McCulloch and W. Pitts, “A logical calculus of the
ideas immanent in nervous activity,” The Bulletin of Mathematical
Biophysics, vol. 5, pp. 115–133, dec 1943.
[37] C. V. D. Malsburg, “Frank rosenblatt: Principles of
neurodynamics: Perceptrons and the theory of brain
mechanisms,” in Brain Theory, Springer Berlin Heidelberg, 1986,
pp. 245–248.
[38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams,
“Learning internal representations by error propagation,” in
Parallel distributed processing: Explorations in the
microstructure of cognition, vol. 1, MIT Press, pdp research
group ed., 1986, vol. 1.
[39] R. Hecht-Nielsen, “Counterpropagation networks,”
Applied Optics, vol. 26, no. 23, p. 4979, dec 1987.
[40] D. Svozil, V. Kvasnicka, and J. Pospichal, “Introduction
to multi-layer feed-forward neural networks,” Chemometrics
and Intelligent Laboratory Systems, vol. 39, pp. 43–62, nov
1997.
[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” 3rd International Conference for Learning
Representations, San Diego, 2015
[42] G. Thimm and E. Fiesler, “High-order and multilayer
perceptron initialization,” IEEE Transactions on Neural
Networks, vol. 8, pp. 349–359, mar 1997.
[43] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How
transferable are features in deep neural networks?,” in
Advances in Neural Information Processing Systems 27, Z.
Ghahramani, M. Welling, C. Cortes, N. D.Lawrence, and K. Q.
Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3320–
3328.
[44] S. Scardapane, D. Comminiello, A. Hussain, and A.
Uncini, “Group sparse regularization for deep neural
networks,” Neurocomputing, vol. 241, pp. 81–89, jun 2017.
[45] M. D. Zeiler and R. Fergus, “Stochastic pooling for
regularization of deep convolutional neural networks,” in Proceedings of the International Conference on Learning
Representation (ICLR), 2013.
[46] D. E. Rumelhart, R. Durbin, R. Golden, and Y. Chauvin,
Developments in connectionist theory. Backpropagation:
Theory, architectures, and applications. Y. Chauvin and D. E.
Rumelhart, Eds. Lawrence Erlbaum Associates, 1995.
[47] I. Goodfellow, Y. Bengio, and A. Courville, Deep
Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[48] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature, vol. 521, no. 7553, pp. 436–444, may 2015.
[49] D. Yu, H. Wang, P. Chen, and Z. Wei, “Mixed pooling for
convolutional neural networks,” in Rough Sets and Knowledge
Technology. Springer International Publishing, 2014, vol. 8818,
pp. 364–375.
[50] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnel,
“Understanding data augmentation for classification: When to
warp?” in 2016 International Conference on Digital Image
Computing: Techniques and Applications (DICTA),” nov 2016,
pp. 1-6.
[51] K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M.
Buhmann, “The balanced accuracy and its posterior
distribution,” in 2010 20th International Conference on
Pattern Recognition, IEEE, aug 2010, pp. 3121-3124.
[52] W. Zhu, N. Zeng, and N. Wang, “Sensitivity, specificity,
accuracy, associated confidence interval and roc analysis with
practical sas R implementations,” NorthEast SAS Users
Group, Health Care and Life Sciences, 2010. [Online].
Available: https://www.lexjansen.com/nesug/nesug10/hl/hl07.pdf
[53] C. E. Metz, “Basic principles of ROC analysis,” Seminars
in Nuclear Medicine, vol. 8, no. 4, pp. 283–298, oct 1978.