Comparison of different classification algorithms in clinical decision-making
Elif Derya Ubeyli
Department of Electrical and Electronics Engineering, Faculty of Engineering, TOBB Ekonomi ve Teknoloji Universitesi, 06530 Sogutozu, Ankara, Turkey
E-mail: [email protected]
Abstract: This paper gives an integrated view of implementing automated diagnostic systems for clinical
decision-making. Because of the importance of making the right decision, better classification procedures are
necessary for clinical decisions. The major objective of the paper is to be a guide for readers who want to develop
an automated decision support system for clinical practice. The purpose was to determine an optimum
classification scheme with high diagnostic accuracy for this problem. Several different classification algorithms
were tested and benchmarked for their performance. The performance of the classification algorithms is
illustrated on two data sets: the Pima Indians diabetes and the Wisconsin breast cancer. The present research
demonstrates that the support vector machines achieved diagnostic accuracies which were higher than those of
other automated diagnostic systems.
Keywords: classification algorithms, automated diagnostic systems, clinical decision-making, diagnostic accuracy
1. Introduction
Artificial neural networks (ANNs) are valuable
tools in the medical field for the development of
decision support systems. Important tools in
modern decision-making, in any field, include
those that allow the decision-maker to assign an
object to an appropriate group or classification.
Clinical decision-making is a challenging, multi-
faceted process. Its goals are precision in diag-
nosis and institution of efficacious treatment.
Achieving these objectives involves access to
pertinent data and application of previous
knowledge to the analysis of new data in order
to recognize patterns and relations. Practi-
tioners apply various statistical techniques in
processing data to assist in clinical decision-
making and to facilitate the management of
patients. As the volume and complexity of data
have increased, use of digital computers to
support data analysis has become a necessity.
In addition to computerization of standard
statistical analysis, several other techniques for
computer-aided data classification and reduc-
tion, generally referred to as ANNs, have
evolved (Miller et al., 1992; Itchhaporia et al.,
1996; Tafeit & Reibnegger, 1999; Basheer &
Hajmeer, 2000).
On analysing recent developments, it becomes
clear that the trend is to develop new methods
for computer decision-making in medicine and
to evaluate these methods critically in clinical
practice. ANNs have been used in different
medical diagnoses and the results have been
compared with physicians’ diagnoses and exist-
ing classification methods (Setiono, 1996, 2000;
Shanker, 1996; Lim et al., 1997; Mobley et al.,
2000; West & West, 2000; Park & Edington,
© 2007 The Author. Journal Compilation © 2007 Blackwell Publishing Ltd. Expert Systems, February 2007, Vol. 24, No. 1
2001; Guler & Ubeyli, 2003; Ubeyli & Guler,
2003). Many of these researchers found that
ANNs have more flexibility in modelling and
reasonable accuracy in prediction. What makes
neural networks a promising tool is their capa-
city to find near-optimum solutions from limited
or incomplete data sets and the fact that learn-
ing is accomplished through training. In addi-
tion to these characteristics, it has been shown
that neural networks can combine data of a
different nature in one system, such as data
derived from clinical protocols, laboratory data
obtained from measurements and features from
signals and images, thus forming an inte-
grated diagnostic system (Miller et al., 1992;
Itchhaporia et al., 1996; Tafeit & Reibnegger,
1999; Basheer & Hajmeer, 2000).
The main concept of medical technology is an
inductive engine that learns the decision char-
acteristics of the diseases and can then be used to
diagnose future patients with uncertain disease
states. A number of quantitative models in-
cluding multilayer perceptron neural networks
(MLPNNs), combined neural networks
(CNNs), mixture of experts (ME), probabilistic
neural networks (PNNs), recurrent neural net-
works (RNNs) and support vector machines
(SVMs) are being used in medical diagnostic
support systems to assist human decision-
makers in disease diagnosis (Kordylewski et al.,
2001; Kwak & Choi, 2002; Ubeyli & Guler,
2005). Unfortunately, there is no theory avail-
able to guide an intelligent choice of model
based on the complexity of the diagnostic task.
In most situations, developers are simply pick-
ing a single model that yields satisfactory re-
sults, or they are benchmarking a small subset of
models with cross-validation estimates on test
sets (Kordylewski et al., 2001; Kwak & Choi,
2002; Ubeyli & Guler, 2005). Figure 1 shows the
various stages followed for the design of a
classification system. As is apparent from the
feedback arrows, these stages are not indepen-
dent. On the contrary, they are interrelated and,
depending on the results, one may go back to
redesign earlier stages in order to improve the
overall performance. Feedforward neural net-
works are a basic type of ANNs capable of
approximating generic classes of functions, in-
cluding continuous and integrable functions. An
important class of feedforward neural networks
is MLPNNs. The MLPNNs have features such
as the ability to learn and generalize, smaller
training set requirements, fast operation and
ease of implementation, and therefore they are
the most commonly used neural network archi-
tectures (Haykin, 1994; Basheer & Hajmeer,
2000; Chaudhuri & Bhattacharya, 2000). The
economic and social costs of diabetes and
breast cancer are very high. As a result, these
problems have attracted many researchers in the
area of computational intelligence recently
(Setiono, 1996, 2000; Shanker, 1996; West &
West, 2000; Park & Edington, 2001). They
managed to achieve significant results by using
different classifiers. In this paper, a comparison
of several different classifiers, which were
trained on the attributes of each record in the
diabetes and breast cancer databases, is consid-
ered. Among these are the CNN, ME, PNN,
RNN and SVM. The performances of these
systems were then compared with that of the
MLPNN. Significant improvement in accuracy
was achieved by using the SVM compared to the
other classification algorithms.
2. Database overview
Diabetes is a metabolic disease in which there is
a deficiency or absence of insulin secretion by
Figure 1: The basic stages involved in the design of a classification system (patterns pass through a sensor, feature extraction, feature selection, classifier design and system evaluation, with feedback between stages).
the pancreas. Diabetes occurs in two major
forms: type I, or insulin-dependent diabetes,
and type II, or non-insulin-dependent diabetes.
Most studies of the age of onset of type I
diabetes have been restricted to children and
adults under 30 years old. The age distributions
in females and males show small differences
which have been most clearly demonstrated in
surveys in children and young adults. Both
genders are affected, but in many communities
the majority with type II diabetes are female.
The prevalence of type II diabetes rises with
increasing age and in many populations the
majority of those with diabetes are either
middle-aged or elderly. Diabetes may be consid-
ered as a disorder of the metabolic disposal of
food. The interaction of food and diabetic state
must be assessed from two aspects: first, whether
food precipitates the diabetic condition and,
second, the type of food that is most appropriate
for the person with established diabetes,
whether it is insulin-dependent or non-insulin-
dependent. Diabetes shows considerable famil-
ial aggregation which may result from either the
inheritance of disease susceptibility or the shar-
ing of a common environment by members of
the same family. Therefore, it is important to
determine the existence of diabetes in families of
the subjects. The occurrence of gestational dia-
betes, which either first appears or is first recog-
nized during pregnancy, is associated with
increased risk for the development of diabetes
in subsequent years. Thus, women who experi-
ence gestational diabetes should be considered
a high risk group for the development of type II
diabetes (Besser et al., 1988). In this study,
the Pima Indians diabetes database (http://
www.cormactech.com/neunet) was analysed.
The data consist of 768 records and according
to the examination results 268 of them are
diabetics and the rest of them are non-diabetics.
Each record has eight attributes and these are
detailed in Table 1. Eight independent input
parameters, essentially risk factors for diabetes,
are incorporated in the classifiers.
Breast cancer is a malignant tumour that has
developed from cells of the breast. Although
scientists know some of the risk factors (i.e.
ageing, genetic risk factors, family history, men-
strual periods, not having children, obesity) that
increase a woman’s chance of developing breast
cancer, they do not yet know what causes most
breast cancers or exactly how some of these risk
factors cause cells to become cancerous. Re-
search is under way to learn more and scientists
are making great progress in understanding
how certain changes in DNA can cause normal
breast cells to become cancerous (Jerez-
Aragones et al., 2003). In this study, the Wiscon-
sin breast cancer database taken from fine
needle aspirates from human breast tissue
was analysed. The aspirates were collected by
Wolberg and Mangasarian (1990) at the Uni-
versity of Wisconsin-Madison Hospitals. The
data consist of 683 records of visually assessed
nuclear features of fine needle aspirates taken
from patients’ breasts. Each record in the data-
base has nine attributes. The nine attributes
Table 1: Pima Indians diabetes database: description of attributes

Attribute number   Attribute description                                                   Mean    Standard deviation
1                  Number of times pregnant                                                3.8     3.4
2                  Plasma glucose concentration at 2 h in an oral glucose tolerance test   120.9   32.0
3                  Diastolic blood pressure (mmHg)                                         69.1    19.4
4                  Triceps skin fold thickness (mm)                                        20.5    16.0
5                  2-hour serum insulin (mU/ml)                                            79.8    115.2
6                  Body mass index (weight in kg/(height in m)^2)                          32.0    7.9
7                  Diabetes pedigree function                                              0.5     0.3
8                  Age (years)                                                             33.2    11.8

N = 768 observations, 268 diabetics and 500 non-diabetics.
detailed in Table 2 are graded on an interval
scale from a normal state of 1 to 10, with 10
being the most abnormal state. There are 239
malignant cases and 444 benign cases. A malig-
nant label is confirmed by performing a biopsy
on the breast tissue. Either a biopsy or a periodic
examination is used to confirm a benign label.
3. Brief review of different automated
diagnostic systems
3.1. CNN
The CNN models often result in a prediction
accuracy that is higher than that of the indivi-
dual models. This construction is based on a
straightforward approach that has been termed
stacked generalization (Figure 2). Training data
that are difficult to learn usually demonstrate
high dispersion in the search space due to the
inability of the low-level measurement attributes
to describe the concept concisely. Because of the
complex interactions among variables and the
high degree of noise and fluctuations, a signifi-
cant number of data used for applications are
naturally available in representations that are
difficult to learn. The degree of difficulty in
training a neural network is inherent in the given
set of training examples. By developing a tech-
nique for measuring this learning difficulty, a
feature construction methodology is devised
that transforms the training data and attempts
to improve both the classification accuracy and
computational times of ANN algorithms. The
fundamental notion is to organize data by
intelligent preprocessing, so that learning is
facilitated (Wolpert, 1992; Guler & Ubeyli,
2005a). The stacked generalization concepts
formalized by Wolpert (1992) predate these
ideas and refer to schemes for feeding informa-
tion from one set of generalizers to another
before forming the final predicted value (out-
put). The unique contribution of stacked gener-
alization is that the information fed into the net
of generalizers comes from multiple partition-
ings of the original learning set. The stacked
generalization scheme can be viewed as a more
sophisticated version of cross-validation and
has been shown experimentally to effectively
improve the generalization ability of ANN
models over individual neural networks. The
MLPNNs were used at the first level and second
level for the implementation of the CNNs pro-
posed in this study.
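The stacked-generalization scheme described above can be sketched as follows. This is a minimal illustration only: the data are synthetic stand-ins for the database records, and simple logistic models trained by gradient descent replace the MLPNNs used in the study.

```python
# Sketch of stacked generalization (Wolpert, 1992): first-level models
# are trained on different partitionings of the learning set, and a
# second-level model is trained on their outputs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # 8 attributes per record
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # synthetic binary labels

def train_logistic(X, y, epochs=500, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)       # gradient of log loss
    return w

def predict(w, X):
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# First level: two base learners trained on different halves of the
# learning set (the "multiple partitionings").
w1 = train_logistic(X[:100], y[:100])
w2 = train_logistic(X[100:], y[100:])

# Second level: a combiner trained on the first-level outputs
# (a bias column is added so the combiner can shift its threshold).
meta_X = np.column_stack([np.ones(len(y)), predict(w1, X), predict(w2, X)])
w_meta = train_logistic(meta_X, y)

stacked = (predict(w_meta, meta_X) > 0.5).astype(float)
print((stacked == y).mean())                   # training accuracy of the stack
```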
3.2. ME
The ME architecture is composed of a gating
network and several expert networks (Figure 3).
The gating network receives the vector x as
input and produces scalar outputs that are a
partition of unity at each point in the input
space. Each expert network produces an output
vector for an input vector. The gating network
provides linear combination coefficients as ver-
idical probabilities for expert networks, and
therefore the final output of the ME architecture
Table 2: Wisconsin breast cancer database: description of attributes

Attribute number   Attribute description          Minimum   Maximum   Mean   Standard deviation
1                  Clump thickness                1         10        4.44   2.82
2                  Uniformity of cell size        1         10        3.15   3.07
3                  Uniformity of cell shape       1         10        3.22   2.99
4                  Marginal adhesion              1         10        2.83   2.86
5                  Single epithelial cell size    1         10        3.23   2.22
6                  Bare nuclei                    1         10        3.54   3.64
7                  Bland chromatin                1         10        3.45   2.45
8                  Normal nucleoli                1         10        2.87   3.05
9                  Mitoses                        1         10        1.60   1.73

N = 683 observations, 239 malignant and 444 benign.
is a convex weighted sum of all the output
vectors produced by the expert networks.
Suppose that there are N expert networks in the
ME architecture. All the expert networks are
linear with a single output non-linearity that is
also referred to as ‘generalized linear’. The ith
expert network produces its output oi(x) as
a generalized linear function of the input x
(Jacobs et al., 1991; Chen et al., 1999; Hong &
Harris, 2002):
o_i(x) = f(W_i x)                    (1)

where W_i is a weight matrix and f(·) is a fixed continuous non-linearity. The gating network is also a generalized linear function, and its ith output, g(x; v_i), is the multinomial logit or
Figure 2: CNN architecture (a first-level and a second-level multilayer perceptron neural network in a stacked arrangement; hidden layer neurons h = 1, 2, ..., k and output layer neurons o = 1, 2, ..., j).
Figure 3: Architecture of the ME (expert networks 1, ..., N and a gating network, each receiving the input x, combine to produce the output o(x)).
softmax function of the intermediate variables ξ_i:

g(x; v_i) = exp(ξ_i) / Σ_{k=1}^{N} exp(ξ_k)                    (2)

where ξ_i = v_i^T x and v_i is a weight vector. The
overall output o(x) of the ME architecture is
o(x) = Σ_{k=1}^{N} g(x; v_k) o_k(x)                    (3)
The ME architecture can be given a probabil-
istic interpretation. For an input–output pair (x,
y), the values of g(x; v_i) are interpreted as the
multinomial probabilities associated with the
decision that terminates in a regressive process
that maps x to y. Once the decision has been
made, resulting in a choice of regressive process
i, the output y is then chosen from a probability
density P(y | x, W_i), where W_i denotes the set of
parameters or weight matrix of the ith expert
network in the model. Therefore, the total
probability of generating y from x is a mixture
of the probabilities of generating y from each
component density, where the mixing propor-
tions are multinomial probabilities:
P(y | x, Φ) = Σ_{k=1}^{N} g(x; v_k) P(y | x, W_k)                    (4)
where Φ is the set of all the parameters including
both expert and gating network parameters.
Based on the probabilistic model, learning in
the ME architecture is treated as a maximum
likelihood problem. Jordan and Jacobs (1994)
have proposed an expectation maximization
(EM) algorithm for adjusting the parameters of
the architecture. In this framework a number of
relatively small expert networks can be used
together with a gating network designed to
divide the global classification task into simpler
subtasks (Guler & Ubeyli, 2005b).
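The forward pass of equations (1)-(3) can be sketched directly. The dimensions, weight values and choice of the logistic function for f are illustrative assumptions, not values from the study.

```python
# Sketch of the ME forward pass: generalized-linear experts
# o_i(x) = f(W_i x) and a softmax gating network g(x; v_i) whose
# outputs form a partition of unity, giving a convex combination.
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, n_out = 8, 3, 2                 # input dim, experts, output dim
W = rng.normal(size=(n_experts, n_out, d))    # expert weight matrices W_i
V = rng.normal(size=(n_experts, d))           # gating weight vectors v_i

def me_output(x):
    # Expert outputs, eq. (1); here f is the logistic non-linearity.
    o = 1.0 / (1.0 + np.exp(-(W @ x)))        # shape (n_experts, n_out)
    # Gating, eq. (2): softmax over xi_i = v_i^T x (sums to one).
    xi = V @ x
    g = np.exp(xi - xi.max())
    g /= g.sum()
    # Overall output, eq. (3): convex weighted sum of expert outputs.
    return g @ o

x = rng.normal(size=d)
print(me_output(x))
```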
3.3. PNN
The PNN was first proposed by Specht (1990).
A single PNN is capable of handling multiclass
problems. This is opposite to the so-called one-
against-the-rest or one-per-class approach taken
by some classifiers, such as the SVM, which
decompose a multiclass classification problem
into dichotomies and each dichotomizer has to
separate a single class from all others. The
architecture of a typical PNN is shown in Figure
4. The PNN architecture is composed of many
interconnected processing units or neurons or-
ganized in successive layers. The input layer unit
does not perform any computation and simply
distributes the input to the neurons in the
pattern layer. On receiving a pattern x from the
input layer, the neuron xij of the pattern layer
computes its output:
f_ij(x) = [1 / ((2π)^{d/2} σ^d)] exp[ -(x - x_ij)^T (x - x_ij) / (2σ^2) ]                    (5)
where d denotes the dimension of the pattern
vector x, σ is the smoothing parameter and x_ij is
the neuron vector. The summation layer neu-
rons compute the maximum likelihood of pat-
tern x being classified into Ci by summing and
averaging the output of all neurons that belong
to the same class:
p_i(x) = [1 / ((2π)^{d/2} σ^d)] (1/N_i) Σ_{j=1}^{N_i} exp[ -(x - x_ij)^T (x - x_ij) / (2σ^2) ]                    (6)
where N_i denotes the total number of samples in
class Ci. If the a priori probabilities for each
class are the same, and the losses associated with
making an incorrect decision for each class are
the same, the decision layer unit classifies the
pattern x in accordance with Bayes’s decision
rule based on the output of all the summation
layer neurons:
C(x) = arg max_i [ p_i(x) ],   i = 1, 2, ..., m                    (7)
where C(x) denotes the estimated class of the
pattern x and m is the total number of classes in
the training samples (Specht, 1990; Burrascano,
1991).
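Equations (5)-(7) translate into a short classifier under the stated assumptions of equal priors and equal losses. The two-dimensional synthetic classes below are illustrative stand-ins for real training samples.

```python
# Sketch of the PNN layers: each class density p_i(x) averages Gaussian
# kernels centred on that class's training samples (eqs (5)-(6)), and
# the decision layer takes the argmax (Bayes rule, eq. (7)).
import numpy as np

def pnn_classify(x, train_by_class, sigma):
    d = x.shape[0]
    norm = (2 * np.pi) ** (d / 2) * sigma ** d
    scores = []
    for samples in train_by_class:            # samples: (N_i, d) for class C_i
        diff = samples - x
        sq = np.sum(diff * diff, axis=1)      # (x - x_ij)^T (x - x_ij)
        scores.append(np.exp(-sq / (2 * sigma ** 2)).mean() / norm)
    return int(np.argmax(scores))             # estimated class C(x)

rng = np.random.default_rng(2)
class0 = rng.normal(loc=0.0, size=(50, 2))    # samples around the origin
class1 = rng.normal(loc=3.0, size=(50, 2))    # samples around (3, 3)
print(pnn_classify(np.array([2.9, 3.1]), [class0, class1], sigma=0.5))
```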
3.4. RNN
RNNs have been used in a number of interesting
applications including associative memories,
spatiotemporal pattern classification, control,
optimization, forecasting and generalization of
pattern sequences (Petrosian et al., 2000; Shieh
et al., 2004). Fully recurrent networks use un-
constrained fully interconnected architectures
and learning algorithms that can deal with
time-varying input and/or output in non-trivial
ways. In spite of several modifications of learn-
ing algorithms to reduce the computational
expense, fully recurrent networks are still com-
plicated when dealing with complex problems.
Therefore, partially recurrent networks, whose
connections are mainly feedforward, are used
but they include a carefully chosen set of feed-
back connections. The recurrence allows the
network to remember cues from the past with-
out complicating the learning excessively. The
structure proposed by Elman (1990) is an illus-
tration of this kind of architecture. An Elman
RNN was used in this application and therefore
in the following the Elman RNN is presented.
An Elman RNN is a network which in princi-
ple is set up as a regular feedforward network.
This means that all neurons in one layer are
connected with all neurons in the next layer. An
exception is the so-called context layer which is a
special case of a hidden layer. Figure 5 shows the
architecture of an Elman RNN. The neurons in
the context layer (context neurons) hold a copy
of the output of the hidden neurons. The output
of each hidden neuron is copied into a specific
neuron in the context layer. The value of the
context neuron is used as an extra input signal
for all the neurons in the hidden layer one time
step later. Therefore the Elman network has an
explicit memory of one time lag (Elman, 1990).
Similar to a regular feedforward neural net-
work, the strength of all connections between
neurons are indicated with a weight. Initially, all
weight values are chosen randomly and are
optimized during the stage of training. In an
Elman network, the weights from the hidden
layer to the context layer are set to one and are
fixed because the values of the context neurons
have to be copied exactly. Furthermore, the
initial output weights of the context neurons
Figure 4: Architecture of the PNN (input layer, pattern layer, summation layer computing the p_i(x), and decision layer producing C(x)).
are equal to half the output range of the other
neurons in the network.
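One time step of the Elman network described above can be sketched as follows; the layer sizes and random weights are assumptions for illustration, and the context copy is implemented as the fixed weight-one connection the text describes.

```python
# Sketch of one Elman RNN step: the context layer holds an exact copy
# of the previous hidden activations and feeds them back as extra
# inputs to the hidden layer one time step later.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hidden, n_out = 8, 20, 2
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def elman_step(x, context):
    # Hidden layer sees the current input and the one-step-old copy.
    h = np.tanh(W_xh @ x + W_ch @ context)
    y = 1.0 / (1.0 + np.exp(-(W_hy @ h)))
    return y, h           # the new h is copied into the context layer

context = np.zeros(n_hidden)                  # context starts empty
for x in rng.normal(size=(5, n_in)):          # a short input sequence
    y, context = elman_step(x, context)
print(y)
```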
3.5. SVM
The SVM proposed by Vapnik (1995) has been
studied extensively for classification, regression
and density estimation. Figure 6 shows the
architecture of the SVM. The SVM maps the
input patterns into a higher dimensional feature
space through some non-linear mapping chosen
a priori. A linear decision surface is then con-
structed in this high-dimensional feature space.
Thus, the SVM is a linear classifier in the
Figure 5: A schematic representation of an Elman RNN (input, hidden, context and output layers); z^{-1} represents a one-time-step delay unit.
Figure 6: Architecture of the SVM (kernel units K(·), weights w, a bias b and a summation producing the ±1 output; N is the number of support vectors).
parameter space, but it becomes a non-linear
classifier as a result of the non-linear mapping of
the space of the input patterns into the high-
dimensional feature space. Training the SVM is
a quadratic optimization problem. The con-
struction of a hyperplane w^T x + b = 0 (w is the
vector of hyperplane coefficients, b is a bias
term) so that the margin between the hyperplane
and the nearest point is maximized can be
posed as the quadratic optimization problem.
The SVM has been shown to provide
high generalization ability. For a two-class
problem, assuming the optimal hyperplane in
the feature space is generated, the classification
decision of an unknown pattern y will be made
based on
f(y) = sgn[ Σ_{i=1}^{N} a_i y_i K(x_i, y) + b ]                    (8)

where a_i ≥ 0, i = 1, 2, ..., N, are non-negative Lagrange multipliers that satisfy Σ_{i=1}^{N} a_i y_i = 0, {y_i | y_i ∈ {-1, +1}}_{i=1}^{N} are the class labels of the training patterns {x_i | x_i ∈ R^N}_{i=1}^{N}, and K(x_i, y) for i = 1, 2, ..., N represents a symmetric positive definite kernel function that defines an inner product in the feature space.
This shows that f(y) is a linear combination of
the inner products or kernels. The kernel func-
tion enables the operations to be carried out in
the input space rather than in the high-dimen-
sional feature space. Some typical examples of
kernel functions are K(u, v) = v^T u (linear SVM), K(u, v) = (v^T u + 1)^n (polynomial SVM of degree n), K(u, v) = exp(-||u - v||^2 / (2σ^2)) (radial basis function SVM), and K(u, v) = tanh(κ v^T u + θ) (two-layer neural SVM), where σ, κ and θ are constants (Cortes & Vapnik, 1995; Vapnik,
1995). However, a proper kernel function
for a certain problem is dependent on the
specific data and so far there is no established method for choosing the kernel function. In
this study, the choice of the kernel function
was studied empirically and optimal results
were achieved using the radial basis kernel
function.
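The decision rule of equation (8) with the radial basis kernel the study selected can be sketched as below. The support vectors, multipliers a_i, labels y_i and bias b are illustrative assumptions, not values trained on the paper's data.

```python
# Sketch of the SVM decision function f(y) = sgn[sum a_i y_i K(x_i, y) + b]
# with the radial basis function kernel K(u, v) = exp(-||u - v||^2 / 2s^2).
import numpy as np

def rbf_kernel(u, v, sigma=0.3):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def svm_decide(y_new, support_vectors, alphas, labels, b, sigma=0.3):
    s = sum(a * yi * rbf_kernel(xi, y_new, sigma)
            for a, yi, xi in zip(alphas, labels, support_vectors))
    return int(np.sign(s + b))                # +1 or -1 class decision

sv = np.array([[0.0, 0.0], [1.0, 1.0]])       # two assumed support vectors
print(svm_decide(np.array([0.1, 0.1]), sv, [1.0, 1.0], [+1, -1], 0.0))
```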
4. Results
The classifiers proposed for clinical decision-
making were implemented by using the MAT-
LAB software package (MATLAB version 7.0
with neural networks toolbox). The attributes of
diabetes and breast cancer detailed in Tables 1
and 2 were used as the inputs of the classifiers.
The key design decisions for the neural networks
used in classification are the architecture and the
training process. The architectures of the CNN,
ME, PNN, RNN and SVM used for prediction
of diabetes and breast cancer are shown in
Figures 2–6, respectively. The adequate func-
tioning of neural networks depends on the sizes
of the training set and test set. To comparatively
evaluate the performance of the classifiers, all
the classifiers presented in this study were
trained by the same training data set and tested
with the evaluation data set. There are a total of
768 records in the Pima Indians diabetes data-
base, of which 268 are diabetics and 500 are
non-diabetics. In the classifiers, 284 of 768
records were used for training and the rest for
testing. The training set consisted of 136 dia-
betics and 148 non-diabetics. The testing set
consisted of 132 diabetics and 352 non-
diabetics. There are a total of 683 records in the
Wisconsin breast cancer database, of which 444
are benign records and 239 are malignant re-
cords. In the classifiers, 250 of 683 records were
used for training and the rest for testing. The
training set consisted of 80 malignant records
and 170 benign records. The testing set consisted
of 159 malignant records and 274 benign re-
cords.
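The diabetes split described above can be sketched as follows; the arrays here are synthetic placeholders for the real database records, with the same counts as in the study.

```python
# Sketch of the Pima training/testing split: 284 training records
# (136 diabetics, 148 non-diabetics) and the remaining 484 for testing.
import numpy as np

rng = np.random.default_rng(4)
diabetics = rng.normal(size=(268, 8))         # 268 diabetic records
non_diabetics = rng.normal(size=(500, 8))     # 500 non-diabetic records

train_X = np.vstack([diabetics[:136], non_diabetics[:148]])
train_y = np.array([1] * 136 + [0] * 148)
test_X = np.vstack([diabetics[136:], non_diabetics[148:]])
test_y = np.array([1] * 132 + [0] * 352)
print(train_X.shape, test_X.shape)
```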
The training algorithm of the SVMs, based on
quadratic programming, incorporates several
optimization techniques such as decomposition
and caching. The quadratic programming prob-
lem in the SVMs was solved by using the
MATLAB optimization toolbox. For the imple-
mentation of the SVMs with the radial basis
kernel functions, one has to assume a value for σ. The optimal σ can only be found by systematically varying its value in the different training
sessions. To do this, the support vectors were
extracted from the training data file with an
assumed σ value. After the support vectors had
been found and SVMs had been constructed, the
model was applied to a third of the evaluation
data set to compute the misclassification rate.
The σ value was varied between 0.1 and 0.6, at an interval of 0.1. Values of σ = 0.3 for the diabetes database and σ = 0.4 for the breast
cancer database resulted in the minimum mis-
classification rate and were thus chosen. The
generalization ability of the SVMs is controlled
by two different factors: the training error rate
and the capacity of the learning machine mea-
sured by its Vapnik–Chervonenkis (VC) dimen-
sion (Vapnik, 1995). The smaller the VC
dimension of the function set of the learning
machine, the larger the value of the training
error rate. The tradeoff between the complexity
of the decision rule and the training error rate
can be controlled by changing a parameter C
(Cortes & Vapnik, 1995) in the SVMs. The
SVMs were trained for different C values until
the best results were obtained: C = 90 for the diabetes database and C = 80 for the breast
cancer database in the testing procedure.
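The σ search described above can be sketched as a small grid search. Here `train_svm` and `misclassification_rate` are hypothetical stand-ins for the MATLAB-based training and evaluation routines, and the toy error curve is minimized near σ = 0.3 to mimic the diabetes result.

```python
# Sketch of the sigma selection: sigma is varied from 0.1 to 0.6 in
# steps of 0.1, keeping the value with the lowest misclassification
# rate on the held-out evaluation data.
import numpy as np

def select_sigma(train_svm, misclassification_rate, train_data, eval_data):
    best_sigma, best_rate = None, np.inf
    for sigma in np.arange(0.1, 0.7, 0.1):
        model = train_svm(train_data, sigma=sigma)
        rate = misclassification_rate(model, eval_data)
        if rate < best_rate:
            best_sigma, best_rate = sigma, rate
    return best_sigma, best_rate

# Toy stand-ins: the "model" is just sigma, and the error curve has
# its minimum near sigma = 0.3.
sel, rate = select_sigma(lambda d, sigma: sigma,
                         lambda m, d: abs(m - 0.3),
                         train_data=None, eval_data=None)
print(round(sel, 1))
```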
The Elman network can be trained with
gradient descent backpropagation and optimi-
zation methods, similar to regular feedforward
neural networks (Pineda, 1987). Backpropaga-
tion has some problems for many applications.
The algorithm is not guaranteed to find the
global minimum of the error function since
gradient descent may get stuck in local minima,
where it may remain indefinitely. In addition to
this, long training sessions are often required in
order to find an acceptable weight solution
because of the well-known difficulties inherent
in gradient descent optimization (Haykin,
1994). Therefore many variations to improve
the convergence of the backpropagation were
proposed. Optimization methods such as sec-
ond-order methods (conjugate gradient, quasi-
Newton, Levenberg–Marquardt) have also been
used for neural network training in recent years.
The Levenberg–Marquardt algorithm combines
the best features of the Gauss–Newton techni-
que and the steepest-descent algorithm but
avoids many of their limitations (Battiti, 1992;
Hagan & Menhaj, 1994). Therefore, the RNNs
implemented in this study were trained by the
Levenberg–Marquardt algorithm.
There is an outstanding issue associated with
the PNNs concerning network structure deter-
mination, i.e. determining the network size, the
locations of pattern layer neurons as well as the
value of the smoothing parameter. The PNNs
had pattern layer neurons, two summation
layer neurons, each corresponding to one of
two classes, and one output layer neuron to
make a two-class Bayesian decision. The objec-
tive is to select representative pattern layer
neurons from the training samples. The output
of a summation layer neuron becomes a linear
combination of the outputs of pattern layer
neurons. Subsequently, an orthogonal algo-
rithm was used to select pattern layer neurons.
As in the SVM training, the smoothing parameter σ was determined based on the minimum
misclassification rate computed from the partial
evaluation data set. The minimum misclassification rates were attained at σ = 0.04 (for the diabetes database) and σ = 0.03 (for the breast
cancer database).
The EM algorithm can be extended to provide
an effective training mechanism for the ME
based on a Gaussian probability assumption.
Although originally the model structure is pre-
determined and the training algorithm is based
on the Gaussian probability assumption for
each expert model output, the ME framework
is a powerful concept that can be extended to a
wide variety of applications including medical
diagnostic decision support system applications
due to numerous inherent advantages such as
the following. (i) A global model can be decom-
posed into a set of simple local models, from
which controller design is straightforward. Each
model can represent a different data source with
an associated state estimator/predictor. In this
case the ME system can be viewed as a data
fusion algorithm. (ii) The local models operate
independently but provide output information that can be strongly correlated with
each other, so that the overall system perfor-
mance can be enhanced in terms of reliability or
fault tolerance. (iii) The global output of the ME
system is derived as a convex combination of the
outputs from a set of N experts, in which the
overall system predictive performance is gener-
ally superior to that of any of the individual
experts.
Two sets of neural networks were trained for
the first-level models in the CNNs, since there
were two possible outcomes. Networks in each
set were trained so that they were likely to be
more accurate for one type of disorder than the
other disorder. The network architecture was
the MLPNN and each network had input neu-
rons equal to the dimension of attributes of the
records in the database (feature vector). Sam-
ples with target outputs were given the binary
target values of (0, 1) and (1, 0). The second-
level neural networks were trained to combine
the predictions of the first-level networks. The
second-level networks had four inputs which
corresponded to the outputs of the two groups
of first-level networks. The targets for the sec-
ond-level networks were the same as the targets
of the original data. In order to compare the
performance of the different classifiers for the
same classification problems, MLPNNs, which
are the most commonly used feedforward neural
networks, were also implemented. Different experiments were performed during implementation of these classifiers and the number of
hidden neurons was determined by taking into
consideration the classification accuracies. In
the hidden layers and the output layers, the
activation function was the sigmoidal function.
The sigmoidal function with a range between
zero and one introduces two important proper-
ties. First, the sigmoid is non-linear, allowing
the network to perform complex mappings of
input to output vector spaces, and second it is
continuous and differentiable, which allows the
gradient of the error to be used in updating the
weights. The Levenberg–Marquardt algorithm
was used for training the CNNs and MLPNNs.
Table 3 defines the network parameters of the
classifiers implemented in this research.
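The two-level combining scheme described above can be sketched with scikit-learn MLPs. This is a stand-in, not the paper's setup: scikit-learn trains with Adam or L-BFGS rather than Levenberg–Marquardt, the data set is synthetic, and the two first-level networks here differ only in their random initialization instead of each being biased toward one disorder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for an 8-attribute diagnostic data set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# First level: two networks trained on the same records (differing
# only in initialization here, unlike the paper's disorder-biased sets).
net_a = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                      random_state=1).fit(X_tr, y_tr)
net_b = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                      random_state=2).fit(X_tr, y_tr)

# Second level: four inputs = the two class probabilities produced by
# each first-level network; the targets are the original labels.
Z_tr = np.hstack([net_a.predict_proba(X_tr), net_b.predict_proba(X_tr)])
Z_te = np.hstack([net_a.predict_proba(X_te), net_b.predict_proba(X_te)])
combiner = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                         random_state=3).fit(Z_tr, y_tr)
print("second-level test accuracy:", combiner.score(Z_te, y_te))
```

The key structural point, which this sketch preserves, is that the second-level network never sees the raw attributes: its inputs are the predictions of the first-level networks, exactly as in stacked generalization (Wolpert, 1992).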
Classification results of the classifiers were
displayed as a confusion matrix. In a confusion
matrix, each cell contains the raw number of
exemplars classified for the corresponding com-
bination of desired and actual network outputs.
The confusion matrices showing the classifica-
tion results of the classifiers implemented for
prediction of diabetes and breast cancer are
given in Tables 4 and 5. From these matrices
one can tell the frequency with which a record is
misclassified.
The test performance of the classifiers can be
determined by the computation of specificity,
sensitivity and total classification accuracy,
which are defined as follows.
Specificity: number of true negative decisions / number of actual negative cases
Sensitivity: number of true positive decisions / number of actual positive cases
Total classification accuracy: number of correct decisions / total number of cases
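These three measures can be computed directly from the cells of a 2x2 confusion matrix, as in this short sketch using the SVM row of Table 4:

```python
def metrics(tn, fp, fn, tp):
    """Specificity, sensitivity and total accuracy from a 2x2 confusion matrix."""
    specificity = tn / (tn + fp)                # true negatives / actual negatives
    sensitivity = tp / (tp + fn)                # true positives / actual positives
    accuracy = (tn + tp) / (tn + fp + fn + tp)  # correct decisions / all cases
    return specificity, sensitivity, accuracy

# SVM row of Table 4: 351 true non-diabetics, 1 false positive,
# 1 false negative, 131 true diabetics.
spec, sens, acc = metrics(tn=351, fp=1, fn=1, tp=131)
print(f"specificity {spec:.2%}, sensitivity {sens:.2%}, accuracy {acc:.2%}")
```

This reproduces the SVM diabetes figures reported in Table 6 (99.72%, 99.24% and 99.59%).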
In order to determine the performances of the
classifiers used for the prediction of diabetes
Table 3: Network parameters of the classifiers

Classifier   Diabetes data set              Breast cancer data set
SVM          8, 16, 2 (a)                   9, 12, 2 (a)
RNN          8, 20r, 2 (b)                  9, 15r, 2 (b)
PNN          8, 24, 2, 1 (c)                9, 22, 2, 1 (c)
ME           8, 20, 2 (d); 8, 20, 2 (e)     9, 15, 2 (d); 9, 15, 2 (e)
CNN          8, 25, 4 (f); 4, 25, 2 (g)     9, 20, 4 (f); 4, 25, 2 (g)
MLPNN        8, 20, 20, 2 (h)               9, 15, 15, 2 (h)

(a) Design of SVMs: number of input neurons, support vectors, output neurons, respectively.
(b) Design of RNNs: number of input neurons, recurrent neurons in the hidden layer, output neurons, respectively.
(c) Design of PNNs: number of input neurons, pattern layer neurons, summation layer neurons, output layer neurons, respectively.
(d) Design of expert networks: number of input neurons, hidden neurons, output neurons, respectively.
(e) Design of gating network: number of input neurons, hidden neurons, output neurons, respectively.
(f) Design of first-level network: number of input neurons, hidden neurons, output neurons, respectively.
(g) Design of second-level network: number of input neurons, hidden neurons, output neurons, respectively.
(h) Design of neural network: number of input neurons, hidden neurons in the first hidden layer, hidden neurons in the second hidden layer, output neurons, respectively.
and breast cancer, the classification accuracies
(specificity, sensitivity, total classification accu-
racy) on the test sets are presented in Table 6.
5. Discussion
Based on the results of the present study and the
studies in the area of computational intelligence
existing in the literature (classification of breast
cancer and diabetes data sets), the following
observations can be made.
1. Previous research in this area has been
undertaken by various researchers. Wu
et al. (1993) used an ANN to learn from
133 instances each containing 43 mammo-
graphic features rated between 0 and 10 by
a mammographer. The ANN was trained
with the backpropagation algorithm using
10 hidden nodes, and a single output node
was trained to produce 1 for malignant
and 0 for benign cases. The performance
of the ANN was found to be competitive
to the domain expert, and after a consider-
able amount of feature selection the
performance of the ANN improved and
significantly outperformed the domain expert.
2. Another use of backpropagation was un-
dertaken by Floyd et al. (1994) who used
eight input parameters: mass size and mar-
gin, asymmetric density, architectural dis-
tortion, calcification number, morphology,
density and distribution. After extensive
experiments with backpropagation over
their limited data set of 260 cases, they
achieved a classification accuracy of 50%.
3. Wilding et al. (1994) suggested the use of
backpropagation; they followed a similar
backpropagation approach to the previous
references (Wu et al., 1993; Floyd et al.,
1994) but with different input sets derived
from a group of blood tests. However, with
104 instances and 10 inputs, it seems that
their ANN failed to perform well.
4. Backpropagation suffers the disadvantage
of being easily trapped in a local minimum.
Therefore, Fogel et al. (1995) used an
evolutionary programming approach to
train the ANN to overcome the disadvan-
tage of backpropagation. They used a
population of 500 networks and evolved
the population for 400 generations, there-
fore generating 20 000 potential networks.
The approach was tested on the Wisconsin
data set (Wolberg & Mangasarian, 1990)
which is used in this paper. They managed
to achieve a significant result with 98% of
the test cases correctly classified. Apart
from their few trials and the dependence
of their approach on a predefined network
Table 4: Confusion matrices of the classifiers used for prediction of diabetes

Classifier   Desired result    Output: Non-diabetics   Output: Diabetics
SVM          Non-diabetics     351                     1
             Diabetics         1                       131
RNN          Non-diabetics     346                     3
             Diabetics         6                       129
PNN          Non-diabetics     347                     3
             Diabetics         5                       129
ME           Non-diabetics     345                     2
             Diabetics         7                       130
CNN          Non-diabetics     342                     4
             Diabetics         10                      128
MLPNN        Non-diabetics     322                     12
             Diabetics         30                      120
Table 5: Confusion matrices of the classifiers used for prediction of breast cancer

Classifier   Desired result       Output: Benign records   Output: Malignant records
SVM          Benign records       273                      1
             Malignant records    1                        158
RNN          Benign records       271                      3
             Malignant records    3                        156
PNN          Benign records       270                      4
             Malignant records    4                        155
ME           Benign records       271                      2
             Malignant records    3                        157
CNN          Benign records       268                      5
             Malignant records    6                        154
MLPNN        Benign records       253                      14
             Malignant records    21                       145
architecture, their approach performed
very well compared to the previous studies.
5. Setiono (1996) used rule extraction from
an ANN algorithm to extract useful rules
that can predict breast cancer from the
Wisconsin data set (Wolberg & Mangasarian,
1990). He first needed to train an
ANN using backpropagation and
achieved an accuracy level on the test data
of approximately 94%. After applying the
rule extraction technique, the accuracy of
the extracted rule set did not change.
Setiono (2000) used feature selection be-
fore training the ANN. The new rule sets
had an average accuracy of more than
96%. This is an improvement compared
to the initial results.
6. Furundzic et al. (1998) presented another
backpropagation ANN attempt where
they used 47 input features, and after the
use of some heuristics to determine the
number of hidden units, they used five
hidden units. With 200 instances and after
a significant amount of feature selection,
they reduced the number of input features
to 29 while maintaining the same classifi-
cation accuracy.
7. Pendharkar et al. (1999) presented a com-
parison between data envelopment analy-
sis and ANNs. They found that the ANN
approach was significantly better than the
data envelopment analysis approach, with
around 25% improvement in classification
accuracy.
8. Abbass (2002) presented an evolutionary
ANN approach based on the Pareto differ-
ential evolution algorithm augmented with
local search for the prediction of breast
cancer. The study showed empirically that
the proposed approach had better general-
ization than previous approaches, with
much lower computational cost. The aver-
age accuracy obtained for the breast can-
cer data set was 98.1%.
9. The result of a study by Shanker (1996)
that used neural networks to predict the
onset of diabetes in Pima Indian women
showed that the neural network is a viable
approach to classification.
10. Park and Edington (2001) presented an
approach that uses a sequential MLPNN
with backpropagation learning and an ex-
plicit model of time-varying inputs along
with the sequentially obtained prediction
probability, which was obtained by em-
bedding a multivariate logistic function
for consecutive years. The approach out-
performed the baseline classification and
regression models in terms of sensitivity
(86.04%) for test data.
11. The results of the present study indicated
excellent performance of the SVMs on the
classification of the Pima Indians diabetes
database (total classification accuracy
99.59%) and the Wisconsin breast cancer
database (total classification accuracy
99.54%).
Table 6: The classification accuracies of the classifiers

                              Diabetes                                                Breast cancer
Classifier   Specificity (%)  Sensitivity (%)  Total classification accuracy (%)      Specificity (%)  Sensitivity (%)  Total classification accuracy (%)
SVM          99.72            99.24            99.59                                  99.64            99.37            99.54
RNN          98.30            97.73            98.14                                  98.91            98.11            98.61
PNN          98.58            97.73            98.35                                  98.54            97.48            98.15
ME           98.01            98.48            98.14                                  98.91            98.74            98.85
CNN          97.16            96.97            97.11                                  97.81            96.86            97.46
MLPNN        91.48            90.91            91.32                                  92.34            91.19            91.92

6. Conclusion

The purpose of the present research was to
investigate the accuracy of five types of
automated diagnostic systems, namely CNNs, ME,
PNNs, RNNs and SVMs, for clinical decision-
making. The performance of these classifiers
was then compared with one another and with that of
the MLPNN. These classifiers were trained on
the attributes of each record in the Pima Indians
diabetes database and the Wisconsin breast
cancer database. The classification results and
the values of statistical parameters indicated
that the SVMs had considerable success. The
SVM classifiers showed excellent performance,
since they map the features into a higher-dimensional
space. Besides this, the RNN, PNN, ME
and CNN classifiers provided encouraging re-
sults. The performance of the MLPNN was not
as high as the other classifiers. This may be
attributed to several factors including the train-
ing algorithms, estimation of the network para-
meters and the scattered and mixed nature of the
features. The results obtained confirmed the
validity of the classifiers for clinical decision-
making.
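The higher-dimensional mapping credited to the SVMs above can be illustrated briefly. This sketch uses scikit-learn's copy of the Wisconsin diagnostic breast cancer data (a later variant of the Wisconsin data, not necessarily the records used in this paper), and the RBF kernel is an assumption, since the paper does not specify its kernel.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Wisconsin diagnostic breast cancer data as shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the features into a very
# high-dimensional space in which the classes become separable.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

Even this default-parameter sketch reaches accuracy well above the MLPNN figures in Table 6, consistent with the conclusion's observation about kernel-based feature mapping.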
References
ABBASS, H.A. (2002) An evolutionary artificial neural networks approach for breast cancer diagnosis, Artificial Intelligence in Medicine, 25, 265–281.
BASHEER, I.A. and M. HAJMEER (2000) Artificial neural networks: fundamentals, computing, design, and application, Journal of Microbiological Methods, 43 (1), 3–31.
BATTITI, R. (1992) First- and second-order methods for learning: between steepest descent and Newton's method, Neural Computation, 4, 141–166.
BESSER, G.M., H.J. BODANSKY and A.G. CUDWORTH (1988) Clinical Diabetes: an Illustrated Text, London: Gower Medical.
BURRASCANO, P. (1991) Learning vector quantization for the probabilistic neural network, IEEE Transactions on Neural Networks, 2 (4), 458–461.
CHAUDHURI, B.B. and U. BHATTACHARYA (2000) Efficient training and improved performance of multilayer perceptron in pattern classification, Neurocomputing, 34, 11–27.
CHEN, K., L. XU and H. CHI (1999) Improved learning algorithms for mixture of experts in multiclass classification, Neural Networks, 12 (9), 1229–1252.
CORTES, C. and V. VAPNIK (1995) Support vector networks, Machine Learning, 20 (3), 273–297.
ELMAN, J.L. (1990) Finding structure in time, Cognitive Science, 14 (2), 179–211.
FLOYD, C.E., J.Y. LO, A.J. YUN, D.C. SULLIVAN and P.J. KORNGUTH (1994) Prediction of breast cancer malignancy using an artificial neural network, Cancer, 74, 2944–2998.
FOGEL, D.B., E.C. WASSON and E.M. BOUGHTON (1995) Evolving neural networks for detecting breast cancer, Cancer Letters, 96 (1), 49–53.
FURUNDZIC, D., M. DJORDJEVIC and A.J. BEKIC (1998) Neural networks approach to early breast cancer detection, Journal of Systems Architecture, 44 (8), 617–633.
GULER, I. and E.D. UBEYLI (2003) Detection of ophthalmic artery stenosis by least-mean squares backpropagation neural network, Computers in Biology and Medicine, 33 (4), 333–343.
GULER, I. and E.D. UBEYLI (2005a) ECG beat classifier designed by combined neural network model, Pattern Recognition, 38 (2), 199–208.
GULER, I. and E.D. UBEYLI (2005b) A mixture of experts network structure for modelling Doppler ultrasound blood flow signals, Computers in Biology and Medicine, 35 (7), 565–582.
HAGAN, M.T. and M.B. MENHAJ (1994) Training feedforward networks with the Marquardt algorithm, IEEE Transactions on Neural Networks, 5 (6), 989–993.
HAYKIN, S. (1994) Neural Networks: A Comprehensive Foundation, New York: Macmillan.
HONG, X. and C.J. HARRIS (2002) A mixture of experts network structure construction algorithm for modelling and control, Applied Intelligence, 16 (1), 59–69.
ITCHHAPORIA, D., P.B. SNOW, R.J. ALMASSY and W.J. OETGEN (1996) Artificial neural networks: current status in cardiovascular medicine, Journal of the American College of Cardiology, 28 (2), 515–521.
JACOBS, R.A., M.I. JORDAN, S.J. NOWLAN and G.E. HINTON (1991) Adaptive mixtures of local experts, Neural Computation, 3 (1), 79–87.
JEREZ-ARAGONES, J.M., J.A. GOMEZ-RUIZ, G. RAMOS-JIMENEZ, J. MUNOZ-PEREZ and E. ALBA-CONEJO (2003) A combined neural network and decision trees model for prognosis of breast cancer relapse, Artificial Intelligence in Medicine, 27 (1), 45–63.
JORDAN, M.I. and R.A. JACOBS (1994) Hierarchical mixture of experts and the EM algorithm, Neural Computation, 6 (2), 181–214.
KORDYLEWSKI, H., D. GRAUPE and K. LIU (2001) A novel large-memory neural network as an aid in medical diagnosis applications, IEEE Transactions on Information Technology in Biomedicine, 5 (3), 202–209.
KWAK, N. and C.-H. CHOI (2002) Input feature selection for classification problems, IEEE Transactions on Neural Networks, 13 (1), 143–159.
LIM, C.P., R.F. HARRISON and R.L. KENNEDY (1997) Application of autonomous neural network systems to medical pattern classification tasks, Artificial Intelligence in Medicine, 11, 215–239.
MILLER, A.S., B.H. BLOTT and T.K. HAMES (1992) Review of neural network applications in medical imaging and signal processing, Medical and Biological Engineering and Computing, 30, 449–464.
MOBLEY, B.A., E. SCHECHTER, W.E. MOORE, P.A. MCKEE and J.E. EICHNER (2000) Predictions of coronary artery stenosis by artificial neural network, Artificial Intelligence in Medicine, 18, 187–203.
PARK, J. and D.W. EDINGTON (2001) A sequential neural network model for diabetes prediction, Artificial Intelligence in Medicine, 23, 277–293.
PENDHARKAR, P.C., J.A. RODGER, G.J. YAVERBAUM, N. HERMAN and M. BENNER (1999) Association, statistical, mathematical and neural approaches for mining breast cancer patterns, Expert Systems with Applications, 17, 223–232.
PETROSIAN, A., D. PROKHOROV, R. HOMAN, R. DASHEIFF and D.D. WUNSCH II (2000) Recurrent neural network based prediction of epileptic seizures in intra- and extra-cranial EEG, Neurocomputing, 30, 201–218.
PINEDA, F.J. (1987) Generalization of back-propagation to recurrent neural networks, Physical Review Letters, 59 (19), 2229–2232.
SETIONO, R. (1996) Extracting rules from pruned neural networks for breast cancer diagnosis, Artificial Intelligence in Medicine, 8 (1), 37–51.
SETIONO, R. (2000) Generating concise and accurate classification rules for breast cancer diagnosis, Artificial Intelligence in Medicine, 18 (3), 205–219.
SHANKER, M.S. (1996) Using neural networks to predict the onset of diabetes mellitus, Journal of Chemical Information and Computer Sciences, 36, 35–41.
SHIEH, J.-S., C.-F. CHOU, S.-J. HUANG and M.-C. KAO (2004) Intracranial pressure model in intensive care unit using a simple recurrent neural network through time, Neurocomputing, 57, 239–256.
SPECHT, D.F. (1990) Probabilistic neural networks, Neural Networks, 3 (1), 109–118.
TAFEIT, E. and G. REIBNEGGER (1999) Artificial neural networks in laboratory medicine and medical outcome prediction, Clinical Chemistry and Laboratory Medicine, 37 (9), 845–853.
UBEYLI, E.D. and I. GULER (2003) Neural network analysis of internal carotid arterial Doppler signals: predictions of stenosis and occlusion, Expert Systems with Applications, 25 (1), 1–13.
UBEYLI, E.D. and I. GULER (2005) Feature extraction from Doppler ultrasound signals for automated diagnostic systems, Computers in Biology and Medicine, 35 (9), 735–764.
VAPNIK, V. (1995) The Nature of Statistical Learning Theory, New York: Springer.
WEST, D. and V. WEST (2000) Model selection for a medical diagnostic decision support system: a breast cancer detection case, Artificial Intelligence in Medicine, 20 (3), 183–204.
WILDING, P., M.A. MORGAN, A.E. GRYGOTIS, M.A. SHOFFNER and E.F. ROSATO (1994) Application of backpropagation neural networks to diagnosis of breast and ovarian cancer, Cancer Letters, 77, 145–153.
WOLBERG, W.H. and O.L. MANGASARIAN (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proceedings of the National Academy of Sciences, 87, 9193–9196.
WOLPERT, D.H. (1992) Stacked generalization, Neural Networks, 5, 241–259.
WU, Y.Z., M.L. GIGER, K. DOI, C.J. VYBORNY, R.A. SCHMIDT and C.E. METZ (1993) Artificial neural networks in mammography: application to decision making in the diagnosis of breast cancer, Radiology, 187, 81–87.
The author
Elif Derya Ubeyli
Elif Derya Ubeyli received her first degree and
an MSc in electronic engineering from Cukur-
ova University, Turkey, and her PhD in electro-
nics and computer technology from Gazi
University. She is an associate professor in the
Department of Electrical and Electronics Engi-
neering at TOBB Economics and Technology
University, Ankara. Her interest areas include
biomedical signal processing, neural networks
and artificial intelligence. She has written more
than 75 articles on biomedical engineering.