Comparison of different classification algorithms in clinical decision-making
Elif Derya Ubeyli
Department of Electrical and Electronics Engineering, Faculty of Engineering, TOBB Ekonomi ve Teknoloji Universitesi, 06530 Sogutozu, Ankara, Turkey
E-mail: [email protected]
Abstract: This paper gives an integrated view of implementing automated diagnostic systems for clinical
decision-making. Because of the importance of making the right decision, better classification procedures are
necessary for clinical decisions. The major objective of the paper is to be a guide for readers who want to develop
an automated decision support system for clinical practice. The purpose was to determine an optimum
classification scheme with high diagnostic accuracy for this problem. Several different classification algorithms
were tested and benchmarked for their performance. The performance of the classification algorithms is
illustrated on two data sets: the Pima Indians diabetes and the Wisconsin breast cancer. The present research
demonstrates that the support vector machines achieved diagnostic accuracies which were higher than those of
other automated diagnostic systems.
Keywords: classification algorithms, automated diagnostic systems, clinical decision-making, diagnostic accuracy
1. Introduction
Artificial neural networks (ANNs) are valuable
tools in the medical field for the development of
decision support systems. Important tools in
modern decision-making, in any field, include
those that allow the decision-maker to assign an
object to an appropriate group or classification.
Clinical decision-making is a challenging, multi-
faceted process. Its goals are precision in diag-
nosis and institution of efficacious treatment.
Achieving these objectives involves access to
pertinent data and application of previous
knowledge to the analysis of new data in order
to recognize patterns and relations. Practi-
tioners apply various statistical techniques in
processing data to assist in clinical decision-
making and to facilitate the management of
patients. As the volume and complexity of data
have increased, use of digital computers to
support data analysis has become a necessity.
In addition to computerization of standard
statistical analysis, several other techniques for
computer-aided data classification and reduc-
tion, generally referred to as ANNs, have
evolved (Miller et al., 1992; Itchhaporia et al.,
1996; Tafeit & Reibnegger, 1999; Basheer &
Hajmeer, 2000).
On analysing recent developments, it becomes
clear that the trend is to develop new methods
for computer decision-making in medicine and
to evaluate these methods critically in clinical
practice. ANNs have been used in different
medical diagnoses and the results have been
compared with physicians’ diagnoses and exist-
ing classification methods (Setiono, 1996, 2000;
Shanker, 1996; Lim et al., 1997; Mobley et al.,
2000; West & West, 2000; Park & Edington,
© 2007 The Author. Journal Compilation © 2007 Blackwell Publishing Ltd. Expert Systems, February 2007, Vol. 24, No. 1
2001; Guler & Ubeyli, 2003; Ubeyli & Guler,
2003). Many of these researchers found that
ANNs have more flexibility in modelling and
reasonable accuracy in prediction. What makes
neural networks a promising tool is their capa-
city to find near-optimum solutions from limited
or incomplete data sets and the fact that learn-
ing is accomplished through training. In addi-
tion to these characteristics, it has been shown
that neural networks can combine data of a
different nature in one system, such as data
derived from clinical protocols, laboratory data
obtained from measurements and features from
signals and images, thus forming an inte-
grated diagnostic system (Miller et al., 1992;
Itchhaporia et al., 1996; Tafeit & Reibnegger,
1999; Basheer & Hajmeer, 2000).
The main concept of medical technology is an
inductive engine that learns the decision char-
acteristics of the diseases and can then be used to
diagnose future patients with uncertain disease
states. A number of quantitative models in-
cluding multilayer perceptron neural networks
(MLPNNs), combined neural networks
(CNNs), mixture of experts (ME), probabilistic
neural networks (PNNs), recurrent neural net-
works (RNNs) and support vector machines
(SVMs) are being used in medical diagnostic
support systems to assist human decision-
makers in disease diagnosis (Kordylewski et al.,
2001; Kwak & Choi, 2002; Ubeyli & Guler,
2005). Unfortunately, there is no theory avail-
able to guide an intelligent choice of model
based on the complexity of the diagnostic task.
In most situations, developers are simply pick-
ing a single model that yields satisfactory re-
sults, or they are benchmarking a small subset of
models with cross-validation estimates on test
sets (Kordylewski et al., 2001; Kwak & Choi,
2002; Ubeyli & Guler, 2005). Figure 1 shows the
various stages followed for the design of a
classification system. As is apparent from the
feedback arrows, these stages are not indepen-
dent. On the contrary, they are interrelated and,
depending on the results, one may go back to
redesign earlier stages in order to improve the
overall performance. Feedforward neural net-
works are a basic type of ANNs capable of
approximating generic classes of functions, in-
cluding continuous and integrable functions. An
important class of feedforward neural networks
is MLPNNs. The MLPNNs have features such
as the ability to learn and generalize, smaller
training set requirements, fast operation and
ease of implementation, and therefore they are
the most commonly used neural network archi-
tectures (Haykin, 1994; Basheer & Hajmeer,
2000; Chaudhuri & Bhattacharya, 2000). The
economic and social costs of diabetes and
breast cancer are very high. As a result, these
problems have attracted many researchers in the
area of computational intelligence recently
(Setiono, 1996, 2000; Shanker, 1996; West &
West, 2000; Park & Edington, 2001). They
managed to achieve significant results by using
different classifiers. In this paper, a comparison
of several different classifiers, which were
trained on the attributes of each record in the
diabetes and breast cancer databases, is consid-
ered. Among these are the CNN, ME, PNN,
RNN and SVM. The performances of these
systems were then compared with that of the
MLPNN. Significant improvement in accuracy
was achieved by using the SVM compared to the
other classification algorithms.
2. Database overview
Diabetes is a metabolic disease in which there is
a deficiency or absence of insulin secretion by
Figure 1: The basic stages involved in the design of a classification system (patterns pass through a sensor, feature extraction, feature selection, classifier design and system evaluation, with feedback between stages).
the pancreas. Diabetes occurs in two major
forms: type I, or insulin-dependent diabetes,
and type II, or non-insulin-dependent diabetes.
Most studies of the age of onset of type I
diabetes have been restricted to children and
adults under 30 years old. The age distributions
in females and males show small differences
which have been most clearly demonstrated in
surveys in children and young adults. Both
genders are affected, but in many communities
the majority with type II diabetes are female.
The prevalence of type II diabetes rises with
increasing age and in many populations the
majority of those with diabetes are either
middle-aged or elderly. Diabetes may be consid-
ered as a disorder of the metabolic disposal of
food. The interaction of food and diabetic state
must be assessed from two aspects: first, whether
food precipitates the diabetic condition and,
second, the type of food that is most appropriate
for the person with established diabetes,
whether it is insulin-dependent or non-insulin-
dependent. Diabetes shows considerable famil-
ial aggregation which may result from either the
inheritance of disease susceptibility or the shar-
ing of a common environment by members of
the same family. Therefore, it is important to
determine the existence of diabetes in families of
the subjects. The occurrence of gestational dia-
betes, which either first appears or is first recog-
nized during pregnancy, is associated with
increased risk for the development of diabetes
in subsequent years. Thus, women who experi-
ence gestational diabetes should be considered
a high risk group for the development of type II
diabetes (Besser et al., 1988). In this study,
the Pima Indians diabetes database (http://
www.cormactech.com/neunet) was analysed.
The data consist of 768 records and according
to the examination results 268 of them are
diabetics and the rest of them are non-diabetics.
Each record has eight attributes and these are
detailed in Table 1. Eight independent input
parameters, essentially risk factors for diabetes,
are incorporated in the classifiers.
Breast cancer is a malignant tumour that has
developed from cells of the breast. Although
scientists know some of the risk factors (i.e.
ageing, genetic risk factors, family history, men-
strual periods, not having children, obesity) that
increase a woman’s chance of developing breast
cancer, they do not yet know what causes most
breast cancers or exactly how some of these risk
factors cause cells to become cancerous. Re-
search is under way to learn more and scientists
are making great progress in understanding
how certain changes in DNA can cause normal
breast cells to become cancerous (Jerez-
Aragones et al., 2003). In this study, the Wiscon-
sin breast cancer database taken from fine
needle aspirates from human breast tissue
was analysed. The aspirates were collected by
Wolberg and Mangasarian (1990) at the Uni-
versity of Wisconsin-Madison Hospitals. The
data consist of 683 records of visually assessed
nuclear features of fine needle aspirates taken
from patients’ breasts. Each record in the data-
base has nine attributes. The nine attributes
Table 1: Pima Indians diabetes database: description of attributes

Attribute number   Attribute description                                                   Mean    Standard deviation
1                  Number of times pregnant                                                3.8     3.4
2                  Plasma glucose concentration at 2 h in an oral glucose tolerance test   120.9   32.0
3                  Diastolic blood pressure (mmHg)                                         69.1    19.4
4                  Triceps skin fold thickness (mm)                                        20.5    16.0
5                  2-hour serum insulin (mU/ml)                                            79.8    115.2
6                  Body mass index (weight in kg/(height in m)^2)                          32.0    7.9
7                  Diabetes pedigree function                                              0.5     0.3
8                  Age (years)                                                             33.2    11.8

N = 768 observations, 268 diabetics and 500 non-diabetics.
detailed in Table 2 are graded on an interval
scale from a normal state of 1 to 10, with 10
being the most abnormal state. There are 239
malignant cases and 444 benign cases. A malig-
nant label is confirmed by performing a biopsy
on the breast tissue. Either a biopsy or a periodic
examination is used to confirm a benign label.
3. Brief review of different automated
diagnostic systems
3.1. CNN
The CNN models often result in a prediction
accuracy that is higher than that of the indivi-
dual models. This construction is based on a
straightforward approach that has been termed
stacked generalization (Figure 2). Training data
that are difficult to learn usually demonstrate
high dispersion in the search space due to the
inability of the low-level measurement attributes
to describe the concept concisely. Because of the
complex interactions among variables and the
high degree of noise and fluctuations, a signifi-
cant number of data used for applications are
naturally available in representations that are
difficult to learn. The degree of difficulty in
training a neural network is inherent in the given
set of training examples. By developing a tech-
nique for measuring this learning difficulty, a
feature construction methodology is devised
that transforms the training data and attempts
to improve both the classification accuracy and
computational times of ANN algorithms. The
fundamental notion is to organize data by
intelligent preprocessing, so that learning is
facilitated (Wolpert, 1992; Guler & Ubeyli,
2005a). The stacked generalization concepts
formalized by Wolpert (1992) predate these
ideas and refer to schemes for feeding informa-
tion from one set of generalizers to another
before forming the final predicted value (out-
put). The unique contribution of stacked gener-
alization is that the information fed into the net
of generalizers comes from multiple partition-
ings of the original learning set. The stacked
generalization scheme can be viewed as a more
sophisticated version of cross-validation and
has been shown experimentally to effectively
improve the generalization ability of ANN
models over individual neural networks. The
MLPNNs were used at the first level and second
level for the implementation of the CNNs pro-
posed in this study.
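The stacked-generalization scheme described above can be sketched as follows. This is a minimal illustration only: the data are synthetic stand-ins for the database records, and simple logistic models trained by gradient descent replace the MLPNNs used in the study.

```python
# Sketch of stacked generalization (Wolpert, 1992): first-level models
# are trained on different partitionings of the learning set, and a
# second-level model is trained on their outputs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # 8 attributes per record
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # synthetic binary labels

def train_logistic(X, y, epochs=500, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)       # gradient of log loss
    return w

def predict(w, X):
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# First level: two base learners trained on different halves of the
# learning set (the "multiple partitionings").
w1 = train_logistic(X[:100], y[:100])
w2 = train_logistic(X[100:], y[100:])

# Second level: a combiner trained on the first-level outputs
# (a bias column is added so the combiner can shift its threshold).
meta_X = np.column_stack([np.ones(len(y)), predict(w1, X), predict(w2, X)])
w_meta = train_logistic(meta_X, y)

stacked = (predict(w_meta, meta_X) > 0.5).astype(float)
print((stacked == y).mean())                   # training accuracy of the stack
```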
3.2. ME
The ME architecture is composed of a gating
network and several expert networks (Figure 3).
The gating network receives the vector x as
input and produces scalar outputs that are a
partition of unity at each point in the input
space. Each expert network produces an output
vector for an input vector. The gating network
provides linear combination coefficients as ver-
idical probabilities for expert networks, and
therefore the final output of the ME architecture
Table 2: Wisconsin breast cancer database: description of attributes

Attribute number   Attribute description          Minimum   Maximum   Mean   Standard deviation
1                  Clump thickness                1         10        4.44   2.82
2                  Uniformity of cell size        1         10        3.15   3.07
3                  Uniformity of cell shape       1         10        3.22   2.99
4                  Marginal adhesion              1         10        2.83   2.86
5                  Single epithelial cell size    1         10        3.23   2.22
6                  Bare nuclei                    1         10        3.54   3.64
7                  Bland chromatin                1         10        3.45   2.45
8                  Normal nucleoli                1         10        2.87   3.05
9                  Mitoses                        1         10        1.60   1.73

N = 683 observations, 239 malignant and 444 benign.
is a convex weighted sum of all the output
vectors produced by the expert networks.
Suppose that there are N expert networks in the
ME architecture. All the expert networks are
linear with a single output non-linearity that is
also referred to as ‘generalized linear’. The ith
expert network produces its output oi(x) as
a generalized linear function of the input x
(Jacobs et al., 1991; Chen et al., 1999; Hong &
Harris, 2002):
o_i(x) = f(W_i x)                    (1)

where W_i is a weight matrix and f(·) is a fixed continuous non-linearity. The gating network is also a generalized linear function, and its ith output, g(x; v_i), is the multinomial logit or
Figure 2: CNN architecture (a first-level and a second-level multilayer perceptron neural network in a stacked arrangement; hidden layer neurons h = 1, 2, ..., k and output layer neurons o = 1, 2, ..., j).
Figure 3: Architecture of the ME (expert networks 1, ..., N and a gating network, each receiving the input x, combine to produce the output o(x)).
softmax function of the intermediate variables ξ_i:

g(x; v_i) = exp(ξ_i) / Σ_{k=1}^{N} exp(ξ_k)                    (2)

where ξ_i = v_i^T x and v_i is a weight vector. The
overall output o(x) of the ME architecture is
o(x) = Σ_{k=1}^{N} g(x; v_k) o_k(x)                    (3)
The ME architecture can be given a probabil-
istic interpretation. For an input–output pair (x,
y), the values of g(x; v_i) are interpreted as the
multinomial probabilities associated with the
decision that terminates in a regressive process
that maps x to y. Once the decision has been
made, resulting in a choice of regressive process
i, the output y is then chosen from a probability
density P(y | x, W_i), where W_i denotes the set of
parameters or weight matrix of the ith expert
network in the model. Therefore, the total
probability of generating y from x is a mixture
of the probabilities of generating y from each
component density, where the mixing propor-
tions are multinomial probabilities:
P(y | x, Φ) = Σ_{k=1}^{N} g(x; v_k) P(y | x, W_k)                    (4)
where Φ is the set of all the parameters including
both expert and gating network parameters.
Based on the probabilistic model, learning in
the ME architecture is treated as a maximum
likelihood problem. Jordan and Jacobs (1994)
have proposed an expectation maximization
(EM) algorithm for adjusting the parameters of
the architecture. In this framework a number of
relatively small expert networks can be used
together with a gating network designed to
divide the global classification task into simpler
subtasks (Guler & Ubeyli, 2005b).
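The forward pass of equations (1)-(3) can be sketched directly. The dimensions, weight values and choice of the logistic function for f are illustrative assumptions, not values from the study.

```python
# Sketch of the ME forward pass: generalized-linear experts
# o_i(x) = f(W_i x) and a softmax gating network g(x; v_i) whose
# outputs form a partition of unity, giving a convex combination.
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, n_out = 8, 3, 2                 # input dim, experts, output dim
W = rng.normal(size=(n_experts, n_out, d))    # expert weight matrices W_i
V = rng.normal(size=(n_experts, d))           # gating weight vectors v_i

def me_output(x):
    # Expert outputs, eq. (1); here f is the logistic non-linearity.
    o = 1.0 / (1.0 + np.exp(-(W @ x)))        # shape (n_experts, n_out)
    # Gating, eq. (2): softmax over xi_i = v_i^T x (sums to one).
    xi = V @ x
    g = np.exp(xi - xi.max())
    g /= g.sum()
    # Overall output, eq. (3): convex weighted sum of expert outputs.
    return g @ o

x = rng.normal(size=d)
print(me_output(x))
```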
3.3. PNN
The PNN was first proposed by Specht (1990).
A single PNN is capable of handling multiclass
problems. This is opposite to the so-called one-
against-the-rest or one-per-class approach taken
by some classifiers, such as the SVM, which
decompose a multiclass classification problem
into dichotomies and each dichotomizer has to
separate a single class from all others. The
architecture of a typical PNN is shown in Figure
4. The PNN architecture is composed of many
interconnected processing units or neurons or-
ganized in successive layers. The input layer unit
does not perform any computation and simply
distributes the input to the neurons in the
pattern layer. On receiving a pattern x from the
input layer, the neuron xij of the pattern layer
computes its output:
f_ij(x) = [1 / ((2π)^{d/2} σ^d)] exp[ -(x - x_ij)^T (x - x_ij) / (2σ^2) ]                    (5)
where d denotes the dimension of the pattern
vector x, σ is the smoothing parameter and x_ij is
the neuron vector. The summation layer neu-
rons compute the maximum likelihood of pat-
tern x being classified into Ci by summing and
averaging the output of all neurons that belong
to the same class:
p_i(x) = [1 / ((2π)^{d/2} σ^d)] (1/N_i) Σ_{j=1}^{N_i} exp[ -(x - x_ij)^T (x - x_ij) / (2σ^2) ]                    (6)
where N_i denotes the total number of samples in
class Ci. If the a priori probabilities for each
class are the same, and the losses associated with
making an incorrect decision for each class are
the same, the decision layer unit classifies the
pattern x in accordance with Bayes’s decision
rule based on the output of all the summation
layer neurons:
C(x) = arg max_i [ p_i(x) ],   i = 1, 2, ..., m                    (7)
where C(x) denotes the estimated class of the
pattern x and m is the total number of classes in
the training samples (Specht, 1990; Burrascano,
1991).
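Equations (5)-(7) translate into a short classifier under the stated assumptions of equal priors and equal losses. The two-dimensional synthetic classes below are illustrative stand-ins for real training samples.

```python
# Sketch of the PNN layers: each class density p_i(x) averages Gaussian
# kernels centred on that class's training samples (eqs (5)-(6)), and
# the decision layer takes the argmax (Bayes rule, eq. (7)).
import numpy as np

def pnn_classify(x, train_by_class, sigma):
    d = x.shape[0]
    norm = (2 * np.pi) ** (d / 2) * sigma ** d
    scores = []
    for samples in train_by_class:            # samples: (N_i, d) for class C_i
        diff = samples - x
        sq = np.sum(diff * diff, axis=1)      # (x - x_ij)^T (x - x_ij)
        scores.append(np.exp(-sq / (2 * sigma ** 2)).mean() / norm)
    return int(np.argmax(scores))             # estimated class C(x)

rng = np.random.default_rng(2)
class0 = rng.normal(loc=0.0, size=(50, 2))    # samples around the origin
class1 = rng.normal(loc=3.0, size=(50, 2))    # samples around (3, 3)
print(pnn_classify(np.array([2.9, 3.1]), [class0, class1], sigma=0.5))
```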
3.4. RNN
RNNs have been used in a number of interesting
applications including associative memories,
spatiotemporal pattern classification, control,
optimization, forecasting and generalization of
pattern sequences (Petrosian et al., 2000; Shieh
et al., 2004). Fully recurrent networks use un-
constrained fully interconnected architectures
and learning algorithms that can deal with
time-varying input and/or output in non-trivial
ways. In spite of several modifications of learn-
ing algorithms to reduce the computational
expense, fully recurrent networks are still com-
plicated when dealing with complex problems.
Therefore, partially recurrent networks, whose
connections are mainly feedforward, are used
but they include a carefully chosen set of feed-
back connections. The recurrence allows the
network to remember cues from the past with-
out complicating the learning excessively. The
structure proposed by Elman (1990) is an illus-
tration of this kind of architecture. An Elman
RNN was used in this application and therefore
in the following the Elman RNN is presented.
An Elman RNN is a network which in princi-
ple is set up as a regular feedforward network.
This means that all neurons in one layer are
connected with all neurons in the next layer. An
exception is the so-called context layer which is a
special case of a hidden layer. Figure 5 shows the
architecture of an Elman RNN. The neurons in
the context layer (context neurons) hold a copy
of the output of the hidden neurons. The output
of each hidden neuron is copied into a specific
neuron in the context layer. The value of the
context neuron is used as an extra input signal
for all the neurons in the hidden layer one time
step later. Therefore the Elman network has an
explicit memory of one time lag (Elman, 1990).
Similar to a regular feedforward neural net-
work, the strength of all connections between
neurons are indicated with a weight. Initially, all
weight values are chosen randomly and are
optimized during the stage of training. In an
Elman network, the weights from the hidden
layer to the context layer are set to one and are
fixed because the values of the context neurons
have to be copied exactly. Furthermore, the
initial output weights of the context neurons
Figure 4: Architecture of the PNN (input layer, pattern layer, summation layer computing the p_i(x), and decision layer producing C(x)).
are equal to half the output range of the other
neurons in the network.
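One time step of the Elman network described above can be sketched as follows; the layer sizes and random weights are assumptions for illustration, and the context copy is implemented as the fixed weight-one connection the text describes.

```python
# Sketch of one Elman RNN step: the context layer holds an exact copy
# of the previous hidden activations and feeds them back as extra
# inputs to the hidden layer one time step later.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hidden, n_out = 8, 20, 2
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def elman_step(x, context):
    # Hidden layer sees the current input and the one-step-old copy.
    h = np.tanh(W_xh @ x + W_ch @ context)
    y = 1.0 / (1.0 + np.exp(-(W_hy @ h)))
    return y, h           # the new h is copied into the context layer

context = np.zeros(n_hidden)                  # context starts empty
for x in rng.normal(size=(5, n_in)):          # a short input sequence
    y, context = elman_step(x, context)
print(y)
```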
3.5. SVM
The SVM proposed by Vapnik (1995) has been
studied extensively for classification, regression
and density estimation. Figure 6 shows the
architecture of the SVM. The SVM maps the
input patterns into a higher dimensional feature
space through some non-linear mapping chosen
a priori. A linear decision surface is then con-
structed in this high-dimensional feature space.
Thus, the SVM is a linear classifier in the
Figure 5: A schematic representation of an Elman RNN (input, hidden, context and output layers); z^{-1} represents a one-time-step delay unit.
Figure 6: Architecture of the SVM (kernel units K(·), weights w, a bias b and a summation producing the ±1 output; N is the number of support vectors).
parameter space, but it becomes a non-linear
classifier as a result of the non-linear mapping of
the space of the input patterns into the high-
dimensional feature space. Training the SVM is
a quadratic optimization problem. The con-
struction of a hyperplane w^T x + b = 0 (w is the
vector of hyperplane coefficients, b is a bias
term) so that the margin between the hyperplane
and the nearest point is maximized can be
posed as the quadratic optimization problem.
The SVM has been shown to provide
high generalization ability. For a two-class
problem, assuming the optimal hyperplane in
the feature space is generated, the classification
decision of an unknown pattern y will be made
based on
f(y) = sgn[ Σ_{i=1}^{N} a_i y_i K(x_i, y) + b ]                    (8)

where a_i ≥ 0, i = 1, 2, ..., N, are non-negative Lagrange multipliers that satisfy Σ_{i=1}^{N} a_i y_i = 0, {y_i | y_i ∈ {-1, +1}}_{i=1}^{N} are the class labels of the training patterns {x_i | x_i ∈ R^N}_{i=1}^{N}, and K(x_i, y) for i = 1, 2, ..., N represents a symmetric positive definite kernel function that defines an inner product in the feature space.
This shows that f(y) is a linear combination of
the inner products or kernels. The kernel func-
tion enables the operations to be carried out in
the input space rather than in the high-dimen-
sional feature space. Some typical examples of
kernel functions are K(u, v) = v^T u (linear SVM), K(u, v) = (v^T u + 1)^n (polynomial SVM of degree n), K(u, v) = exp(-||u - v||^2 / (2σ^2)) (radial basis function SVM), and K(u, v) = tanh(κ v^T u + θ) (two-layer neural SVM), where σ, κ and θ are constants (Cortes & Vapnik, 1995; Vapnik,
1995). However, a proper kernel function
for a certain problem is dependent on the
specific data and so far there is no established method for choosing the kernel function. In
this study, the choice of the kernel function
was studied empirically and optimal results
were achieved using the radial basis kernel
function.
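The decision rule of equation (8) with the radial basis kernel the study selected can be sketched as below. The support vectors, multipliers a_i, labels y_i and bias b are illustrative assumptions, not values trained on the paper's data.

```python
# Sketch of the SVM decision function f(y) = sgn[sum a_i y_i K(x_i, y) + b]
# with the radial basis function kernel K(u, v) = exp(-||u - v||^2 / 2s^2).
import numpy as np

def rbf_kernel(u, v, sigma=0.3):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def svm_decide(y_new, support_vectors, alphas, labels, b, sigma=0.3):
    s = sum(a * yi * rbf_kernel(xi, y_new, sigma)
            for a, yi, xi in zip(alphas, labels, support_vectors))
    return int(np.sign(s + b))                # +1 or -1 class decision

sv = np.array([[0.0, 0.0], [1.0, 1.0]])       # two assumed support vectors
print(svm_decide(np.array([0.1, 0.1]), sv, [1.0, 1.0], [+1, -1], 0.0))
```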
4. Results
The classifiers proposed for clinical decision-
making were implemented by using the MAT-
LAB software package (MATLAB version 7.0
with neural networks toolbox). The attributes of
diabetes and breast cancer detailed in Tables 1
and 2 were used as the inputs of the classifiers.
The key design decisions for the neural networks
used in classification are the architecture and the
training process. The architectures of the CNN,
ME, PNN, RNN and SVM used for prediction
of diabetes and breast cancer are shown in
Figures 2–6, respectively. The adequate func-
tioning of neural networks depends on the sizes
of the training set and test set. To comparatively
evaluate the performance of the classifiers, all
the classifiers presented in this study were
trained by the same training data set and tested
with the evaluation data set. There are a total of
768 records in the Pima Indians diabetes data-
base, of which 268 are diabetics and 500 are
non-diabetics. In the classifiers, 284 of 768
records were used for training and the rest for
testing. The training set consisted of 136 dia-
betics and 148 non-diabetics. The testing set
consisted of 132 diabetics and 352 non-
diabetics. There are a total of 683 records in the
Wisconsin breast cancer database, of which 444
are benign records and 239 are malignant re-
cords. In the classifiers, 250 of 683 records were
used for training and the rest for testing. The
training set consisted of 80 malignant records
and 170 benign records. The testing set consisted
of 159 malignant records and 274 benign re-
cords.
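The diabetes split described above can be sketched as follows; the arrays here are synthetic placeholders for the real database records, with the same counts as in the study.

```python
# Sketch of the Pima training/testing split: 284 training records
# (136 diabetics, 148 non-diabetics) and the remaining 484 for testing.
import numpy as np

rng = np.random.default_rng(4)
diabetics = rng.normal(size=(268, 8))         # 268 diabetic records
non_diabetics = rng.normal(size=(500, 8))     # 500 non-diabetic records

train_X = np.vstack([diabetics[:136], non_diabetics[:148]])
train_y = np.array([1] * 136 + [0] * 148)
test_X = np.vstack([diabetics[136:], non_diabetics[148:]])
test_y = np.array([1] * 132 + [0] * 352)
print(train_X.shape, test_X.shape)
```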
The training algorithm of the SVMs, based on
quadratic programming, incorporates several
optimization techniques such as decomposition
and caching. The quadratic programming prob-
lem in the SVMs was solved by using the
MATLAB optimization toolbox. For the imple-
mentation of the SVMs with the radial basis
kernel functions, one has to assume a value for σ. The optimal σ can only be found by systematically varying its value in the different training
sessions. To do this, the support vectors were
extracted from the training data file with an
assumed σ value. After the support vectors had
been found and SVMs had been constructed, the
model was applied to a third of the evaluation
data set to compute the misclassification rate.
The σ value was varied between 0.1 and 0.6, at an interval of 0.1. Values of σ = 0.3 for the diabetes database and σ = 0.4 for the breast
cancer database resulted in the minimum mis-
classification rate and were thus chosen. The
generalization ability of the SVMs is controlled
by two different factors: the training error rate
and the capacity of the learning machine mea-
sured by its Vapnik–Chervonenkis (VC) dimen-
sion (Vapnik, 1995). The smaller the VC
dimension of the function set of the learning
machine, the larger the value of the training
error rate. The tradeoff between the complexity
of the decision rule and the training error rate
can be controlled by changing a parameter C
(Cortes & Vapnik, 1995) in the SVMs. The
SVMs were trained for different C values until
the best results were obtained: C = 90 for the diabetes database and C = 80 for the breast
cancer database in the testing procedure.
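The σ search described above can be sketched as a small grid search. Here `train_svm` and `misclassification_rate` are hypothetical stand-ins for the MATLAB-based training and evaluation routines, and the toy error curve is minimized near σ = 0.3 to mimic the diabetes result.

```python
# Sketch of the sigma selection: sigma is varied from 0.1 to 0.6 in
# steps of 0.1, keeping the value with the lowest misclassification
# rate on the held-out evaluation data.
import numpy as np

def select_sigma(train_svm, misclassification_rate, train_data, eval_data):
    best_sigma, best_rate = None, np.inf
    for sigma in np.arange(0.1, 0.7, 0.1):
        model = train_svm(train_data, sigma=sigma)
        rate = misclassification_rate(model, eval_data)
        if rate < best_rate:
            best_sigma, best_rate = sigma, rate
    return best_sigma, best_rate

# Toy stand-ins: the "model" is just sigma, and the error curve has
# its minimum near sigma = 0.3.
sel, rate = select_sigma(lambda d, sigma: sigma,
                         lambda m, d: abs(m - 0.3),
                         train_data=None, eval_data=None)
print(round(sel, 1))
```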
The Elman network can be trained with
gradient descent backpropagation and optimi-
zation methods, similar to regular feedforward
neural networks (Pineda, 1987). Backpropaga-
tion has some problems for many applications.
The algorithm is not guaranteed to find the
global minimum of the error function since
gradient descent may get stuck in local minima,
where it may remain indefinitely. In addition to
this, long training sessions are often required in
order to find an acceptable weight solution
because of the well-known difficulties inherent
in gradient descent optimization (Haykin,
1994). Therefore many variations to improve
the convergence of the backpropagation were
proposed. Optimization methods such as sec-
ond-order methods (conjugate gradient, quasi-
Newton, Levenberg–Marquardt) have also been
used for neural network training in recent years.
The Levenberg–Marquardt algorithm combines
the best features of the Gauss–Newton techni-
que and the steepest-descent algorithm but
avoids many of their limitations (Battiti, 1992;
Hagan & Menhaj, 1994). Therefore, the RNNs
implemented in this study were trained by the
Levenberg–Marquardt algorithm.
There is an outstanding issue associated with
the PNNs concerning network structure deter-
mination, i.e. determining the network size, the
locations of pattern layer neurons as well as the
value of the smoothing parameter. The PNNs
had pattern layer neurons, two summation
layer neurons, each corresponding to one of
two classes, and one output layer neuron to
make a two-class Bayesian decision. The objec-
tive is to select representative pattern layer
neurons from the training samples. The output
of a summation layer neuron becomes a linear
combination of the outputs of pattern layer
neurons. Subsequently, an orthogonal algo-
rithm was used to select pattern layer neurons.
As in the SVM training, the smoothing parameter σ was determined based on the minimum
misclassification rate computed from the partial
evaluation data set. The minimum misclassification rates were attained at σ = 0.04 (for the diabetes database) and σ = 0.03 (for the breast
cancer database).
The EM algorithm can be extended to provide
an effective training mechanism for the ME
based on a Gaussian probability assumption.
Although originally the model structure is pre-
determined and the training algorithm is based
on the Gaussian probability assumption for
each expert model output, the ME framework
is a powerful concept that can be extended to a
wide variety of applications including medical
diagnostic decision support system applications
due to numerous inherent advantages such as
the following. (i) A global model can be decom-
posed into a set of simple local models, from
which controller design is straightforward. Each
model can represent a different data source with
an associated state estimator/predictor. In this
case the ME system can be viewed as a data
fusion algorithm. (ii) The local models operate
independently but provide output information that can be strongly correlated with
each other, so that the overall system perfor-
mance can be enhanced in terms of reliability or
fault tolerance. (iii) The global output of the ME
system is derived as a convex combination of the
outputs from a set of N experts, in which the
overall system predictive performance is gener-
ally superior to that of any of the individual
experts.
Two sets of neural networks were trained for
the first-level models in the CNNs, since there
were two possible outcomes. Networks in each
set were trained so that they were likely to be
more accurate for one type of disorder than the
other disorder. The network architecture was
the MLPNN and each network had input neu-
rons equal to the dimension of attributes of the
records in the database (feature vector). Sam-
ples with target outputs were given the binary
target values of (0, 1) and (1, 0). The second-
level neural networks were trained to combine
the predictions of the first-level networks. The
second-level networks had four inputs which
corresponded to the outputs of the two groups
of first-level networks. The targets for the sec-
ond-level networks were the same as the targets
of the original data. In order to compare the
performance of the different classifiers for the
same classification problems, MLPNNs, which
are the most commonly used feedforward neural
networks, were also implemented. Different experiments were performed during implementation of these classifiers and the number of
hidden neurons was determined by taking into
consideration the classification accuracies. In
the hidden layers and the output layers, the
activation function was the sigmoidal function.
The sigmoidal function with a range between
zero and one introduces two important proper-
ties. First, the sigmoid is non-linear, allowing
the network to perform complex mappings of
input to output vector spaces, and second it is
continuous and differentiable, which allows the
gradient of the error to be used in updating the
weights. The Levenberg–Marquardt algorithm
was used for training the CNNs and MLPNNs.
Table 3 defines the network parameters of the
classifiers implemented in this research.
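The two-level combining scheme described above can be sketched with scikit-learn MLPs. This is a stand-in, not the paper's setup: scikit-learn trains with Adam or L-BFGS rather than Levenberg–Marquardt, the data set is synthetic, and the two first-level networks here differ only in their random initialization instead of each being biased toward one disorder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for an 8-attribute diagnostic data set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# First level: two networks trained on the same records (differing
# only in initialization here, unlike the paper's disorder-biased sets).
net_a = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                      random_state=1).fit(X_tr, y_tr)
net_b = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                      random_state=2).fit(X_tr, y_tr)

# Second level: four inputs = the two class probabilities produced by
# each first-level network; the targets are the original labels.
Z_tr = np.hstack([net_a.predict_proba(X_tr), net_b.predict_proba(X_tr)])
Z_te = np.hstack([net_a.predict_proba(X_te), net_b.predict_proba(X_te)])
combiner = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                         random_state=3).fit(Z_tr, y_tr)
print("second-level test accuracy:", combiner.score(Z_te, y_te))
```

The key structural point, which this sketch preserves, is that the second-level network never sees the raw attributes: its inputs are the predictions of the first-level networks, exactly as in stacked generalization (Wolpert, 1992).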
Classification results of the classifiers were
displayed as a confusion matrix. In a confusion
matrix, each cell contains the raw number of
exemplars classified for the corresponding com-
bination of desired and actual network outputs.
The confusion matrices showing the classifica-
tion results of the classifiers implemented for
prediction of diabetes and breast cancer are
given in Tables 4 and 5. From these matrices
one can tell the frequency with which a record is
misclassified.
The test performance of the classifiers can be
determined by the computation of specificity,
sensitivity and total classification accuracy,
which are defined as follows.
Specificity: number of true negative decisions / number of actual negative cases
Sensitivity: number of true positive decisions / number of actual positive cases
Total classification accuracy: number of correct decisions / total number of cases
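These three measures can be computed directly from the cells of a 2x2 confusion matrix, as in this short sketch using the SVM row of Table 4:

```python
def metrics(tn, fp, fn, tp):
    """Specificity, sensitivity and total accuracy from a 2x2 confusion matrix."""
    specificity = tn / (tn + fp)                # true negatives / actual negatives
    sensitivity = tp / (tp + fn)                # true positives / actual positives
    accuracy = (tn + tp) / (tn + fp + fn + tp)  # correct decisions / all cases
    return specificity, sensitivity, accuracy

# SVM row of Table 4: 351 true non-diabetics, 1 false positive,
# 1 false negative, 131 true diabetics.
spec, sens, acc = metrics(tn=351, fp=1, fn=1, tp=131)
print(f"specificity {spec:.2%}, sensitivity {sens:.2%}, accuracy {acc:.2%}")
```

This reproduces the SVM diabetes figures reported in Table 6 (99.72%, 99.24% and 99.59%).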
In order to determine the performances of the
classifiers used for the prediction of diabetes
Table 3: Network parameters of the classifiers

Classifier   Diabetes data set              Breast cancer data set
SVM          8, 16, 2 (a)                   9, 12, 2 (a)
RNN          8, 20r, 2 (b)                  9, 15r, 2 (b)
PNN          8, 24, 2, 1 (c)                9, 22, 2, 1 (c)
ME           8, 20, 2 (d); 8, 20, 2 (e)     9, 15, 2 (d); 9, 15, 2 (e)
CNN          8, 25, 4 (f); 4, 25, 2 (g)     9, 20, 4 (f); 4, 25, 2 (g)
MLPNN        8, 20, 20, 2 (h)               9, 15, 15, 2 (h)

(a) Design of SVMs: number of input neurons, support vectors, output neurons, respectively.
(b) Design of RNNs: number of input neurons, recurrent neurons in the hidden layer, output neurons, respectively.
(c) Design of PNNs: number of input neurons, pattern layer neurons, summation layer neurons, output layer neurons, respectively.
(d) Design of expert networks: number of input neurons, hidden neurons, output neurons, respectively.
(e) Design of gating network: number of input neurons, hidden neurons, output neurons, respectively.
(f) Design of first-level network: number of input neurons, hidden neurons, output neurons, respectively.
(g) Design of second-level network: number of input neurons, hidden neurons, output neurons, respectively.
(h) Design of neural network: number of input neurons, hidden neurons in the first hidden layer, hidden neurons in the second hidden layer, output neurons, respectively.
and breast cancer, the classification accuracies
(specificity, sensitivity, total classification accu-
racy) on the test sets are presented in Table 6.
5. Discussion
Based on the results of the present study and the
studies in the area of computational intelligence
existing in the literature (classification of breast
cancer and diabetes data sets), the following
observations can be made.
1. Previous research in this area has been
undertaken by various researchers. Wu
et al. (1993) used an ANN to learn from
133 instances each containing 43 mammo-
graphic features rated between 0 and 10 by
a mammographer. The ANN was trained
with the backpropagation algorithm using
10 hidden nodes, and a single output node
was trained to produce 1 for malignant
and 0 for benign cases. The performance
of the ANN was found to be competitive
to the domain expert, and after a consider-
able amount of feature selection the
performance of the ANN improved and
significantly outperformed the domain expert.
2. Another use of backpropagation was un-
dertaken by Floyd et al. (1994) who used
eight input parameters: mass size and mar-
gin, asymmetric density, architectural dis-
tortion, calcification number, morphology,
density and distribution. After extensive
experiments with backpropagation over
their limited data set of 260 cases, they
achieved a classification accuracy of 50%.
3. Wilding et al. (1994) suggested the use of
backpropagation; they followed a similar
backpropagation approach to the previous
references (Wu et al., 1993; Floyd et al.,
1994) but with different input sets derived
from a group of blood tests. However, with
104 instances and 10 inputs, it seems that
their ANN failed to perform well.
4. Backpropagation suffers the disadvantage
of being easily trapped in a local minimum.
Therefore, Fogel et al. (1995) used an
evolutionary programming approach to
train the ANN to overcome the disadvan-
tage of backpropagation. They used a
population of 500 networks and evolved
the population for 400 generations, there-
fore generating 20 000 potential networks.
The approach was tested on the Wisconsin
data set (Wolberg & Mangasarian, 1990)
which is used in this paper. They managed
to achieve a significant result with 98% of
the test cases correctly classified. Apart
from their few trials and the dependence
of their approach on a predefined network
Table 4: Confusion matrices of the classifiers used for prediction of diabetes

Classifier   Desired result    Output: Non-diabetics   Output: Diabetics
SVM          Non-diabetics     351                     1
             Diabetics         1                       131
RNN          Non-diabetics     346                     3
             Diabetics         6                       129
PNN          Non-diabetics     347                     3
             Diabetics         5                       129
ME           Non-diabetics     345                     2
             Diabetics         7                       130
CNN          Non-diabetics     342                     4
             Diabetics         10                      128
MLPNN        Non-diabetics     322                     12
             Diabetics         30                      120
Table 5: Confusion matrices of the classifiers used for prediction of breast cancer

Classifier   Desired result       Output: Benign records   Output: Malignant records
SVM          Benign records       273                      1
             Malignant records    1                        158
RNN          Benign records       271                      3
             Malignant records    3                        156
PNN          Benign records       270                      4
             Malignant records    4                        155
ME           Benign records       271                      2
             Malignant records    3                        157
CNN          Benign records       268                      5
             Malignant records    6                        154
MLPNN        Benign records       253                      14
             Malignant records    21                       145
architecture, their approach performed
very well compared to the previous studies.
5. Setiono (1996) used rule extraction from
an ANN algorithm to extract useful rules
that can predict breast cancer from the
Wisconsin data set (Wolberg & Mangasarian,
1990). He first needed to train an
ANN using backpropagation and
achieved an accuracy level on the test data
of approximately 94%. After applying the
rule extraction technique, the accuracy of
the extracted rule set did not change.
Setiono (2000) used feature selection be-
fore training the ANN. The new rule sets
had an average accuracy of more than
96%. This is an improvement compared
to the initial results.
6. Furundzic et al. (1998) presented another
backpropagation ANN attempt where
they used 47 input features, and after the
use of some heuristics to determine the
number of hidden units, they used five
hidden units. With 200 instances and after
a significant amount of feature selection,
they reduced the number of input features
to 29 while maintaining the same classifi-
cation accuracy.
7. Pendharkar et al. (1999) presented a com-
parison between data envelopment analy-
sis and ANNs. They found that the ANN
approach was significantly better than the
data envelopment analysis approach, with
around 25% improvement in classification
accuracy.
8. Abbass (2002) presented an evolutionary
ANN approach based on the Pareto differ-
ential evolution algorithm augmented with
local search for the prediction of breast
cancer. The study showed empirically that
the proposed approach had better general-
ization than previous approaches, with
much lower computational cost. The aver-
age accuracy obtained for the breast can-
cer data set was 98.1%.
9. The result of a study by Shanker (1996)
that used neural networks to predict the
onset of diabetes in Pima Indian women
showed that the neural network is a viable
approach to classification.
10. Park and Edington (2001) presented an
approach that uses a sequential MLPNN
with backpropagation learning and an ex-
plicit model of time-varying inputs along
with the sequentially obtained prediction
probability, which was obtained by em-
bedding a multivariate logistic function
for consecutive years. The approach out-
performed the baseline classification and
regression models in terms of sensitivity
(86.04%) for test data.
11. The results of the present study indicated
excellent performance of the SVMs on the
classification of the Pima Indians diabetes
database (total classification accuracy
99.59%) and the Wisconsin breast cancer
database (total classification accuracy
99.54%).
Table 6: The classification accuracies of the classifiers

                              Diabetes                                                Breast cancer
Classifier   Specificity (%)  Sensitivity (%)  Total classification accuracy (%)      Specificity (%)  Sensitivity (%)  Total classification accuracy (%)
SVM          99.72            99.24            99.59                                  99.64            99.37            99.54
RNN          98.30            97.73            98.14                                  98.91            98.11            98.61
PNN          98.58            97.73            98.35                                  98.54            97.48            98.15
ME           98.01            98.48            98.14                                  98.91            98.74            98.85
CNN          97.16            96.97            97.11                                  97.81            96.86            97.46
MLPNN        91.48            90.91            91.32                                  92.34            91.19            91.92

6. Conclusion

The purpose of the present research was to
investigate the accuracy of five types of
automated diagnostic systems, namely CNNs, ME,
PNNs, RNNs and SVMs, for clinical decision-
making. The performance of these classifiers
was then compared with one another and with that of
the MLPNN. These classifiers were trained on
the attributes of each record in the Pima Indians
diabetes database and the Wisconsin breast
cancer database. The classification results and
the values of statistical parameters indicated
that the SVMs had considerable success. The
SVM classifiers showed excellent performance,
since they map the features into a higher-dimensional
space. Besides this, the RNN, PNN, ME
and CNN classifiers provided encouraging re-
sults. The performance of the MLPNN was not
as high as the other classifiers. This may be
attributed to several factors including the train-
ing algorithms, estimation of the network para-
meters and the scattered and mixed nature of the
features. The results obtained confirmed the
validity of the classifiers for clinical decision-
making.
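The higher-dimensional mapping credited to the SVMs above can be illustrated briefly. This sketch uses scikit-learn's copy of the Wisconsin diagnostic breast cancer data (a later variant of the Wisconsin data, not necessarily the records used in this paper), and the RBF kernel is an assumption, since the paper does not specify its kernel.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Wisconsin diagnostic breast cancer data as shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the features into a very
# high-dimensional space in which the classes become separable.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

Even this default-parameter sketch reaches accuracy well above the MLPNN figures in Table 6, consistent with the conclusion's observation about kernel-based feature mapping.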
References
ABBASS, H.A. (2002) An evolutionary artificial neural networks approach for breast cancer diagnosis, Artificial Intelligence in Medicine, 25, 265–281.
BASHEER, I.A. and M. HAJMEER (2000) Artificial neural networks: fundamentals, computing, design, and application, Journal of Microbiological Methods, 43 (1), 3–31.
BATTITI, R. (1992) First- and second-order methods for learning: between steepest descent and Newton's method, Neural Computation, 4, 141–166.
BESSER, G.M., H.J. BODANSKY and A.G. CUDWORTH (1988) Clinical Diabetes: an Illustrated Text, London: Gower Medical.
BURRASCANO, P. (1991) Learning vector quantization for the probabilistic neural network, IEEE Transactions on Neural Networks, 2 (4), 458–461.
CHAUDHURI, B.B. and U. BHATTACHARYA (2000) Efficient training and improved performance of multilayer perceptron in pattern classification, Neurocomputing, 34, 11–27.
CHEN, K., L. XU and H. CHI (1999) Improved learning algorithms for mixture of experts in multiclass classification, Neural Networks, 12 (9), 1229–1252.
CORTES, C. and V. VAPNIK (1995) Support vector networks, Machine Learning, 20 (3), 273–297.
ELMAN, J.L. (1990) Finding structure in time, Cognitive Science, 14 (2), 179–211.
FLOYD, C.E., J.Y. LO, A.J. YUN, D.C. SULLIVAN and P.J. KORNGUTH (1994) Prediction of breast cancer malignancy using an artificial neural network, Cancer, 74, 2944–2998.
FOGEL, D.B., E.C. WASSON and E.M. BOUGHTON (1995) Evolving neural networks for detecting breast cancer, Cancer Letters, 96 (1), 49–53.
FURUNDZIC, D., M. DJORDJEVIC and A.J. BEKIC (1998) Neural networks approach to early breast cancer detection, Journal of Systems Architecture, 44 (8), 617–633.
GULER, I. and E.D. UBEYLI (2003) Detection of ophthalmic artery stenosis by least-mean squares backpropagation neural network, Computers in Biology and Medicine, 33 (4), 333–343.
GULER, I. and E.D. UBEYLI (2005a) ECG beat classifier designed by combined neural network model, Pattern Recognition, 38 (2), 199–208.
GULER, I. and E.D. UBEYLI (2005b) A mixture of experts network structure for modelling Doppler ultrasound blood flow signals, Computers in Biology and Medicine, 35 (7), 565–582.
HAGAN, M.T. and M.B. MENHAJ (1994) Training feedforward networks with the Marquardt algorithm, IEEE Transactions on Neural Networks, 5 (6), 989–993.
HAYKIN, S. (1994) Neural Networks: A Comprehensive Foundation, New York: Macmillan.
HONG, X. and C.J. HARRIS (2002) A mixture of experts network structure construction algorithm for modelling and control, Applied Intelligence, 16 (1), 59–69.
ITCHHAPORIA, D., P.B. SNOW, R.J. ALMASSY and W.J. OETGEN (1996) Artificial neural networks: current status in cardiovascular medicine, Journal of the American College of Cardiology, 28 (2), 515–521.
JACOBS, R.A., M.I. JORDAN, S.J. NOWLAN and G.E. HINTON (1991) Adaptive mixtures of local experts, Neural Computation, 3 (1), 79–87.
JEREZ-ARAGONES, J.M., J.A. GOMEZ-RUIZ, G. RAMOS-JIMENEZ, J. MUNOZ-PEREZ and E. ALBA-CONEJO (2003) A combined neural network and decision trees model for prognosis of breast cancer relapse, Artificial Intelligence in Medicine, 27 (1), 45–63.
JORDAN, M.I. and R.A. JACOBS (1994) Hierarchical mixture of experts and the EM algorithm, Neural Computation, 6 (2), 181–214.
KORDYLEWSKI, H., D. GRAUPE and K. LIU (2001) A novel large-memory neural network as an aid in medical diagnosis applications, IEEE Transactions on Information Technology in Biomedicine, 5 (3), 202–209.
KWAK, N. and C.-H. CHOI (2002) Input feature selection for classification problems, IEEE Transactions on Neural Networks, 13 (1), 143–159.
LIM, C.P., R.F. HARRISON and R.L. KENNEDY (1997) Application of autonomous neural network systems to medical pattern classification tasks, Artificial Intelligence in Medicine, 11, 215–239.
MILLER, A.S., B.H. BLOTT and T.K. HAMES (1992) Review of neural network applications in medical imaging and signal processing, Medical and Biological Engineering and Computing, 30, 449–464.
MOBLEY, B.A., E. SCHECHTER, W.E. MOORE, P.A. MCKEE and J.E. EICHNER (2000) Predictions of coronary artery stenosis by artificial neural network, Artificial Intelligence in Medicine, 18, 187–203.
PARK, J. and D.W. EDINGTON (2001) A sequential neural network model for diabetes prediction, Artificial Intelligence in Medicine, 23, 277–293.
PENDHARKAR, P.C., J.A. RODGER, G.J. YAVERBAUM, N. HERMAN and M. BENNER (1999) Association, statistical, mathematical and neural approaches for mining breast cancer patterns, Expert Systems with Applications, 17, 223–232.
PETROSIAN, A., D. PROKHOROV, R. HOMAN, R. DASHEIFF and D.D. WUNSCH II (2000) Recurrent neural network based prediction of epileptic seizures in intra- and extra-cranial EEG, Neurocomputing, 30, 201–218.
PINEDA, F.J. (1987) Generalization of back-propagation to recurrent neural networks, Physical Review Letters, 59 (19), 2229–2232.
SETIONO, R. (1996) Extracting rules from pruned neural networks for breast cancer diagnosis, Artificial Intelligence in Medicine, 8 (1), 37–51.
SETIONO, R. (2000) Generating concise and accurate classification rules for breast cancer diagnosis, Artificial Intelligence in Medicine, 18 (3), 205–219.
SHANKER, M.S. (1996) Using neural networks to predict the onset of diabetes mellitus, Journal of Chemical Information and Computer Sciences, 36, 35–41.
SHIEH, J.-S., C.-F. CHOU, S.-J. HUANG and M.-C. KAO (2004) Intracranial pressure model in intensive care unit using a simple recurrent neural network through time, Neurocomputing, 57, 239–256.
SPECHT, D.F. (1990) Probabilistic neural networks, Neural Networks, 3 (1), 109–118.
TAFEIT, E. and G. REIBNEGGER (1999) Artificial neural networks in laboratory medicine and medical outcome prediction, Clinical Chemistry and Laboratory Medicine, 37 (9), 845–853.
UBEYLI, E.D. and I. GULER (2003) Neural network analysis of internal carotid arterial Doppler signals: predictions of stenosis and occlusion, Expert Systems with Applications, 25 (1), 1–13.
UBEYLI, E.D. and I. GULER (2005) Feature extraction from Doppler ultrasound signals for automated diagnostic systems, Computers in Biology and Medicine, 35 (9), 735–764.
VAPNIK, V. (1995) The Nature of Statistical Learning Theory, New York: Springer.
WEST, D. and V. WEST (2000) Model selection for a medical diagnostic decision support system: a breast cancer detection case, Artificial Intelligence in Medicine, 20 (3), 183–204.
WILDING, P., M.A. MORGAN, A.E. GRYGOTIS, M.A. SHOFFNER and E.F. ROSATO (1994) Application of backpropagation neural networks to diagnosis of breast and ovarian cancer, Cancer Letters, 77, 145–153.
WOLBERG, W.H. and O.L. MANGASARIAN (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proceedings of the National Academy of Sciences, 87, 9193–9196.
WOLPERT, D.H. (1992) Stacked generalization, Neural Networks, 5, 241–259.
WU, Y.Z., M.L. GIGER, K. DOI, C.J. VYBORNY, R.A. SCHMIDT and C.E. METZ (1993) Artificial neural networks in mammography: application to decision making in the diagnosis of breast cancer, Radiology, 187, 81–87.
The author
Elif Derya Ubeyli
Elif Derya Ubeyli received her first degree and
an MSc in electronic engineering from Cukur-
ova University, Turkey, and her PhD in electro-
nics and computer technology from Gazi
University. She is an associate professor in the
Department of Electrical and Electronics Engi-
neering at TOBB Economics and Technology
University, Ankara. Her interest areas include
biomedical signal processing, neural networks
and artificial intelligence. She has written more
than 75 articles on biomedical engineering.